oui, this may be a small step for yousers but it is definitely a big step for vvvv’s codebase: advancing it by about 10 years and letting it finally arrive in ~2007 (yep, still some more to catch up..). so what happened? vvvv is now fully unicorn..ah unicode. from highest to lowest bit.
good to know
for most of you this will not change anything except that you don’t have to deal with UTF8 vs. ANSI in IOBoxes or on specific nodes (e.g. Text (EX9)) anymore. as from now on there is only unicorn..code. that is: inside of vvvv.
when getting strings into vvvv you may have to specify an encoding. for those special cases the Reader/Writer nodes got an Encodingpin (visible in inspektor only). its default should work for you in 99% of all cases. for the Reader (File) node the default setting of Auto will work if the file is encoded in the current system codepage or UTF8, else you’ll have to choose the specific codepage manually. for the Writer (File) node the Auto setting means it will write files as UTF8.
the deal
doing those changes under the hood caused quite a stir in the codebase and while our tests show all green we’re still a bit cautious with merging those changes into our main alpha-branch. therefore we’re asking you to give this a ride with your patches using the latest unicorn-build from: unicorn-builds. don’t forget the suitable addonpack and run your patches with it. things that would be interesting to hear tested are:
- boygrouping
- networking (osc,..)
- arduinoing (rs232,..)
- file reading/writing
- general string-heavy patches
nota bene:
- saving a patch with unicorn-characters in alphaXE2 will not let you open the same patch in older vvvv releases. if you still need to do so there is a way… bug us in the alpha-forum for more information.
- MySQL nodes had to be removed (for now)
now please give us a quick feedback in the comments if that fukcs everything up for you or you’d say it basically works. if you find a specific bugger -> alpha forum.
Comments:
helo herbst, thanks for your great report again. you brought up a good point, here is the thing:
vvvv is misusing strings as arrays of bytes. this was fine/convenient in betas<=28.1 where a string in fact internally was just an array of bytes.
now vvvv strings support unicode (internally with utf16 encoding). that’s great when using strings for handling text (which is what strings are actually for) but makes it more tricky when using strings to handle binary data.
so now in vvvv we have to distinguish:
1) text-strings
2) binary-strings
ad 1) this is what reader and writer nodes got the Encodingpin for. they will default to UTF8 now, but when loading old patches the setting will be “System Default” (just for backwards compatibility). nothing more to worry about.
ad 2) that’s your first example when loading a .jpg (binary file). you’ll have to use an 8bit encoding like Windows-1252, for example. when using that same 8bit encoding for reading and writing your example will work.
same for your second example: set the writer’s Encodingpin to Windows-1252 and it will work.
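To illustrate why a single-byte encoding lets binary data survive a trip through a string, here is a minimal Python sketch (an analogy, not vvvv code). It assumes ISO-8859-1 (latin-1) as the 8-bit encoding, because Python's own cp1252 codec leaves five byte values undefined and would reject them; `payload` is a made-up stand-in for the contents of a binary file.

```python
# Every possible byte value, standing in for the contents of a binary file.
payload = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding and re-encoding loses nothing.
as_text = payload.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# UTF-8 cannot interpret stray high bytes; with errors="replace" they turn
# into U+FFFD, and the original bytes cannot be recovered afterwards.
lossy = payload.decode("utf-8", errors="replace").encode("utf-8")

print(round_tripped == payload)
print(lossy == payload)
```

The first comparison holds and the second does not, which is the difference between reading a .jpg with an 8-bit encoding and reading it as UTF8.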
there are some more details..which i spare us here (unless anyone hits them).
that said it turns out it will make sense to introduce a new data type “binary” and nodes like AsByte (Value) (similar to SpellValue), AsValue (Byte) (similar to Ord), AsByte (String), AsString (Byte), … and have a Reader/Writer (Byte) and have UDP/TCP/Rs232..nodes deal with “binary” directly instead of strings. but more on that later…
@jens.a.e: in case you read this: please wait for this with adapting your node to unicorn.
sorry for not ranting earlier.. i completely agree with herbst here..
so you are saying the beauty of misusing strings as arrays of bytes (and the didactical beauty of having the difference between the two only in the visualization) is for now changed into an error-prone and subtle conversion step at all input and output nodes for strings? this doesn’t really sound like an advantage to me.
and without the Byte node type (and the bunch of duplicated nodes like +(byte)), the unicorn update will mostly disallow any clean binary data manipulation, right?
this surely sounds like a showstopper, right?
so please sell me some of the advantages of the whole thing - unicode with utf8 was supported for ages.. so what’s the big plus?
ok, i’ll try again.
text-strings: the (big) plus is that when dealing with strings in most cases you will no longer have to be aware of the existence of encodings. unicorn is now the first-class citizen in vvvv. only when dealing with legacy-ansi strings you’ll have to deal with encodings at all. so less fiddling and conversions in general.
binary-strings: our understanding is that you’ll not notice a difference unless you’re doing strange things like loading binary data from disk with a non-8bit-ansi encoding.
so basically we’re only changing the default-encoding for the file reader/writer nodes (while keeping the default in old patches as mentioned above) as in our endless quest for the best defaults we understand that the modern default for string encoding is utf8, no longer 8-bit ansi. and file reader/writer are foremost string-handling nodes. so their defaults should please string-handlers first.
with introducing a separate first-class datatype for byte handling all byte-handlers shall be pleased as well. so kinda win-win.
hope that makes sense. also please just ride the latest unicorn and see if you find anything no longer working as you’d expect.
Well, this is for sure going to take a while until it sinks into heads…
@joreg: I did what you suggested and changed the encoding to cp1252, and both examples work. Then I sat there, staring at my patch. I understand why it works, but for that you have to know all the internal and tricky details of encoding - the patch as it is (an image read and written with cp1252) is just ridiculous.
(Wouldn’t it be easier to implement (and easier to understand) a “raw bytes” encoding method in Reader than to build entirely new nodes? I don’t think that a change that is meant to make life easier is supposed to introduce dozens of new nodes (an entire new node class!).)
And I would like to suggest to make the encoding pin visible by default. Asking yourself “why the heck can’t I store downloaded images anymore?” isn’t great, you would basically force everyone to google and find this post here with explanations (be it “use Reader (bytes)” or “set the hidden encoding pin to cp1252”).
helo herbst, please check unicorn-in-alpha-builds new-datatype-raw
you’ll notice that the Encodingpin for Reader/Writer (String) is now actually visible. but for dealing with binary files you’d rather use Reader/Writer (Raw) and not deal with encodings at all.
we are quite certain that the new raw-datatype will make it more intuitive to handle binary data. so please give it a try and report your findings on what’s now there.
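As an analogy for what a raw reader/writer buys you, here is a small Python sketch that moves a file around in binary mode, so no encoding is ever involved. This is not how the vvvv nodes are implemented; `copy_raw` and the file names are hypothetical and only serve the illustration.

```python
import os
import tempfile


def copy_raw(src, dst):
    """Copy a file byte for byte; no text decoding happens anywhere."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(data)
    return data


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.bin")
    dst = os.path.join(tmp, "out.bin")

    # Fake "image" data containing every byte value several times.
    original = bytes(range(256)) * 4
    with open(src, "wb") as f:
        f.write(original)

    copy_raw(src, dst)
    with open(dst, "rb") as f:
        copied = f.read()

print(copied == original)
```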
Hi Joreg, as you said, “that fucks everything up” for me!
I really use the MySQL nodes a lot.. for how long will these nodes be removed from vvvv? In other words, how long will I have to stay on the last “non unicorn” version of vvvv? :(
I also use the RS232 node a lot, so I will try to run some tests..
But despite this really bad news, it’s great to see that vvvv is in a constant evolution.
are the database nodes broken, too?
and, what is the exact reason for that issue?
@u7angel: please test and report
@roger: as bjoern said. they’ll have to be rewritten which shouldn’t be too much of an issue seeing all the other database nodes implemented by vuxuser…
@sebl: they shouldn’t be. please report if you find troubles
first of all, i should express how happy this change makes me :))))))) well, despite all the f**kups that happen, the world deserves unicode. so, yes, me happy about this.
@u7angel & @joreg: i’ll look into the firmata stuff asap. have to find a board first…. i think it’s a simple fix. a fast patch should only involve throwing the suggested Encodingnode into the module. nevertheless the unicode support will be implemented in the plugin.
question: is there a way in the plugin SDK to ask for the standard encoding of the plugin host? might help other plugins too to keep them cross. even if the old versions do not provide this interface, it could be wrapped in a try/catch.
Tested it on lots of strings (> 50000 in a spread, different languages, including French). Autodetect encoding didn’t work (but they are utf8 without BOM, which is hard - but Notepad++ gets it right by heuristics). But after I selected utf8, all IOBoxes with strings showed the French characters correctly (had to manually select the encoding for them before).
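One common heuristic for BOM-less files is to try a strict UTF-8 decode first and fall back to a legacy codepage only if that fails, since text in an 8-bit codepage with accented characters is rarely valid UTF-8 by accident. The Python sketch below shows that idea only; it is not what vvvv or Notepad++ actually do, and `guess_decode` and its cp1252 fallback are assumptions made for the example.

```python
def guess_decode(data, fallback="cp1252"):
    """Return (text, encoding_used) for a byte string without a BOM."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed legacy codepage; undecodable bytes become U+FFFD.
        return data.decode(fallback, errors="replace"), fallback


print(guess_decode("déjà vu".encode("utf-8")))
print(guess_decode("déjà vu".encode("cp1252")))
```

Both calls recover the same text, but report different encodings, which is the kind of information a "which encoding did auto pick" output would expose.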
BUT: just found a bug, I think. Reading in an XML file with encoding set to utf8 in the Reader, the XML nodes do not work anymore. If I set it to auto, they do, but comparing the strings extracted from the XML (now in cp1252 or so) with the lot of 50k utf8 strings fails for all the french characters. Which means that to get it to work again I had to switch all Readers to Auto => all the comparing works, but the string IOBoxes do not show french characters correctly, as everything isn’t utf8 internally.
So, UTF8 works great, but breaks the newly-introduced XML nodes. At least with UTF8 without BOM. I made two patches, one in 28-2 and nearly the same in 28-3-unicorn, so see for yourself.
http://dl.dropbox.com/u/36620736/WEB/VVVV/Encoding_XML_Bug.7z
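Why string comparisons fail when the same bytes are decoded under two different encodings can be shown with a short Python sketch. It assumes cp1252 as the "wrong" codepage, which is only a guess in the report above; the sample word is made up.

```python
# The same UTF-8 bytes, decoded once correctly and once as cp1252.
utf8_bytes = "café".encode("utf-8")

correct = utf8_bytes.decode("utf-8")
mojibake = utf8_bytes.decode("cp1252")

# The two-byte UTF-8 sequence for "é" becomes two separate characters.
print(repr(correct), repr(mojibake))
print(correct == mojibake)
```

Plain ASCII characters compare equal either way, which is why only the strings with French characters fail.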
And a request: please show what encoding Reader thought the file was. Makes debugging easier, I think, as “auto” does not work for everything (e.g. no BOM, which is the official standard for utf8 as far as I know).
thank you for this great report herbst. new build is coming up. check back in about 20 minutes and test latest unicorn build. i’ve also added a “Selected Auto Encoding” output pin, which shows you the encoding selected by the system if input encoding is set to auto.
the issue you describe wasn’t really a bug in the xml nodes. in case of utf-8 the reader “thought” there must be a BOM and simply removed the first 3 bytes from the string. that was wrong of course and should be fixed now.
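The difference between the old and the fixed behavior, as described above, can be sketched in Python. This is a reconstruction from the description, not the actual vvvv source; both function names are hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # the three-byte UTF-8 byte order mark


def strip_bom_unconditionally(data):
    # Behavior as described for the bug: always drop the first 3 bytes.
    return data[3:]


def strip_bom_if_present(data):
    # Only drop the BOM when the data really starts with one.
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


xml_without_bom = b"<root/>"
xml_with_bom = UTF8_BOM + b"<root/>"

print(strip_bom_unconditionally(xml_without_bom))  # leading "<ro" is lost
print(strip_bom_if_present(xml_without_bom))
print(strip_bom_if_present(xml_with_bom))
```

With the opening bracket and part of the tag name cut off, an XML parser has nothing valid to work with, which fits the symptom that only BOM-less UTF8 files broke.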
Great, the XML reading is working now with the latest unicorn build.
One thing I’m definitely missing though: any way to tell the writer (or some other text-out node) “don’t touch anything, leave every single bit as it was”. Or, when does that happen? If I explicitly set the writer to utf8, as that’s the internal format? As I understand it, it crushes everything it does not understand as utf8 to some “unknown” character.
To clarify: what if I want to read some binary file with the Reader, and write it back to disk after some operations? Or, much more probable, want to fetch some images from the web and save them? Or fetch some binary data and save it? Reader needs a “raw” encoding, or an “Encode” toggle.
I attached two new scenes to test - one reads & writes an image, the other one fetches an image from the web. I wasn’t able to save either one back to disk in the correct format, it was always garbled (and definitely no image anymore).
http://dl.dropbox.com/u/36620736/WEB/VVVV/Encoding_images.7z
Please do not assume that everyone uses Reader & Writer for text files only :).
Edit: I think selecting the encoding at input/output and not where the content is used/generated may be a design flaw. In current beta, you read a file, you write a file, and every single byte simply goes through. If you want some specific encoding, you tell the IOBox what you want, and you can switch between encodings with Convert before you send it to disk, to web or to some device. Power where power is needed. And now, you are forced to select the encoding at the ins and outs and hope that nothing gets destroyed in between - what does + (String) do internally now, for example? Does it convert every character to utf8? What if I want to construct commands for devices, byte by byte, as was possible before? What if I want to use ord and char? And so on, and so on… I hope you will answer all of these questions before finally releasing, because it WILL cause a lot of confusion if strings do not behave like “appending bytes to some other bytes” anymore.
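The concern about building device commands byte by byte can be made concrete with a Python sketch, where bytes and text are separate types. The frame layout (start byte, opcode, length, payload, checksum) and the `build_command` helper are entirely hypothetical; they only show that a byte value and a character stop being the same thing once strings are unicode.

```python
def build_command(opcode, payload):
    """Assemble a frame for an imaginary serial device, byte by byte."""
    frame = bytearray([0x02, opcode, len(payload)])  # start, opcode, length
    frame.extend(payload)
    frame.append(sum(frame) % 256)  # made-up checksum over the whole frame
    return bytes(frame)


command = build_command(0x10, b"\x01\xff")
print(command.hex())

# One character is not necessarily one byte anymore:
char = chr(0xE9)                    # "é", a single code point
print(len(char.encode("utf-8")))    # two bytes on the wire as UTF-8
print(char.encode("latin-1"))       # one byte with an 8-bit encoding
```

Keeping such frames in a byte type until the moment they are sent avoids having to think about which encoding sits between the patch and the device.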