[Ironruby-core] Bytes or Characters?

Charles Oliver Nutter charles.nutter at Sun.COM
Fri Aug 8 16:49:15 EDT 2008


Tomas Matousek wrote:
> The content representation is changed based upon the operations performed on the mutable string. There is currently no limit on the number of content-type switches, so if one alternates binary and textual operations, the conversion will take place for each of them. Although this shouldn't be a common case, we may consider adding some counters and keeping the representation binary/textual based upon their values.

Ok, so what constitutes a binary operation and what constitutes a 
textual operation? It seems like the potential for ping-ponging between 
the two representations would be a serious risk. That's largely why we 
ended up going with a single representation: so many APIs pass Strings 
around, manipulate them, index specific characters, write them through 
some stream to somewhere else, and repeat.
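
To make the ping-ponging concrete, here's a purely hypothetical 
sequence (I'm guessing at the categories: indexed byte access and 
assignment as "binary", regexp and gsub as "textual"):

  str = File.open("data.bin", "rb") { |f| f.read }  # raw bytes in
  str[0]                  # binary: read a byte
  str.gsub!(/\r\n/, "\n") # textual: does this force a char conversion?
  str[0] = 0xFF           # binary: back to bytes?
  str =~ /header/         # textual: another conversion?

If each of those lines flips the representation, a loop doing 
read/match/write would pay for a conversion on every pass.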

Of course, if the ping-pong isn't bad, there could probably be a 
formalized list of rules. Such a set of "binary" operations and 
"textual" operations would be useful to JRuby and MacRuby, in addition 
to IronRuby.

Here's an example we ran into, however: regexp matching against binary 
content. I know of at least one library that uses a regexp to parse out 
a binary file header. How would this work under IronRuby? There's also 
the concern about conversion from binary to text at inopportune 
moments, which could, for example, corrupt binary content that can't be 
decoded into valid UTF-16 characters. In our case, long ago, we 
represented all such binary content as "plain-encoded" UTF-16 with only 
the low byte set, but that obviously wasn't a whole lot better than 
just using bytes, and it was also way slower.
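
As a concrete (if contrived) sketch, consider checking a PNG signature, 
whose first byte (0x89) is never a valid leading byte in UTF-8:

  header = File.open("image.png", "rb") { |f| f.read(8) }
  if header =~ /\A\x89PNG\r\n\x1a\n/n  # /n = no encoding assumed
    puts "looks like a PNG"
  end

That \x89 can't be decoded as the start of a UTF-8 sequence, so a 
binary-to-text conversion has to either guess an encoding or fail.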

I imagine this would impact copy-on-write capabilities too, yes? There 
would be operations that could completely change the backing store of a 
string.

> The design assumes that the operations implemented by library methods are of two kinds, textual and binary, and that data once treated as text is not usually treated as raw binary data later. Any text in the IronRuby runtime is represented as a sequence of 16-bit Unicode characters (the standard .NET representation). Binary data treated as text is converted to this representation, regardless of the encoding used for its storage representation in the file. The encoding is remembered in the MutableString instance, so the original representation can always be recreated. Not all Unicode characters fit into 16 bits, so some exotic ones are represented by multiple characters (surrogates). If there is such a character in the string, some operations (e.g. indexing) might no longer be precise - the n-th item in the char[] isn't the n-th Unicode character in the string. We believe this imprecision is not a real-world issue and is worth the performance gain and implementation simplicity.
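
To spell out the surrogate imprecision with a made-up value (1.9-style 
escape for readability):

  s = "a\u{1D11E}b"  # MUSICAL SYMBOL G CLEF, outside the BMP
  # logical characters:         3 (a, clef, b)
  # UTF-16 code units (char[]): 4 (the clef needs a surrogate pair)
  # so s[2] against the char[] yields the clef's low surrogate,
  # not the "b" a user would expect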

I guess one obvious question here would be supporting multiple 
encodings, as in Ruby 1.9. With a byte[]-based string and JOni (our 
port of Oniguruma), it shouldn't be too difficult to add 1.9's string 
logic to JRuby. But it seems like it would be harder if we put in place 
the same rules you have for converting text into the platform's 
preferred format under certain circumstances.
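
For reference, the 1.9 model (as I understand the current bits) tags 
the raw bytes with an encoding instead of converting them:

  s = "caf\xC3\xA9".force_encoding("UTF-8")
  s.bytesize  # => 5, raw bytes untouched
  s.length    # => 4, characters decoded on demand
  s.force_encoding("ASCII-8BIT").length  # => 5, same bytes reinterpreted

A byte[] plus an encoding tag never has to transcode just to change how 
the data is interpreted; a UTF-16 backing store would.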

- Charlie

