[Ironruby-core] Code Review: MutableString5

Tomas Matousek Tomas.Matousek at microsoft.com
Fri May 9 14:08:07 EDT 2008

tfpt review /shelveset:MutableString5;REDMOND\tomat

A new implementation for Ruby MutableString and Ruby regular expression wrappers.
This is just the first pass, without optimizations and without encoding support (the default system encoding is used for all strings).
Many improvements and adjustments will come in the future, and some hacks will be removed.

Basic architecture:
MutableString holds a Content and an Encoding. Content is an abstract class that has three subclasses:

1)      StringContent

-          Holds an instance of System.String - an immutable .NET string. This is the default representation for strings coming from CLR methods and for Ruby string literals.

-          A textual write operation on a mutable string with this content representation implicitly converts the representation to StringBuilderContent.

-          A binary read/write operation triggers a transition to BinaryContent using the Encoding stored on the owning MutableString.

2)      StringBuilderContent

-          Holds an instance of System.Text.StringBuilder - a mutable Unicode string.

-          A binary read/write operation transforms the content to the BinaryContent representation.

-          StringBuilder is not optimal for some operations (it requires unnecessary copying); we may consider replacing it with a resizable char[].

3)      BinaryContent

-          A textual read/write operation transforms the content to the StringBuilderContent representation.

-          List<byte> is currently used, but it doesn't fit many operations very well. We should replace it with a resizable byte[].
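
To make the transitions above concrete, here is a minimal toy model in Python. It is not the IronRuby code: the method names (append_text, append_byte, get_byte, get_char) and the "kind" field are hypothetical, a Python list of characters stands in for StringBuilder, and a bytearray stands in for List<byte>. It only illustrates which operations switch the representation.

```python
class MutableString:
    """Toy model of the content-switching scheme (hypothetical names)."""

    def __init__(self, text, encoding="utf-8"):
        self.encoding = encoding  # remembered for text <-> binary conversions
        self.kind = "String"      # "String", "StringBuilder" or "Binary"
        self.content = text       # str, list of characters, or bytearray

    def _to_string_builder(self):
        if self.kind == "String":
            self.content = list(self.content)
        elif self.kind == "Binary":
            self.content = list(bytes(self.content).decode(self.encoding))
        self.kind = "StringBuilder"

    def _to_binary(self):
        if self.kind != "Binary":
            text = "".join(self.content)  # works for both str and list of chars
            self.content = bytearray(text.encode(self.encoding))
            self.kind = "Binary"

    def append_text(self, s):
        # Textual write: String or Binary content becomes StringBuilder content.
        self._to_string_builder()
        self.content.extend(s)

    def get_char(self, i):
        # Textual read: only Binary content has to be converted.
        if self.kind == "Binary":
            self._to_string_builder()
        return self.content[i]

    def append_byte(self, b):
        # Binary write: any textual content becomes Binary content.
        self._to_binary()
        self.content.append(b)

    def get_byte(self, i):
        # Binary read: any textual content becomes Binary content.
        self._to_binary()
        return self.content[i]
```

In this model a string literal starts as "String", the first append_text moves it to "StringBuilder", and the first get_byte or append_byte moves it to "Binary" using the stored encoding.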

The content representation changes based on the operations performed on the mutable string. There is currently no limit on the number of content type switches, so if one alternates binary and textual operations, a conversion takes place for each of them. Although this shouldn't be a common case, we may consider adding counters and keeping the representation binary or textual based on their values.
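
A small sketch of why alternation is the worst case. It is a simplification that treats the representation as just "text" or "binary" and assumes every operation of the other kind forces a conversion; the function name count_conversions is made up for this illustration.

```python
def count_conversions(ops, start="text"):
    """Count representation switches for a sequence of 'text'/'binary' operations."""
    current = start
    conversions = 0
    for op in ops:
        if op != current:
            # The operation needs the other representation: convert first.
            conversions += 1
            current = op
    return conversions


alternating = ["binary", "text"] * 4           # 8 operations, interleaved
grouped = ["binary"] * 4 + ["text"] * 4        # same 8 operations, grouped

print(count_conversions(alternating))  # 8
print(count_conversions(grouped))      # 2
```

The same eight operations cost eight conversions when interleaved and two when grouped, which is the situation the counters mentioned above would be meant to detect.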

The design assumes that the operations implemented by library methods are of two kinds, textual and binary, and that data once treated as text are not usually treated as raw binary data later. Any text in the IronRuby runtime is represented as a sequence of 16-bit Unicode characters (the standard .NET representation). Any binary data treated as text is converted to this representation, regardless of the encoding used for storage in the file. The encoding is remembered in the MutableString instance, so the original representation can always be recreated. Not all Unicode characters fit into 16 bits, therefore some exotic ones are represented by multiple characters (surrogates). If there is such a character in the string, some operations (e.g. indexing) might not be precise anymore: the n-th item in the char[] isn't necessarily the n-th Unicode character in the string (there might be surrogates before it). We believe this imprecision is not a real-world issue and is worth the performance gain and implementation simplicity.
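
The indexing imprecision can be seen with a short Python sketch. Python strings index by code point, so the helper below (utf16_units, a name made up for this example) encodes to UTF-16 to mimic what a .NET char[] would hold.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text, as a .NET char[] would store them."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


s = "a\U0001D11Eb"  # 'a', U+1D11E MUSICAL SYMBOL G CLEF (outside 16 bits), 'b'
units = utf16_units(s)

print(len(s))                        # 3 Unicode characters
print(len(units))                    # 4 16-bit units
print(hex(units[1]), hex(units[2]))  # 0xd834 0xdd1e (the surrogate pair)
print(chr(units[3]))                 # b
```

Here 'b' is the third Unicode character but the fourth 16-bit unit, so indexing by unit is off by one after the surrogate pair.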

-------------- next part --------------
A non-text attachment was scrubbed...
Name: MutableString5.diff
Type: application/octet-stream
Size: 326215 bytes
Desc: MutableString5.diff
URL: <http://rubyforge.org/pipermail/ironruby-core/attachments/20080509/ba9a7f1c/attachment-0001.obj>