[fxruby-users] Unicode support in FXRuby 1.6

Gonzalo Garramuno ggarra at advancedsl.com.ar
Sat Sep 3 03:24:12 EDT 2005


That's pretty much correct.  Ruby's Unicode support is somewhat weak
compared to Python's or Perl's.
Only UTF-8 is supported.  There is no support for UTF-16, afaik.

Basically...  here's everything you wanted to know about ruby's Unicode but
were afraid to ask....

* $KCODE can be set to support an encoding directly, but this is *NOT*
needed to have a script work with unicode.
It is just a simple shortcut so that any regex like /./ will do the right
thing.

* Without $KCODE, regexp with unicode support is still available.  It is done
using the /u option, like
t =~ //u
or
Regexp.new(regex, options, 'u')
(the other per-regex encoding flags are //n for none, //e for EUC and //s
for SJIS; note that //m is multiline mode, which makes . match newlines,
and has nothing to do with multi-byte.  I believe these flags are no longer
needed once $KCODE is set, as that will already adjust all regexes).
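
A small sketch of the /u flag at work, assuming the string holds UTF-8
bytes.  The helper name utf8_chars is made up for illustration, it is not
part of ruby:

```ruby
# Split a UTF-8 string into characters with a /u regex.
# utf8_chars is a hypothetical helper name, not a core method.
def utf8_chars(str)
  str.scan(/./u)   # /u makes . match one UTF-8 character, not one byte
end

copyright = "\xc2\xa9 2005"     # "\xc2\xa9" is U+00A9 encoded as UTF-8
p utf8_chars(copyright).size    # 6 characters, although the string is 7 bytes
```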

* Supporting u"" like Python can be added to some extent very easily.  See:
http://redhanded.hobix.com/inspect/closingInOnUnicodeWithJcode.html
This allows you to then do:
c = u'U+00a9'  # same as \xc2\xa9
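
To give an idea of how little is needed, here is a rough stand-in for such
a helper.  This is *not* Why's code, just a sketch that assumes every
escape is written as U+ followed by exactly four hex digits:

```ruby
# Hypothetical stand-in for the u'' helper: expands every U+XXXX
# (exactly four hex digits, an assumption of this sketch) into the
# UTF-8 bytes for that code point.
def u(str)
  str.gsub(/U\+([0-9a-fA-F]{4})/) { [$1.hex].pack('U') }
end

c = u('U+00a9')
p c.unpack('C*')   # [194, 169], i.e. the bytes \xc2\xa9
```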

*  You can also use:
     [].pack('U*')
     "".unpack('U*')
     to pack/unpack utf-8 strings.  This allows you to easily count
characters and iterate through them,
     without the need for jcode (which really is only needed for getting succ
to work).
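
For instance, counting and iterating could look roughly like this.  The
two helper names are invented for the example:

```ruby
# Hypothetical helpers built on the 'U*' template.
def utf8_length(str)
  str.unpack('U*').length            # one Integer (code point) per character
end

def each_utf8_char(str)
  # Re-pack each code point so the block receives a one-character string.
  str.unpack('U*').each { |cp| yield [cp].pack('U') }
end

question = "\xc2\xbfHabla espa\xc3\xb1ol?"   # UTF-8 bytes
puts utf8_length(question)                   # 15 characters (the string is 17 bytes)
each_utf8_char(question) { |ch| print ch, ' ' }
puts
```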

* jcode.rb is kind of a ruby hack and it is incomplete.  Methods such as:
reverse, capitalize, casecmp, swapcase, all the strip functions and probably
others are not redefined by it, so the stock byte-oriented versions still
run and can return incorrect results, depending on the language.
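
For the simplest of those, reverse, the pack/unpack trick is enough to get
by.  A sketch only (utf8_reverse is a made-up name); it does nothing for
the case-conversion methods, and combining sequences would still come out
in the wrong order:

```ruby
# Hypothetical helper: reverse by code point rather than by byte.
def utf8_reverse(str)
  str.unpack('U*').reverse.pack('U*')
end

p utf8_reverse("a\xc3\xb1o").unpack('U*')   # [111, 241, 97]
```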

* Ruby's $KCODE does not add a UTF-8 <-> Latin1 encoding conversion, unlike
Python's unicode strings.  So, although with the above you can do:

question = u'U+00bfHabla espaU+00f1ol?'  # ¿Habla español?
puts question

similar to Python's:
question = u'\u00bfHabla espa\u00f1ol?'  # ¿Habla español?
print question

you will not get the corresponding Latin1 string when you print it (unlike
Python's unicode strings).

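* To properly do the above, and convert Latin1<->UTF8 for printing, you
should use iconv.
    ruby -riconv -e 'puts Iconv.iconv("UTF-8", "ISO-8859-1", "\xf1")'
   (the arguments are the target encoding, the source encoding, then the
   strings to convert; an array comes back).
   Iconv, by default, does *NOT* get installed by the One-Click Windows
installer, even though it is supposed to be a
   standard part of ruby.
   Adding something then like:
          require 'iconv'

          class UString
                 # Return the UTF-8 contents converted to Latin1 for printing.
                 # Iconv.iconv takes (to, from, *strings) and returns an array.
                 def to_s
                     Iconv.iconv("ISO-8859-1", "UTF-8", self).join
                 end
           end
   will do the trick for Why's UString class.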
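
If iconv is not around, Latin1 is simple enough that the pack/unpack trick
can stand in for it.  This is a swapped-in technique, not iconv, and it
only works because Latin1 bytes 0x00-0xFF are the same numbers as Unicode
code points U+0000-U+00FF.  Both helper names are hypothetical:

```ruby
# Hypothetical helpers; they rely on Latin-1 byte values being identical
# to the first 256 Unicode code points.
def latin1_to_utf8(str)
  str.unpack('C*').pack('U*')        # each byte becomes one code point
end

def utf8_to_latin1(str)
  codepoints = str.unpack('U*')
  if codepoints.any? { |cp| cp > 0xFF }
    raise ArgumentError, 'string has characters outside Latin-1'
  end
  codepoints.pack('C*')              # each code point becomes one byte
end

p latin1_to_utf8("\xf1").unpack('C*')       # [195, 177], i.e. "\xc3\xb1"
p utf8_to_latin1("\xc3\xb1").unpack('C*')   # [241], i.e. "\xf1"
```

For any other encoding pair you are back to needing iconv.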

* The ruby interpreter should have no problem reading a utf-8 .rb script
file, but you have to run it with the -Ku switch:
> ruby -Ku file.rb  (or set RUBYOPT to -Ku, so ruby always runs with that)
Note, however, that Windows Notepad, when saving UTF-8 files, adds a valid
albeit meaningless 3-byte BOM (byte-order mark, EF BB BF) at the start, which
does not work with ruby 1.8 (and will also break unix shebang lines on
most -all?- unixes).  The sequence is valid UTF-8 and is allowed by the
standard, but UTF-8 has no byte order to mark, so it carries no information.
Ruby, just like Unix shebangs, does not deal with it appropriately.
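
One possible workaround is to strip the three bytes yourself before the
file gets anywhere near ruby.  A sketch, with strip_utf8_bom being a
made-up helper name:

```ruby
# Hypothetical helper: return data without a leading UTF-8 BOM (EF BB BF).
def strip_utf8_bom(data)
  head, rest = data.unpack('a3a*')   # first three bytes, then the remainder
  head.unpack('C*') == [0xEF, 0xBB, 0xBF] ? rest : data
end

source = "\xEF\xBB\xBFputs 1\n"      # roughly what Notepad would save
p strip_utf8_bom(source)             # "puts 1\n"
```

Read the file in binary mode and write the result back, and the script
should then start with its real first line.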


