
[#23067] Serializing string containing non-ascii characters

Date: 2008-12-02 15:53
Priority: 3
Submitted By: Vladimir Dobriakov (geekq)
Assigned To: Nobody (None)
Category: None
State: Open
Summary: Serializing string containing non-ascii characters

Detailed description
We had a problem making calls with string parameters
containing non-ASCII characters.

The hessian spec describes the string serialization:
http://hessian.caucho.com/doc/hessian-1.0-spec.xtp#string

I've attached a patch that implements the spec
for non-ASCII strings too. It uses unpack to calculate the string
length.
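
As a rough illustration of the rule being implemented, here is a minimal sketch of a Hessian 1.0 string writer. It is not the attached patch: the name `hessian_string` is hypothetical, it assumes Ruby 1.9 or later and UTF-8 input, and it follows one reading of the spec linked above (tag `S`, a 16-bit big-endian length counted in characters rather than bytes, then the UTF-8 data). Strings longer than 65535 characters and characters outside the Basic Multilingual Plane are not handled.

```ruby
# Hypothetical helper, not the code from fix_pack_string_properly.diff.
# Assumes +str+ holds valid UTF-8 and has at most 65535 characters.
def hessian_string(str)
  # unpack('U*') decodes UTF-8 into code points, so the array length is
  # the character count even when a character occupies several bytes.
  char_count = str.unpack('U*').length
  # 'a'  -> the one-byte tag 'S'
  # 'n'  -> the length as a 16-bit big-endian unsigned integer
  # 'a*' -> the string's bytes, copied unchanged
  ['S', char_count, str].pack('ana*')
end
```

The point of the sketch is only that the length field and the byte count of the payload differ as soon as a character needs more than one byte.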

A test case representing the problem is also provided.

Best Regards,

geekQ

http://blog.geekq.net


Followup

Message
Date: 2009-01-23 14:57
Sender: S. Leger

Hi,

If you are interested, after Vladimir commits his changes, I
can easily create a patch with the updates for UTF-8 return values
AND for the changes in bug 17631. These changes were submitted
by Larry K., a former coworker of mine, and provide better mapping
of java.lang.Exception types into Ruby. It would be very nice
to get these committed!

http://rubyforge.org/tracker/index.php?func=detail&aid=17631&group_id=1270&atid=4995
Date: 2009-01-23 09:46
Sender: Vladimir Dobriakov

> Would it be okay if I add you as a committer Vladimir?

Sure, then I would just run

`patch -p0 -i fix_pack_string_properly.diff`

and increase the version number.

What about releasing the new gem version? Would you still
do that?

Best Regards

Vladimir AKA geekQ
Date: 2009-01-23 09:18
Sender: Christer Sandberg

Hi to you both!

Sorry for not applying the patch yet. I'm totally occupied with my
current project and my mind is elsewhere right now. Would it be okay
if I add you as a committer, Vladimir? Then you can apply all the
necessary patches and everyone will be happy and at ease ;)

/Christer
Date: 2009-01-21 21:09
Sender: S. Leger

Update: I was correct, that function was mangling multi-byte
strings. I've changed the function so that it no longer performs
the pack('C*') conversion, which works for ASCII but mangles other
UTF-8 values.

It looks a bit unintuitive because you might expect that s.length
would return the number of characters. However, in Ruby 1.8 it returns
the number of bytes (since strings there are basically just byte
arrays).

# +returns+:: An array, [0] = parsed string, 
#  [1] = number of bytes to slice
#  out of the data stream that were 
#  occupied by the string
def from_utf8(len = '*')
  s = @data.unpack("U#{len}").pack('U*')
  [ s, s.length ]
end
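
On Ruby 1.9 and later, String#length counts characters rather than bytes, so the version above would report too few bytes for multi-byte text. A minimal sketch for newer Rubies follows, assuming String#bytesize is available. It takes the buffer as a `data` argument instead of reading `@data` only so that it is self-contained; that parameter is an assumption of this sketch, not part of the library.

```ruby
# Sketch for Ruby 1.9+; +data+ stands in for the library's @data buffer.
# Returns [parsed string, number of bytes the string occupied in +data+].
def from_utf8(data, len = '*')
  # unpack('U…') decodes +len+ UTF-8 characters from the raw bytes,
  # and pack('U*') re-encodes them as a UTF-8 string.
  s = data.unpack("U#{len}").pack('U*')
  # bytesize counts bytes regardless of how String#length behaves.
  [s, s.bytesize]
end
```

For example, reading two characters from a buffer that starts with two three-byte Japanese characters should yield a byte count of 6, while the character count stays at 2.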


I'm gonna throw some keywords in this bug so that search engines
may pick up this thread more readily: UTF-8 Ruby Hessian mangled
Japanese Chinese Russian multibyte unicode
Date: 2009-01-21 19:06
Sender: S. Leger

Is it possible that this type of problem also affects UTF-8 strings
in return values? I am having trouble decoding an array (coming
out of a Hessian call) of UTF-8 text that contains Japanese-language
strings. I suspect that the same type of problem is occurring
in the decoding routine:

def from_utf8(len = '*')
  s = @data.unpack("U#{len}").pack('C*')
  [ s, s.unpack('C*').pack('U*').length ]
end

Also it does not appear that the original patch attached to this
bug has been applied to the SVN repo yet (as of 2009-01-21).
Date: 2008-12-05 22:55
Sender: Vladimir Dobriakov

BTW, neither the original nor my proposed implementation
works with long strings (more than 64K characters). For that you
need to split the string into chunks of fewer than 64K characters
(not bytes). As long as nobody requests this functionality, we
should probably at least raise a meaningful exception like

raise 'chunking is not supported yet' if length > 65535
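
If someone does need longer strings, the chunking might look roughly like the sketch below. It is an illustration under stated assumptions rather than part of the patch: `hessian_string_chunks` and `chunk_size` are hypothetical names, it needs Ruby 1.9 or later and UTF-8 input, and it relies on one reading of the Hessian 1.0 grammar, in which every non-final chunk is tagged `s` and the final one `S`.

```ruby
# Hypothetical sketch, not part of fix_pack_string_properly.diff.
# Splits +str+ by characters (not bytes) and emits one Hessian chunk
# per slice: 's' for every chunk except the last, 'S' for the last.
def hessian_string_chunks(str, chunk_size = 0xFFFF)
  code_points = str.unpack('U*')
  # An empty string is still written as a single final chunk.
  return ['S', 0].pack('an') if code_points.empty?

  slices = code_points.each_slice(chunk_size).to_a
  slices.each_with_index.map do |slice, i|
    tag = (i == slices.length - 1) ? 'S' : 's'
    # Length is the character count of this slice; the payload is the
    # slice re-encoded as UTF-8 bytes.
    [tag, slice.length, slice.pack('U*')].pack('ana*')
  end.join
end
```

Slicing the code point array rather than the byte string keeps a multi-byte character from being cut in half at a chunk boundary.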

Regards,

Vladimir
Date: 2008-12-04 08:34
Sender: Christer Sandberg

Hi Vladimir!

I will apply the patch and make a new release as soon as I can.
I can't fix this while at work, but I will do it this weekend
for sure.

Thanks,
Christer
Date: 2008-12-02 19:20
Sender: Vladimir Dobriakov

One additional note: the patch assumes that the input
strings are UTF-8 encoded. We use the hessian library in a
Rails application where strings are UTF-8 encoded by
default. (English text encoded as ASCII works fine because
the bytes are the same as in UTF-8.)

The patch has also been tested with Cyrillic and Japanese strings.

Assuming UTF-8 is the right approach, I think. It
is not possible to reliably detect the encoding of a string
automatically. We either have to assume some encoding that
is suitable for everyone or extend the API so that the encoding
is provided by the caller.
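
Extending the API could be as small as an optional encoding argument. A sketch, assuming Ruby 1.9 or later, which tracks string encodings (on 1.8 something like Iconv would be needed instead). `to_utf8` and `source_encoding` are hypothetical names, not part of the library.

```ruby
# Hypothetical helper: the caller states the encoding of +str+ and gets
# back a UTF-8 copy that is ready for serialization.
def to_utf8(str, source_encoding = 'UTF-8')
  # Relabel a copy of the bytes with the encoding the caller claims.
  tagged = str.dup.force_encoding(source_encoding)
  unless tagged.valid_encoding?
    raise ArgumentError, "input is not valid #{source_encoding}"
  end
  # Transcode to UTF-8 (effectively a no-op when it already is UTF-8).
  tagged.encode('UTF-8')
end
```

With the default argument this keeps the patch's current assumption, while a caller holding, say, ISO-8859-1 data can pass that encoding explicitly.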

Attached Files:

fix_pack_string_properly.diff

Changes:

Field       Old Value                            Date              By
File Added  4197: fix_pack_string_properly.diff  2008-12-02 15:53  geekq