
[#15513] how can we prevent conversion of a UTF-8 mo file in the samizdat package to another charset?

Date:
2007-11-10 00:03
Priority:
3
Submitted By:
boud indymedia (boud)
Assigned To:
Nobody (None)
Category:
None
State:
Open

Detailed description
hi ruby-gettext,


PROBLEM: There is a ruby package called samizdat (an RDF engine, which
can also be considered a so-called CMS). After reading in a mo file
created from a UTF-8 po file, it converts the strings back into what it
"thinks" is the best charset for that language. However, samizdat is
intended to work with strings entirely in UTF-8, and presents rendered
html pages through a web server stating that they are in UTF-8. So
there is a problem when the true encoding of the content of the pages
it sends is different from what it claims the encoding is (UTF-8).
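
The mismatch described above can be reproduced in a few lines. This is
only a sketch in modern Ruby (1.9 or later, using String#encode rather
than the Iconv library of that era), and the sample string is made up
for illustration:

```ruby
# A UTF-8 string as it might come out of a UTF-8 mo file (made-up sample).
utf8 = "p\u00E1gina"

# What the page contains after an unwanted conversion to ISO-8859-1.
latin1 = utf8.encode("ISO-8859-1")

# The web server still declares UTF-8, so the browser reads the same
# bytes as UTF-8. Relabel the bytes without converting them:
as_served = latin1.dup.force_encoding("UTF-8")

puts utf8.bytes.length          # the accented letter takes two bytes in UTF-8
puts latin1.bytes.length        # and a single byte in ISO-8859-1
puts as_served.valid_encoding?  # the lone 0xE1 byte is not valid UTF-8
```

The last line is the situation a browser ends up in: bytes that are
valid ISO-8859-1 but are announced as UTF-8.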

i reported this as a samizdat bug:

bug #21263: BUG: ruby-gettext-1.7.0-1 or later converts UTF-8 mo files
into another charset
https://savannah.nongnu.org/bugs/?21263

but Dmitry from samizdat has asked me to pass this on to the
ruby-gettext group, since it probably requires a deep understanding of
the ruby-gettext package to solve.


SUPPORT QUESTIONS: 

Can you suggest where this problem should be fixed?  Somewhere in
samizdat or somewhere in ruby-gettext? 

Is my suggested patch to samizdat (below) which overrides the
self.open method in the MOFile class a reasonable temporary solution?




SAMIZDAT:
main page: http://savannah.nongnu.org/
debian: http://ftp.debian.org/debian/pool/main/s/samizdat/
wiki: https://docs.indymedia.org/view/Devel/SamizdatEngine


WHERE SAMIZDAT THINKS IT SETS UTF-8:

See the samizdat_bindtextdomain method in samizdat/engine.rb to see
where samizdat tries to set utf-8 as the intended output charset. (The
whole file engine.rb is only 111 lines long, only 43 lines if you
ignore blank lines, comments and 'require' lines.)




EXAMPLES OF THE PROBLEM:

    * samizdat installed on server one: es and fr are converted to
      iso-8859-1 (but de remains in UTF-8)

    * samizdat installed on server two: converts eo to a charset
      which i haven't identified, but shows up as unknown symbols when
      interpreted as UTF-8.



CHECKS: i checked several times that the po files are in UTF-8, and
that the mo files were generated correctly with msgfmt, installed, and
actually read.


BUG HISTORY:
i reported this to samizdat back in January this year:
http://lists.gnu.org/archive/html/samizdat-devel/2007-01/msg00011.html
i thought it was solved with the samizdat-0.6.0 initial release, but it
probably only appeared solved because of several coincidences related
to the nature of the bug (see analysis below), e.g. i started the
samizdat init.d script from the command line in a UTF-8 environment.


PACKAGE VERSIONS:

    * samizdat jan 2007 (i don't remember what version): bug occurs
    * samizdat-0.6.0.20070818-1: bug occurs 
      20070818-1 is no longer the current snapshot, so here's a backup copy:
      https://docs.indymedia.org/pub/Devel/SamizdatEngine/samizdat-snapshot_070818.tar.gz

ruby-gettext version:

    * 0.8.0-1 (debian sarge!) bug does NOT occur
    * 1.7.0-1 bug occurs
    * 1.10.0 - i just tested one obvious chunk - bug still occurs


ENVIRONMENT VARIABLES:
Environment variables clearly play a role here. i was not able to
modify them from inside ruby, or at least not in any way that affected
the bug.

However, when setting e.g. export LC_ALL=ja_JP.UTF-8 in a shell and
then stopping/starting samizdat using the debian init.d script
/etc/init.d/samizdat and similarly stopping/starting apache (to clear
the cache), this *did* prevent charset conversion to non-UTF-8 charsets.

However, expecting a sysadmin to start up samizdat from an interactive
shell with one particular charset that is not convertible to latin-1
would clearly be a bad solution. samizdat is a service run through a
web server and should, for example, start up automatically on boot if
a machine has to be rebooted, without requiring human intervention.
Also, requiring any particular language + charset such as ja_JP.UTF-8
is not a general solution.
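
To make the dependence on the environment concrete, here is a rough
sketch of how a charset could be derived from POSIX-style locale
variables. The helper charset_from_env is hypothetical and is not
ruby-gettext's actual lookup; the precedence LC_ALL, LC_CTYPE, LANG is
the conventional POSIX order and is assumed here for illustration:

```ruby
# Hypothetical helper, not part of ruby-gettext: extract the codeset from
# locale values of the form language[_territory][.codeset][@modifier].
def charset_from_env(env)
  names = %w[LC_ALL LC_CTYPE LANG]
  value = names.map { |name| env[name] }.find { |v| v && !v.empty? }
  return nil unless value
  value[/\.([^@]+)/, 1]
end

p charset_from_env({ "LC_ALL" => "ja_JP.UTF-8" })  # explicit UTF-8 locale in a shell
p charset_from_env({ "LANG" => "es_ES" })          # no codeset given: nil
p charset_from_env({})                             # bare daemon environment: nil
```

When the result is nil, whatever consumes it has to fall back on some
default for the language, which would be consistent with the symptoms
reported above for a service started without an interactive shell.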


FURTHER ANALYSIS:

Commenting out the following iconv line in gettext/mo.rb
fixes the bug, but this would create problems for other ruby
users on the system - it is clearly a bad hack:

str = Iconv.iconv(@output_charset, @charset, str).join if @charset


But it does seem to confirm the analysis of what is wrong.
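
For readers without the ruby-gettext source at hand, the effect of that
line can be approximated as follows. This is a sketch in modern Ruby
that uses String#encode in place of Iconv; convert_msgstr and its
parameters are hypothetical names, and the extra nil check on the
output charset is an assumption added for the sketch:

```ruby
# Hypothetical stand-in for the conversion step; not ruby-gettext's code.
# charset:        charset declared in the mo file header (e.g. "UTF-8")
# output_charset: charset the library decided to emit (e.g. "ISO-8859-1")
def convert_msgstr(str, charset, output_charset)
  # With no source or target charset there is nothing to convert.
  return str if charset.nil? || output_charset.nil?
  str.dup.force_encoding(charset).encode(output_charset)
end

msg = "ni\u00F1o"  # made-up UTF-8 translation
converted = convert_msgstr(msg, "UTF-8", "ISO-8859-1")
untouched = convert_msgstr(msg, "UTF-8", nil)

p converted.encoding  # the string has left UTF-8
p untouched.encoding  # still UTF-8, as samizdat expects
```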


ANALYSIS SUMMARY:
Based on experiments to correct the bug, it seems to me that
ruby-gettext-1.7.0 or later is in some sense "trying to be too clever":
it decides to convert the strings of a UTF-8 mo file back to whatever
charset the system locale considers most "natural". For someone working
from a terminal (shell), this is probably a good idea. But for a
process called by a web server daemon it is probably not.


SOLUTIONS:
i tried solving it from a relatively high level,
i.e. in samizdat_bindtextdomain in engine.rb,
but the setting did not propagate down to mo.rb.

Going up from mo.rb, one solution could be changing this
line:

@@current.charset ||= @@locale_system_module.get_charset(@@current)

to be hardwired to @@current.charset = 'utf-8', though of
course this should be an override within samizdat, not
a change in the shared copy of the ruby source code.


SOLUTION FOUND: 

This patch 
 https://savannah.nongnu.org/bugs/download.php?file_id=14092

seems to work and seems relatively compact. It's in engine.rb of
samizdat and overrides the reading in of an output_charset in the
MOFile class, resetting it to nil, so that no attempt can be made to
make any conversion, since it's unknown to what charset the conversion
should be made.
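
The general shape of such an override can be sketched as below. The
class FakeMOFile, its open signature and the file path are all
hypothetical stand-ins; the real MOFile class in ruby-gettext may look
different, so this only illustrates the technique of reopening a class
from application code and discarding the output charset:

```ruby
# Stand-in for a library class; everything here is hypothetical.
class FakeMOFile
  attr_reader :path, :output_charset

  def self.open(path, output_charset = nil)
    new(path, output_charset)
  end

  def initialize(path, output_charset)
    @path = path
    @output_charset = output_charset
  end
end

# Application-side override: reopen the class and drop whatever output
# charset the caller (or the system locale) supplied.
class FakeMOFile
  class << self
    alias_method :open_with_charset, :open

    def open(path, _output_charset = nil)
      open_with_charset(path, nil)
    end
  end
end

mo = FakeMOFile.open("es/samizdat.mo", "ISO-8859-1")
p mo.output_charset  # nil, so no conversion target is known
```

Keeping the override in the application leaves the shared copy of the
library untouched for other ruby users on the system.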





cheers
boud 


Followup

Message
Date: 2008-02-18 00:27
Sender: boud indymedia

Sorry for the late reply!

Your patch -  s/locale.charset/opt[:locale].charset/ - 
works perfectly, thanks.  Thank you very much.

i've posted this on the samizdat bugs list:
https://savannah.nongnu.org/bugs/index.php?21263

and i'll close the bug in a week unless someone
objects there.

Date: 2007-11-10 02:34
Sender: Masao Mutoh

I tried to install samizdat, but it was too complex for me.
# I don't know why they install samizdat.mo by hand ...

But I found a bug in ruby-gettext's bindtextdomain.
Apply the patch below to lib/gettext.rb.

diff -u -r1.32 gettext.rb
--- gettext.rb  29 Oct 2007 15:32:39 -0000      1.32
+++ gettext.rb  10 Nov 2007 02:31:43 -0000
@@ -86,7 +86,7 @@
     end
     opt[:locale] = opt[:locale] ? Locale::Object.new(opt[:locale]) : Locale.get
     opt[:charset] = output_charset if output_charset
-    locale.charset = opt[:charset] if opt[:charset]
+    opt[:locale].charset = opt[:charset] if opt[:charset]
     Locale.set_current(opt[:locale])
     target_key = bound_target
     manager = @@__textdomainmanagers[target_key]
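
Reading the diff, the charset was being assigned to one object while a
different one, opt[:locale], was passed on to Locale.set_current. The
miniature below only illustrates that kind of mistake. LocaleStub,
bind_buggy and bind_fixed are hypothetical names, and what `locale`
actually refers to inside gettext.rb is not shown in the diff, so it is
an assumption here that it is a separate object:

```ruby
# Hypothetical miniature of the bug; LocaleStub is not a ruby-gettext class.
LocaleStub = Struct.new(:language, :charset)

def bind_buggy(opt, locale)
  opt[:locale] ||= LocaleStub.new("es", "ISO-8859-1")  # stands in for Locale.get
  locale.charset = opt[:charset] if opt[:charset]       # charset lands on the wrong object
  opt[:locale]                                          # this one becomes the current locale
end

def bind_fixed(opt, locale)
  opt[:locale] ||= LocaleStub.new("es", "ISO-8859-1")
  opt[:locale].charset = opt[:charset] if opt[:charset]
  opt[:locale]
end

other = LocaleStub.new("es", nil)
p bind_buggy({ charset: "utf-8" }, other).charset  # requested charset is lost
p bind_fixed({ charset: "utf-8" }, other).charset  # requested charset is applied
```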

If this doesn't help you, try:

1. If samizdat is a CGI application, require 'gettext/cgi', not
'gettext'
2. GetText.output_charset = "utf-8"
   (before any call to bindtextdomain)
3. Set OUTPUT_CHARSET as an environment variable.
    $ export OUTPUT_CHARSET=utf-8

If none of these solves your problem, check how Locale.charset is used
in samizdat. I suspect it sets Locale.charset to the system locale
somewhere before calling gettext methods like _().
