[libxml-devel] [ libxml-Bugs-21658 ] failure to parse and obey encoding when creating document

noreply at rubyforge.org noreply at rubyforge.org
Mon Nov 24 14:12:04 EST 2008


Bugs item #21658, was opened at 2008-08-24 11:01
You can respond by visiting: 
http://rubyforge.org/tracker/?func=detail&atid=1971&aid=21658&group_id=494

Category: None
Group: None
>Status: Closed
>Resolution: Accepted
Priority: 3
Submitted By: Nobody (None)
>Assigned to: Charlie Savage (cfis)
Summary: failure to parse and obey encoding when creating document

Initial Comment:
The following appeared on comp.lang.ruby:

===== quoted material follows

I have an XML request,
using the following code as an example:

require "rubygems"
require "xml/libxml"

movie = "sin+city"
search_url = 'http://www.movie-xml.com/interfaces/getmovie.php?moviename='
url = search_url+movie
doc = XML::Document.file(url)

Here's the response I get:

Input is not proper UTF-8, indicate encoding !

The source XML has an encoding declared as such:

<?xml version="1.0" encoding="ISO-8859-1"?>

===== end quoted material

Tested and confirmed. I tried the same operation with REXML and there was no problem. It looks like we are not reading the encoding given in the XML declaration up front and obeying it when parsing the body of the document.
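
A minimal sketch of the behaviour expected here, assuming Ruby 1.9 or later (for String encodings) and using only the standard library: read the encoding named in the XML declaration, then transcode the body to UTF-8. The helpers declared_encoding and to_utf8 and the sample payload are hypothetical, made up for illustration; they are not part of libxml-ruby and do not describe how libxml itself detects encodings.

```ruby
# Hypothetical helpers for illustration only; not libxml-ruby API.

# Return the encoding named in the XML declaration, or nil if none is given.
def declared_encoding(raw)
  head = raw.byteslice(0, 200).b   # look only at the start, as raw bytes
  head[/\A<\?xml[^>]*\bencoding\s*=\s*["']([A-Za-z0-9._-]+)["']/, 1]
end

# Interpret the raw bytes in the declared encoding (UTF-8 when no
# declaration is present) and transcode them to UTF-8.
def to_utf8(raw)
  name = declared_encoding(raw) || "UTF-8"
  raw.dup.force_encoding(name).encode(Encoding::UTF_8)
end

# Made-up sample standing in for the response body (0xFC is ISO-8859-1 "ü").
raw  = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>verg\xFCenza</t>".b
utf8 = to_utf8(raw)
puts utf8.valid_encoding?
```

Note that the transcoded string still carries the original declaration, so a caller would presumably also need to rewrite it or tell the parser the encoding explicitly.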

----------------------------------------------------------------------

>Comment By: Charlie Savage (cfis)
Date: 2008-11-24 12:12

Message:
No response - closing the issue.

----------------------------------------------------------------------

Comment By: Charlie Savage (cfis)
Date: 2008-11-15 17:47

Message:
This url is no longer valid.  Do you have another test case?

----------------------------------------------------------------------

Comment By: Eric Ivancich (ivancich)
Date: 2008-08-24 17:29

Message:
Twice in the XML data retrieved from the URL generated in the detailed description, the word "verg?enza" appears, where the "?" is the byte 0xFC, which encodes a lowercase "u" with umlaut in ISO-8859-1.  The byte 0xFC can never appear in valid UTF-8 data under RFC 3629.

So that adds further evidence that it's trying to parse the file as UTF-8 rather than ISO-8859-1.
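
The point about 0xFC can be checked with a short sketch, again assuming Ruby 1.9 or later; the byte string below is typed in by hand from the word quoted above rather than taken from the actual response.

```ruby
# The word as raw bytes, with 0xFC where ISO-8859-1 encodes "u" with umlaut.
bytes = "verg\xFCenza".b

as_utf8   = bytes.dup.force_encoding(Encoding::UTF_8)
as_latin1 = bytes.dup.force_encoding(Encoding::ISO_8859_1)

# Treated as UTF-8, the byte sequence is invalid.
puts as_utf8.valid_encoding?

# Treated as ISO-8859-1, it transcodes cleanly; 0xFC becomes U+00FC.
puts as_latin1.encode(Encoding::UTF_8)
```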

----------------------------------------------------------------------

Comment By: Erik Hollensbe (erikh)
Date: 2008-08-24 14:24

Message:
From this thread on ruby-talk: http://www.ruby-forum.com/topic/163524

----------------------------------------------------------------------
