[Mechanize-users] Nokogiri encoding bug

Alex Young alex at blackkettle.org
Wed Jun 17 15:43:13 EDT 2009

Alex Young wrote:
> Hi there,
> That being said, a quick fix would be to simply ignore the value 
> that comes back from the parser. Since we've already got the encoding, 
> what more can the parser tell us? I don't understand that bit yet.

Just a quick follow-up. The simplest patch that does this is:

diff --git a/lib/www/mechanize/page.rb b/lib/www/mechanize/page.rb
index 1f7d884..ac6909e 100644
--- a/lib/www/mechanize/page.rb
+++ b/lib/www/mechanize/page.rb
@@ -64,7 +64,7 @@ module WWW

        def encoding
-        parser.respond_to?(:encoding) ? parser.encoding : nil
+        (parser.respond_to?(:encoding) ? parser.encoding : nil) || 

        def parser

With that applied, all the tests pass except the following:

   1) Failure:
test_another_mostly_broken_charset(TestPage) [./test/test_page.rb:32]:
<"UTF8"> expected but was

   2) Failure:
<"ISO-8859-2"> expected but was

   3) Failure:
<"ISO-8859-2"> expected but was

   4) Failure:
test_page_decoded_with_charset(TestPage) [./test/test_page.rb:99]:
<"EUC-JP"> expected but was

   5) Failure:
test_set_encoding(TestPage) [./test/test_page.rb:69]:
<"UTF-8"> expected but was

291 tests, 1502 assertions, 5 failures, 0 errors

There's clearly something screwy going on with libxml2's HTMLParser, but 
I don't have much more than that yet.
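
For anyone who wants to poke at the idea outside of Mechanize, here is a minimal standalone sketch of the fallback in the patch: use whatever the parser reports, and only fall back when it reports nil (or has no encoding method at all). FakeParser, PageSketch and known_encoding are hypothetical names made up for illustration, not Mechanize's API, and the value on the right of the || is an assumption standing in for "the encoding we've already got".

```ruby
# Hypothetical stand-in for a parser object; not Nokogiri's class.
class FakeParser
  attr_reader :encoding

  def initialize(encoding)
    @encoding = encoding
  end
end

# Hypothetical stand-in for the page; not WWW::Mechanize::Page.
class PageSketch
  def initialize(parser, known_encoding)
    @parser = parser
    # Assumption: the encoding we had already worked out before parsing.
    @known_encoding = known_encoding
  end

  # Prefer the parser's answer; fall back to the known encoding when the
  # parser returns nil or does not respond to #encoding.
  def encoding
    (@parser.respond_to?(:encoding) ? @parser.encoding : nil) || @known_encoding
  end
end

page = PageSketch.new(FakeParser.new(nil), "UTF-8")
puts page.encoding
```

Note that this only covers the nil case. If the parser hands back a wrong but non-nil value, the fallback never kicks in.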

Hope this helps.

