<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
<tt>Thanks Matt, but it does not work... <br>
<br>
I've investigated again:<br>
agent = WWW::Mechanize.new</tt><br>
<tt>page = agent.get("<a moz-do-not-send="true" target="_blank"
href="http://www.google.fr">http://www.google.fr</a>")<br>
page.content_type gives me "text/html; charset=ISO-8859-1", which is
WRONG: it should be UTF-8. I would appreciate it if somebody else could
run the same test.<br>
<br>
I followed your advice, and <br>
page.body.toutf8 has the same effect as
Iconv.conv('ISO-8859-1//IGNORE', 'UTF-8', page.body):<br>
it removes all accented characters from the body.<br>
</tt><br>
<tt>I really don't understand</tt><br>
<br>
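<tt>A possible explanation, offered as a sketch rather than a confirmed
diagnosis: Iconv.conv takes its arguments as (to, from, string), so the call
above converts from UTF-8 to ISO-8859-1. If the body really is ISO-8859-1,
its accented bytes are not valid UTF-8 input and //IGNORE would discard them.
The example below assumes Ruby 1.9 or later and uses String#encode instead of
Iconv or Kconv; charset_from and to_utf8 are hypothetical helper names, and
the sample bytes stand in for page.body.</tt><br>
<pre>
```ruby
# Assumes Ruby 2.1 or later for String#b and String#scrub; String#encode
# itself is available from Ruby 1.9. Not tested against Mechanize 0.7.6.

# Hypothetical helper: pull the charset out of a Content-Type header value,
# falling back to a default when none is declared.
def charset_from(content_type, default = "ISO-8859-1")
  match = /charset=([^;\s]+)/i.match(content_type.to_s)
  match ? match[1] : default
end

# Hypothetical helper: label the raw bytes with the declared charset,
# then transcode them to UTF-8.
def to_utf8(body, content_type)
  body.dup.force_encoding(charset_from(content_type)).encode("UTF-8")
end

# Sample ISO-8859-1 bytes: "caf" followed by 0xE9 (e with an acute accent).
latin1_body = "caf\xE9".b
header = "text/html; charset=ISO-8859-1"

converted = to_utf8(latin1_body, header)
puts converted                 # the accented letter survives
puts converted.valid_encoding? # true

# Treating the same bytes as UTF-8 and discarding invalid sequences loses
# the accented letter, which matches the symptom described above.
dropped = latin1_body.dup.force_encoding("UTF-8").scrub("")
puts dropped                   # caf
```
</pre>
<tt>With the old Iconv API, the equivalent direction would be
Iconv.conv('UTF-8', 'ISO-8859-1', page.body), target encoding first.</tt><br>
<br>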
On 17/07/2008 16:57, Matt White wrote:
<blockquote cite="mid:911480.12929.qm@web53303.mail.re2.yahoo.com"
type="cite">
<style type="text/css"><!-- DIV {margin:0px;} --></style>
<div
style="font-family: times new roman,new york,times,serif; font-size: 12pt;">
<div>Christophe,<br>
<br>
If you're doing this within Rails (which it appears you are), just use
string.toutf8. This method is part of the Kconv module which it appears
Rails includes by default.<br>
<br>
Output from my script/console:<br>
<br>
>> Kconv.toutf8 "string"<br>
=> "string"<br>
>> "string".toutf8<br>
=> "string"<br>
>> toutf8<br>
<br>
HTH,<br>
Matt White<br>
</div>
<div
style="font-family: times new roman,new york,times,serif; font-size: 12pt;"><br>
<div style="font-family: arial,helvetica,sans-serif; font-size: 10pt;">-----
Original Message ----<br>
From: Christophe <a class="moz-txt-link-rfc2396E" href="mailto:anaema_ml@yahoo.fr">&lt;anaema_ml@yahoo.fr&gt;</a><br>
To: <a class="moz-txt-link-abbreviated" href="mailto:mechanize-users@rubyforge.org">mechanize-users@rubyforge.org</a><br>
Sent: Thursday, July 17, 2008 3:42:23 AM<br>
Subject: [Mechanize-users] Convert data to utf-8<br>
<br>
Hello, I'm trying to find a way to convert everything returned by <br>
mechanize to UTF-8, no matter whether the original page is UTF-8 or ISO-8859-1, and
I <br>
really don't know where to start...<br>
<br>
agent = WWW::Mechanize.new { |a| a.log = <br>
Logger.new(File::join(RAILS_ROOT, "log/mechanize.log")) }<br>
one_page = agent.get("<a moz-do-not-send="true" target="_blank"
href="http://www.google.fr">www.google.fr</a>")<br>
<br>
My first problem is that the encoding of one_page should be UTF-8 (as stated
by <br>
Firefox's page properties); instead, one_page.content_type is
"text/html; <br>
charset=ISO-8859-1" and displaying the text content shows garbled accents.<br>
Second problem: when scraping data from a REAL ISO-8859-1 website, how
<br>
should I convert it to UTF-8?<br>
<br>
Mechanize 0.7.6, ruby 1.8.5, CentOS with utf-8 console<br>
<br>
Thanks<br>
<br>
_______________________________________________<br>
Mechanize-users mailing list<br>
<a moz-do-not-send="true"
ymailto="mailto:Mechanize-users@rubyforge.org"
href="mailto:Mechanize-users@rubyforge.org">Mechanize-users@rubyforge.org</a><br>
<a moz-do-not-send="true"
href="http://rubyforge.org/mailman/listinfo/mechanize-users"
target="_blank">http://rubyforge.org/mailman/listinfo/mechanize-users</a><br>
</div>
</div>
</div>
<br>
</blockquote>
<br>
</body>
</html>