From mikemondragon at gmail.com Mon Feb 4 15:06:50 2008
From: mikemondragon at gmail.com (Mike Mondragon)
Date: Mon, 4 Feb 2008 12:06:50 -0800
Subject: [Mechanize-users] Weird error downloading a gzip'ed file
In-Reply-To:
References:
Message-ID: <967d3b9a0802041206m3025279euff337e9513d42911@mail.gmail.com>
On 11/12/07, gmoraes wrote:
> Hi all,
>
> I've been using mechanize for a while and it rocks. Docs are pretty clear
> and so far I've been able to do it on my own.
> However, I'm stuck in a weird situation in a script to download my contact
> list from hotmail.
> I've used Firebug to check all urls, and tested it by hand while logged in
> via browser.
> Even in the script everything works well until the last 'agent.get_file',
> which gets stuck with a weird error:
>
> ------ snip ------
> $ ruby msn-scrap.rb
> # URL:http://by124w.bay124.mail.live.com/mail/TodayLight.aspx?&n=1573603203&gs=true
> >
> "http://by124w.bay124.mail.live.com/mail/GetContacts.aspx"
> Err: unexpected end of file
> Trace:
> /usr/lib/ruby/1.8/mechanize.rb:372:in `read'
> /usr/lib/ruby/1.8/mechanize.rb:372:in `fetch_page'
> /usr/lib/ruby/1.8/net/http.rb:1050:in `request'
> /usr/lib/ruby/1.8/net/http.rb:2133:in `reading_body'
> /usr/lib/ruby/1.8/net/http.rb:1049:in `request'
> /usr/lib/ruby/1.8/mechanize.rb:345:in `fetch_page'
> /usr/lib/ruby/1.8/net/http.rb:543:in `start'
> /usr/lib/ruby/1.8/mechanize.rb:339:in `fetch_page'
> /usr/lib/ruby/1.8/mechanize.rb:139:in `get'
> /usr/lib/ruby/1.8/mechanize.rb:146:in `get_file'
> msn-scrap.rb:32
I just wanted to follow up that I experienced this same issue when
scraping Hotmail. There is a form on
/mail/options.aspx?subsection=26&n=XXXXX that, when posted, returns
a CSV file of your contacts; the response headers mark it as an
attachment with a content type of text/csv. But when you mimic the
interaction with Mechanize, the underlying Net::HTTP reads a number of
bytes and then unexpectedly raises an end-of-file exception.
Anyway, Hotmail seems to pretty up their own CSV as HTML on this page:
/mail/PrintShell.aspx?type=contact
and Mechanize can fetch that without any problems and then you can use
Hpricot to get at contact attributes. That is how the Blackbook Gem
is handling Hotmail.
Blackbook Gem: http://rubyforge.org/frs/?group_id=4311
--
Mike Mondragon
Work> http://sas.quat.ch/
Blog> http://blog.mondragon.cc/
From aaron at tenderlovemaking.com Mon Feb 4 15:28:26 2008
From: aaron at tenderlovemaking.com (Aaron Patterson)
Date: Mon, 4 Feb 2008 12:28:26 -0800
Subject: [Mechanize-users] Weird error downloading a gzip'ed file
In-Reply-To: <967d3b9a0802041206m3025279euff337e9513d42911@mail.gmail.com>
References:
<967d3b9a0802041206m3025279euff337e9513d42911@mail.gmail.com>
Message-ID: <20080204202826.GA7282@mac-mini.lan>
On Mon, Feb 04, 2008 at 12:06:50PM -0800, Mike Mondragon wrote:
> On 11/12/07, gmoraes wrote:
> > Hi all,
> >
> > I've been using mechanize for a while and it rocks. Docs are pretty clear
> > and so far I've been able to do it on my own.
> > However, I'm stuck in a weird situation in a script to download my contact
> > list from hotmail.
> > I've used Firebug to check all urls, and tested it by hand while logged in
> > via browser.
> > Even in the script everything works well until the last 'agent.get_file',
> > which gets stuck with a weird error:
> >
> > ------ snip ------
> > $ ruby msn-scrap.rb
> > # > URL:http://by124w.bay124.mail.live.com/mail/TodayLight.aspx?&n=1573603203&gs=true
> > >
> > "http://by124w.bay124.mail.live.com/mail/GetContacts.aspx"
> > Err: unexpected end of file
> > Trace:
> > /usr/lib/ruby/1.8/mechanize.rb:372:in `read'
> > /usr/lib/ruby/1.8/mechanize.rb:372:in `fetch_page'
> > /usr/lib/ruby/1.8/net/http.rb:1050:in `request'
> > /usr/lib/ruby/1.8/net/http.rb:2133:in `reading_body'
> > /usr/lib/ruby/1.8/net/http.rb:1049:in `request'
> > /usr/lib/ruby/1.8/mechanize.rb:345:in `fetch_page'
> > /usr/lib/ruby/1.8/net/http.rb:543:in `start'
> > /usr/lib/ruby/1.8/mechanize.rb:339:in `fetch_page'
> > /usr/lib/ruby/1.8/mechanize.rb:139:in `get'
> > /usr/lib/ruby/1.8/mechanize.rb:146:in `get_file'
> > msn-scrap.rb:32
>
> I just wanted to follow up that I experienced this same issue when
> scraping Hotmail. There is a form on
> /mail/options.aspx?subsection=26&n=XXXXX that, when posted, returns
> a CSV file of your contacts; the response headers mark it as an
> attachment with a content type of text/csv. But when you mimic the
> interaction with Mechanize, the underlying Net::HTTP reads a number of
> bytes and then unexpectedly raises an end-of-file exception.
>
> Anyway, Hotmail seems to pretty up their own CSV as HTML on this page:
> /mail/PrintShell.aspx?type=contact
> and Mechanize can fetch that without any problems and then you can use
> Hpricot to get at contact attributes. That is how the Blackbook Gem
> is handling Hotmail.
>
> Blackbook Gem: http://rubyforge.org/frs/?group_id=4311
I think I've finally tracked down this error (thanks to postmodern).
It's a bug in net/http. I've submitted a patch for Ruby here:
http://rubyforge.org/tracker/index.php?func=detail&aid=17778&group_id=426&atid=1700
And I'll add a monkey patch to mechanize to fix this in 0.7.1.
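
For readers who have not met this kind of failure: the sketch below shows, using only Ruby's standard Zlib library, how a gzip body that ends before its compressed stream is finished makes the reader raise instead of returning data. It is not a reproduction of the net/http bug or of the submitted patch; the CSV rows and the `gunzip` helper are made up for illustration, and cutting the payload in half is only an assumed way of simulating a short read.

```ruby
require 'zlib'
require 'stringio'

# Decompress a gzip-encoded string fully in memory.
def gunzip(bytes)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

# Hypothetical CSV body standing in for the contact export.
csv = (1..200).map { |i| "contact#{i},user#{i}@example.com\n" }.join
payload = Zlib.gzip(csv)

# A complete body round-trips cleanly.
complete = gunzip(payload)

# Simulate a body that was cut off halfway through the compressed stream.
truncated = payload.byteslice(0, payload.bytesize / 2)
error = begin
  gunzip(truncated)
  nil
rescue Zlib::Error => e
  e
end

puts "complete body: #{complete.bytesize} bytes"
puts "truncated body: #{error.class}: #{error.message}" if error
```

Rescuing the broad `Zlib::Error` keeps the sketch independent of exactly which subclass a given Ruby version raises for truncated input.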
--
Aaron Patterson
http://tenderlovemaking.com/
From doreper at gmail.com Tue Feb 5 18:34:44 2008
From: doreper at gmail.com (Dave Oreper)
Date: Tue, 5 Feb 2008 18:34:44 -0500
Subject: [Mechanize-users] Why does ClientForm check if self._current_form
== self._global_form when closing select tags?
Message-ID:
Mechanize: __version__ = (0, 1, 8, "b", None) # 0.1.8b
Why does ClientForm.py return instead of closing the option and select tag
when self._current_form == self._global_form?
ClientForm.py:
def end_select(self):
    debug("")
    if self._current_form is self._global_form:
        return
    if self._option is not None:
        self._end_option()
    self._select = None
When there is more than one select statement outside of a form (legal html),
ClientForm (improperly) raises the nested SELECTs error because the first
select is never closed. This is due to the fact that during __init__ of
_AbstractFormParser, self._current_form = self._global_form = self.forms[0].
When end_select is called, self._current_form == self._global_form because
there are no forms. We return instead of closing the option tag and the
select tag. This is a regression from __version__ = (0, 0, 12, "a", None) #
0.0.12a.
Under what circumstances would we want to return instead of closing the tag?
The parser works correctly if end_select is changed to:
ClientForm.py:
def end_select(self):
    debug("")
    if self._option is not None:
        self._end_option()
    self._select = None
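
The behaviour described above can be modelled with a toy state tracker. The sketch is written in Ruby, this list's language, and is not ClientForm code: `ToySelectTracker`, its `guard_global_form` switch, and `parse_two_selects` are hypothetical names, and the assumption that opening a SELECT while another is still recorded as open raises "nested SELECTs" is taken from the description in this message rather than from the ClientForm source.

```ruby
# Toy model of the SELECT bookkeeping described above; not ClientForm itself.
class ToySelectTracker
  class ParseError < StandardError; end

  def initialize(guard_global_form:)
    @guard_global_form = guard_global_form
    @global_form = Object.new     # stands in for self._global_form
    @current_form = @global_form  # no <form> tag has been seen yet
    @select = nil
  end

  # Assumed behaviour: a SELECT that opens while another is open is an error.
  def start_select
    raise ParseError, "nested SELECTs" unless @select.nil?
    @select = :open
  end

  def end_select
    # The early return under discussion: skip the cleanup outside any <form>.
    return if @guard_global_form && @current_form.equal?(@global_form)
    @select = nil
  end
end

# Feed two sibling SELECTs that sit outside any form.
def parse_two_selects(guard_global_form:)
  tracker = ToySelectTracker.new(guard_global_form: guard_global_form)
  2.times do
    tracker.start_select
    tracker.end_select
  end
  "ok"
rescue ToySelectTracker::ParseError => e
  e.message
end

puts parse_two_selects(guard_global_form: true)   # with the early return
puts parse_two_selects(guard_global_form: false)  # with the early return removed
```

With the guard enabled the first SELECT is never marked closed, so the second one trips the nesting check; without it both close normally.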
Example form:
NOTE: This form contains valid html however upon parse, nested SELECTs error
is raised.
-DaveO
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/mechanize-users/attachments/20080205/f967eb3e/attachment-0001.html
From aaron at tenderlovemaking.com Tue Feb 5 23:16:48 2008
From: aaron at tenderlovemaking.com (Aaron Patterson)
Date: Tue, 5 Feb 2008 20:16:48 -0800
Subject: [Mechanize-users] Why does ClientForm check
if self._current_form == self._global_form when closing select tags?
In-Reply-To:
References:
Message-ID: <20080206041648.GA23483@mac-mini.lan>
On Tue, Feb 05, 2008 at 06:34:44PM -0500, Dave Oreper wrote:
> Mechanize: __version__ = (0, 1, 8, "b", None) # 0.1.8b
>
> Why does ClientForm.py return instead of closing the option and select tag
> when self._current_form == self._global_form?
>
> ClientForm.py:
> def end_select(self):
>     debug("")
>     if self._current_form is self._global_form:
>         return
>     if self._option is not None:
>         self._end_option()
>     self._select = None
>
> When there is more than one select statement outside of a form (legal html),
> ClientForm (improperly) raises the nested SELECTs error because the first
> select is never closed. This is due to the fact that during __init__ of
> _AbstractFormParser, self._current_form = self._global_form = self.forms[0].
> When end_select is called, self._current_form == self._global_form because
> there are no forms. We return instead of closing the option tag and the
> select tag. This is a regression from __version__ = (0, 0, 12, "a", None) #
> 0.0.12a.
>
> Under what circumstances would we want to return instead of closing the tag?
>
> The parser works correctly if end_select is changed to:
> ClientForm.py:
> def end_select(self):
>     debug("")
>     if self._option is not None:
>         self._end_option()
>     self._select = None
>
>
>
> Example form:
> NOTE: This form contains valid html however upon parse, nested SELECTs error
> is raised.
>
Try 'gem install mechanize'
--
Aaron Patterson
http://tenderlovemaking.com/
From wynst.uei at gmail.com Mon Feb 11 00:40:14 2008
From: wynst.uei at gmail.com (Cho Cogo)
Date: Mon, 11 Feb 2008 12:40:14 +0700
Subject: [Mechanize-users] [ask] how to make mechanize download using
Net::SSH SOCKS5
Message-ID:
I have a local ssh session. How can I tell Mechanize to use that
connection instead of the standard Net::HTTP? I've tried to set up the
proxy, but it wouldn't work.
Thanks.
WyNst
From aaron at tenderlovemaking.com Mon Feb 11 14:03:07 2008
From: aaron at tenderlovemaking.com (Aaron Patterson)
Date: Mon, 11 Feb 2008 11:03:07 -0800
Subject: [Mechanize-users] [ask] how to make mechanize download
using Net::SSH SOCKS5
In-Reply-To:
References:
Message-ID: <20080211190306.GA15112@mac-mini.lan>
On Mon, Feb 11, 2008 at 12:40:14PM +0700, Cho Cogo wrote:
> I have a local ssh session. How can I tell Mechanize to use that
> connection instead of the standard Net::HTTP? I've tried to set up the
> proxy, but it wouldn't work.
I don't understand. Are you using ssh to forward a port?
--
Aaron Patterson
http://tenderlovemaking.com/
From mjerome at gmail.com Fri Feb 15 04:26:30 2008
From: mjerome at gmail.com (Michael Jerome)
Date: Fri, 15 Feb 2008 09:26:30 +0000
Subject: [Mechanize-users] Performance - can anyone suggest why this is slow?
Message-ID:
Hello All
This is my first post to mechanize-users at rubyforge.org so be gentle!
Please take a look at the code and output below. The code uses
Mechanize to load two web pages; the output shows the time before and
after loading each page. Both pages load successfully but while
http://www.google.com takes less than a second,
http://tinyurl.com/7pqpx is taking about THREE MINUTES!!!
I realise that the second page is "heavier" than the first and I
realise there are HTTP and META redirects but something isn't right if
it's taking three minutes ... Firefox loads http://tinyurl.com/7pqpx
in seconds.
Can anyone suggest what's causing it to be so slow? Is it possible
the site is detecting I'm using Mechanize and slowing it down
intentionally? Any help or advice is really appreciated.
Regards
Mike
################# This code:
require 'rubygems'
require 'mechanize'

["http://www.google.com", "http://tinyurl.com/7pqpx"].each do |url|
  agent = WWW::Mechanize.new
  agent.user_agent_alias = 'Windows IE 7'
  agent.follow_meta_refresh = true
  puts "Started getting #{url} at #{Time.now}"
  page = agent.get( url )
  puts "Completed getting #{url} at #{Time.now}"
end
################# Gives this output:
Started getting http://www.google.com at Fri Feb 15 09:10:15
Completed getting http://www.google.com at Fri Feb 15 09:10:15
Started getting http://tinyurl.com/7pqpx at Fri Feb 15 09:10:15
Completed getting http://tinyurl.com/7pqpx at Fri Feb 15 09:13:16
#################
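
One way to narrow down where the time goes is to measure individual steps with a monotonic clock rather than comparing `Time.now` stamps, which only resolve to the second in the output above. The helper below is a minimal sketch using only the standard library; `timed` is a hypothetical name, and the `sleep` is a stand-in for a real call such as `agent.get(url)`.

```ruby
# Run the block and return [result, elapsed_seconds] using a monotonic clock.
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

# Stand-in workload; in the script above the block would wrap agent.get(url).
result, elapsed = timed { sleep 0.05; :done }
puts format("finished with %p in %.3fs", result, elapsed)
```

Wrapping each request separately (the initial fetch, then each redirect if they are followed by hand) would show whether one hop accounts for the delay.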
From wynst.uei at gmail.com Tue Feb 26 06:23:46 2008
From: wynst.uei at gmail.com (Cho Cogo)
Date: Tue, 26 Feb 2008 18:23:46 +0700
Subject: [Mechanize-users] [ask] how to make mechanize download using
Net::SSH SOCKS5
In-Reply-To:
References:
Message-ID:
On 2/11/08, Cho Cogo wrote:
> I have a local ssh session. How can I tell Mechanize to use that
> connection instead of the standard Net::HTTP? I've tried to set up the
> proxy, but it wouldn't work.
Hi, after a bit of googling I found the solution:
------
For a SOCKS5 proxy, get SOCKSify-Ruby:
git clone http://cthulhu.c3d2.de/~astro/git/socksify-ruby.git/
This is a small drop-in that redirects any Ruby TCPSocket connection
through a SOCKS5 proxy server.
http://cthulhu.c3d2.de/~astro/gitweb/?p=socksify-ruby.git;a=tree
-------
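
A minimal sketch of how this might be wired up, under assumptions that are not stated in the thread: that the ssh session was started with dynamic forwarding (for example `ssh -D 1080 user@host`), so a SOCKS proxy listens on 127.0.0.1:1080, and that the socksify library is on the load path. The host, port, and the `route_tcp_through_socks` helper are illustrative; `TCPSocket.socks_server=` and `TCPSocket.socks_port=` are the setters socksify adds.

```ruby
# Assumed setup (not from the thread): `ssh -D 1080 user@host` is already
# running and exposes a SOCKS proxy on localhost:1080.
SOCKS_HOST = "127.0.0.1"
SOCKS_PORT = 1080

# Returns true if socksify was found and configured, false if it is missing.
def route_tcp_through_socks(host, port)
  require 'socksify'
  TCPSocket.socks_server = host
  TCPSocket.socks_port = port
  true
rescue LoadError
  false
end

configured = route_tcp_through_socks(SOCKS_HOST, SOCKS_PORT)
if configured
  puts "new TCPSocket connections will go via #{SOCKS_HOST}:#{SOCKS_PORT}"
else
  puts "socksify is not installed; connections are unchanged"
end
```

Because the redirection happens at the socket level, anything built on Net::HTTP, Mechanize included, should pick it up without further changes, provided the configuration runs before the first request.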
From charleseharvey at gmail.com Fri Feb 29 15:40:01 2008
From: charleseharvey at gmail.com (Charles Harvey)
Date: Fri, 29 Feb 2008 15:40:01 -0500
Subject: [Mechanize-users] problems getting back full data from post to
https://www.sss.gov/RegVer/wfVerification.aspx
Message-ID: <81894f030802291240n796750aeheb106b274876ecf@mail.gmail.com>
Why can't I get back the same full page as I do in Firefox after a post?
I am trying to post data to https://www.sss.gov/RegVer/wfVerification.aspx
The site accepts the post, for both a valid record and a not-found record,
but the HTML I get back in Mechanize is incomplete compared with what
Firefox shows in View Source.
In Firefox the page works without JavaScript.
I read through the forum and tried Google, but could not find an answer.
Any help would be greatly appreciated.
Best Regards- Charles
pastie version of the same here
http://pastie.caboo.se/private/0qsxcjzwunvijds0quuonq
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac FireFox'
agent.redirect_ok = true
page = agent.get('https://www.sss.gov/RegVer/wfVerification.aspx')
sss_form = page.form('aspnetForm')
# Fill in the ASP.NET form fields by name.
sss_form["_ctl0:ContentPlaceHolder1:tbSSAN"] = 555555999
sss_form["_ctl0:ContentPlaceHolder1:tbLastName"] = "Harvey"
sss_form["_ctl0:ContentPlaceHolder1:tbDOB"] = "03251966"
# Submit using the form's second button.
page = agent.submit(sss_form, sss_form.buttons[1])
######################################################################
If I do it with my real data I get a line in the returned code that says it
was a success
>> sss_form.[]=("_ctl0:ContentPlaceHolder1:tbSSAN", 510999999)
=> 510565972
>> sss_form.[]=("_ctl0:ContentPlaceHolder1:tbLastName", "Smith")
=> "Harvey"
>> sss_form.[]=("_ctl0:ContentPlaceHolder1:tbDOB", "02021966")
=> "02051966"
>> page = agent.submit(sss_form, sss_form.buttons[1])
=> #}
{meta}
{title "\r\n\tSelective Service System: Verification Receipt\r\n"}
{iframes}
##########################################
If I submit it with a fake name and SS#, the post is accepted and, as
expected, a no-record-found message comes back.
>> page = agent.submit(sss_form, sss_form.buttons[1])
=> #}
{meta}
{title "\r\n\tSelective Service System: Registration Error\r\n"}
But in neither case do I get back the information that appears in the
Firefox page source, namely:
******* From page source in firefox on a successful post - edited **********
Last Name: Harvey
Social Security Number: *** - ** - 5999
Date of Birth: 03/25/1966
Selective Service Number: 66-0175555-2
Date of Registration: 4/9/1984
*********************************************************************************************************
******* From page source in firefox on a post that does not have a valid record - edited **********
Sorry. Based on the information you submitted (information listed below),
a registration record cannot be found for this individual.
If you made a mistake when entering data, please try a New Search. If you
entered the data correctly, there are several reasons why the registration
may not be verifiable at this time.
Please dial 1-847-688-3117 for further information. (2/29/2008 11:09:16 AM)
Last Name: Smith
Social Security Number: *** - ** - 5999
Date of Birth: 03/25/1966
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/mechanize-users/attachments/20080229/cd7a4283/attachment-0001.html