From mikemondragon at gmail.com Mon Feb 4 15:06:50 2008
From: mikemondragon at gmail.com (Mike Mondragon)
Date: Mon, 4 Feb 2008 12:06:50 -0800
Subject: [Mechanize-users] Weird error downloading a gzip'ed file
In-Reply-To:
References:
Message-ID: <967d3b9a0802041206m3025279euff337e9513d42911@mail.gmail.com>

On 11/12/07, gmoraes wrote:
> Hi all,
>
> I've been using mechanize for a while and it rocks. Docs are pretty clear
> and so far I've been able to do it on my own.
> However, I'm stuck in a weird situation in a script to download my contact
> list from hotmail.
> I've used Firebug to check all urls, and tested it by hand while logged in
> via browser.
> Even in the script everything works well until the last 'agent.get_file',
> which gets stuck with a weird error:
>
> ------ snip ------
> $ ruby msn-scrap.rb
> # URL:http://by124w.bay124.mail.live.com/mail/TodayLight.aspx?&n=1573603203&gs=true
>
>
> "http://by124w.bay124.mail.live.com/mail/GetContacts.aspx"
> Err: unexpected end of file
> Trace:
> /usr/lib/ruby/1.8/mechanize.rb:372:in `read'
> /usr/lib/ruby/1.8/mechanize.rb:372:in `fetch_page'
> /usr/lib/ruby/1.8/net/http.rb:1050:in `request'
> /usr/lib/ruby/1.8/net/http.rb:2133:in `reading_body'
> /usr/lib/ruby/1.8/net/http.rb:1049:in `request'
> /usr/lib/ruby/1.8/mechanize.rb:345:in `fetch_page'
> /usr/lib/ruby/1.8/net/http.rb:543:in `start'
> /usr/lib/ruby/1.8/mechanize.rb:339:in `fetch_page'
> /usr/lib/ruby/1.8/mechanize.rb:139:in `get'
> /usr/lib/ruby/1.8/mechanize.rb:146:in `get_file'
> msn-scrap.rb:32

I just wanted to follow up that I experienced this same issue when
scraping Hotmail. There is a form on
/mail/options.aspx?subsection=26&n=XXXXX that when posted will return
a CSV file of your contacts, the response header has attachment with a
content type of text/csv. But when you mimic the interaction with
Mechanize the underlying Net::HTTP will read a number of bytes then
unexpectedly raise an eof exception.
Anyway, Hotmail seems to pretty up their own CSV as HTML on this page:
/mail/PrintShell.aspx?type=contact
and Mechanize can fetch that without any problems and then you can use
Hpricot to get at contact attributes. That is how the Blackbook Gem
is handling Hotmail.

Blackbook Gem: http://rubyforge.org/frs/?group_id=4311

--
Mike Mondragon
Work> http://sas.quat.ch/
Blog> http://blog.mondragon.cc/

From aaron at tenderlovemaking.com Mon Feb 4 15:28:26 2008
From: aaron at tenderlovemaking.com (Aaron Patterson)
Date: Mon, 4 Feb 2008 12:28:26 -0800
Subject: [Mechanize-users] Weird error downloading a gzip'ed file
In-Reply-To: <967d3b9a0802041206m3025279euff337e9513d42911@mail.gmail.com>
References: <967d3b9a0802041206m3025279euff337e9513d42911@mail.gmail.com>
Message-ID: <20080204202826.GA7282@mac-mini.lan>

On Mon, Feb 04, 2008 at 12:06:50PM -0800, Mike Mondragon wrote:
> On 11/12/07, gmoraes wrote:
> > Hi all,
> >
> > I've been using mechanize for a while and it rocks. Docs are pretty clear
> > and so far I've been able to do it on my own.
> > However, I'm stuck in a weird situation in a script to download my contact
> > list from hotmail.
> > I've used Firebug to check all urls, and tested it by hand while logged in
> > via browser.
> > Even in the script everything works well until the last 'agent.get_file',
> > which gets stuck with a weird error:
> >
> > ------ snip ------
> > $ ruby msn-scrap.rb
> > # URL:http://by124w.bay124.mail.live.com/mail/TodayLight.aspx?&n=1573603203&gs=true
> >
> >
> > "http://by124w.bay124.mail.live.com/mail/GetContacts.aspx"
> > Err: unexpected end of file
> > Trace:
> > /usr/lib/ruby/1.8/mechanize.rb:372:in `read'
> > /usr/lib/ruby/1.8/mechanize.rb:372:in `fetch_page'
> > /usr/lib/ruby/1.8/net/http.rb:1050:in `request'
> > /usr/lib/ruby/1.8/net/http.rb:2133:in `reading_body'
> > /usr/lib/ruby/1.8/net/http.rb:1049:in `request'
> > /usr/lib/ruby/1.8/mechanize.rb:345:in `fetch_page'
> > /usr/lib/ruby/1.8/net/http.rb:543:in `start'
> > /usr/lib/ruby/1.8/mechanize.rb:339:in `fetch_page'
> > /usr/lib/ruby/1.8/mechanize.rb:139:in `get'
> > /usr/lib/ruby/1.8/mechanize.rb:146:in `get_file'
> > msn-scrap.rb:32
>
> I just wanted to follow up that I experienced this same issue when
> scraping Hotmail. There is a form on
> /mail/options.aspx?subsection=26&n=XXXXX that when posted will return
> a CSV file of your contacts, the response header has attachment with a
> content type of text/csv. But when you mimic the interaction with
> Mechanize the underlying Net::HTTP will read a number of bytes then
> unexpectedly raise an eof exception.
>
> Anyway, Hotmail seems to pretty up their own CSV as HTML on this page:
> /mail/PrintShell.aspx?type=contact
> and Mechanize can fetch that without any problems and then you can use
> Hpricot to get at contact attributes. That is how the Blackbook Gem
> is handling Hotmail.
>
> Blackbook Gem: http://rubyforge.org/frs/?group_id=4311

I think I've finally tracked down this error (thanks to postmodern).
It's a bug in net/http. I've submitted a patch for ruby here:

http://rubyforge.org/tracker/index.php?func=detail&aid=17778&group_id=426&atid=1700

And I'll add a monkey patch to mechanize to fix this in 0.7.1.
--
Aaron Patterson
http://tenderlovemaking.com/

From doreper at gmail.com Tue Feb 5 18:34:44 2008
From: doreper at gmail.com (Dave Oreper)
Date: Tue, 5 Feb 2008 18:34:44 -0500
Subject: [Mechanize-users] Why does ClientForm check if self._current_form == self._global_form when closing select tags?
Message-ID:

Mechanize: __version__ = (0, 1, 8, "b", None) # 0.1.8b

Why does ClientForm.py return instead of closing the option and select tag
when self._current_form == self._global_form?

ClientForm.py:
    def end_select(self):
        debug("")
        if self._current_form is self._global_form:
            return
        if self._option is not None:
            self._end_option()
        self._select = None

When there is more than one select statement outside of a form (legal html),
ClientForm (improperly) raises the nested SELECTs error because the first
select is never closed. This is due to the fact that during __init__ of
_AbstractFormParser, self._current_form = self._global_form = self.forms[0].
When end_select is called, self._current_form == self._global_form because
there are no forms. We return instead of closing the option tag and the
select tag. This is a regression from __version__ = (0, 0, 12, "a", None) #
0.0.12a.

Under what circumstances would we want to return instead of closing the tag?

The parser works correctly if end_select is changed to:
ClientForm.py:
    def end_select(self):
        debug("")
        if self._option is not None:
            self._end_option()
        self._select = None

Example form:
NOTE: This form contains valid html however upon parse, nested SELECTs error
is raised.

-DaveO
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/mechanize-users/attachments/20080205/f967eb3e/attachment-0001.html

From aaron at tenderlovemaking.com Tue Feb 5 23:16:48 2008
From: aaron at tenderlovemaking.com (Aaron Patterson)
Date: Tue, 5 Feb 2008 20:16:48 -0800
Subject: [Mechanize-users] Why does ClientForm check if self._current_form == self._global_form when closing select tags?
In-Reply-To:
References:
Message-ID: <20080206041648.GA23483@mac-mini.lan>

On Tue, Feb 05, 2008 at 06:34:44PM -0500, Dave Oreper wrote:
> Mechanize: __version__ = (0, 1, 8, "b", None) # 0.1.8b
>
> Why does ClientForm.py return instead of closing the option and select tag
> when self._current_form == self._global_form?
>
> ClientForm.py:
>     def end_select(self):
>         debug("")
>         if self._current_form is self._global_form:
>             return
>         if self._option is not None:
>             self._end_option()
>         self._select = None
>
> When there is more than one select statement outside of a form (legal html),
> ClientForm (improperly) raises the nested SELECTs error because the first
> select is never closed. This is due to the fact that during __init__ of
> _AbstractFormParser, self._current_form = self._global_form = self.forms[0].
> When end_select is called, self._current_form == self._global_form because
> there are no forms. We return instead of closing the option tag and the
> select tag. This is a regression from __version__ = (0, 0, 12, "a", None) #
> 0.0.12a.
>
> Under what circumstances would we want to return instead of closing the tag?
>
> The parser works correctly if end_select is changed to:
> ClientForm.py:
>     def end_select(self):
>         debug("")
>         if self._option is not None:
>             self._end_option()
>         self._select = None
>
> Example form:
> NOTE: This form contains valid html however upon parse, nested SELECTs error
> is raised.
>

Try 'gem install mechanize'

--
Aaron Patterson
http://tenderlovemaking.com/

From wynst.uei at gmail.com Mon Feb 11 00:40:14 2008
From: wynst.uei at gmail.com (Cho Cogo)
Date: Mon, 11 Feb 2008 12:40:14 +0700
Subject: [Mechanize-users] [ask] how to make mechanize download using Net::SSH SOCKS5
Message-ID:

I have a local ssh session, how can i tell mechanize to use that
connection, instead standard net::http. I've tried to setup the proxy
but it wouldnt work.

Thanks.
WyNst

From aaron at tenderlovemaking.com Mon Feb 11 14:03:07 2008
From: aaron at tenderlovemaking.com (Aaron Patterson)
Date: Mon, 11 Feb 2008 11:03:07 -0800
Subject: [Mechanize-users] [ask] how to make mechanize download using Net::SSH SOCKS5
In-Reply-To:
References:
Message-ID: <20080211190306.GA15112@mac-mini.lan>

On Mon, Feb 11, 2008 at 12:40:14PM +0700, Cho Cogo wrote:
> I have a local ssh session, how can i tell mechanize to use that
> connection, instead standard net::http. I've tried to setup the proxy
> but it wouldnt work.

I don't understand. Are you using ssh to forward a port?

--
Aaron Patterson
http://tenderlovemaking.com/

From mjerome at gmail.com Fri Feb 15 04:26:30 2008
From: mjerome at gmail.com (Michael Jerome)
Date: Fri, 15 Feb 2008 09:26:30 +0000
Subject: [Mechanize-users] Performance - can anyone suggest why this is slow?
Message-ID:

Hello All

This is my first post to mechanize-users at rubyforge.org so be gentle!

Please take a look at the code and output below. The code uses Mechanize to
load two web pages; the output shows the time before and after loading each
page. Both pages load successfully but while http://www.google.com takes
less than a second, http://tinyurl.com/7pqpx is taking about THREE
MINUTES!!!

I realise that the second page is "heavier" than the first and I realise
there are HTTP and META redirects but something isn't right if it's taking
three minutes ... Firefox loads http://tinyurl.com/7pqpx in seconds.

Can anyone suggest what's causing it to be so slow? Is it possible the site
is detecting I'm using Mechanize and slowing it down intentionally?

Any help or advice is really appreciated.
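
One way to narrow down a delay like this is to time each step on its own rather than only the whole fetch. The following is a minimal sketch, assuming nothing beyond Ruby's standard `Benchmark` module; the `timed` helper is hypothetical (it is not part of Mechanize), and the `sleep` is only a stand-in for a real call such as `agent.get(url)`.

```ruby
require 'benchmark'

# Hypothetical helper (not part of Mechanize): runs the block, prints how
# long it took, and returns [block_result, elapsed_seconds].
def timed(label)
  result = nil
  elapsed = Benchmark.realtime { result = yield }
  puts format('%-20s %.3fs', label, elapsed)
  [result, elapsed]
end

# Stand-in for a real fetch; replace the block body with agent.get(url).
page, seconds = timed('simulated fetch') do
  sleep 0.05
  :page
end
```

If the agent's redirect and meta-refresh following can be switched off (an assumption about the installed Mechanize version), each hop could be wrapped in its own `timed` call to see which one accounts for the wait.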
Regards
Mike

#################
This code:

require 'rubygems'
require 'mechanize'

["http://www.google.com", "http://tinyurl.com/7pqpx"].each do |url|
  agent = WWW::Mechanize.new
  agent.user_agent_alias = 'Windows IE 7'
  agent.follow_meta_refresh = true

  puts "Started getting #{url} at #{Time.now}"
  page = agent.get( url )
  puts "Completed getting #{url} at #{Time.now}"
end

#################
Gives this output:

Started getting http://www.google.com at Fri Feb 15 09:10:15
Completed getting http://www.google.com at Fri Feb 15 09:10:15
Started getting http://tinyurl.com/7pqpx at Fri Feb 15 09:10:15
Completed getting http://tinyurl.com/7pqpx at Fri Feb 15 09:13:16

#################

From wynst.uei at gmail.com Tue Feb 26 06:23:46 2008
From: wynst.uei at gmail.com (Cho Cogo)
Date: Tue, 26 Feb 2008 18:23:46 +0700
Subject: [Mechanize-users] [ask] how to make mechanize download using Net::SSH SOCKS5
In-Reply-To:
References:
Message-ID:

On 2/11/08, Cho Cogo wrote:
> I have a local ssh session, how can i tell mechanize to use that
> connection, instead standard net::http. I've tried to setup the proxy
> but it wouldnt work.

Hi, after a bit of googling I found the solution:

------
For a SOCKS5 proxy, get SOCKSify-Ruby:

git clone http://cthulhu.c3d2.de/~astro/git/socksify-ruby.git/

This is a small drop-in to redirect any Ruby TCPClient connect through a
SOCKS5 proxy server.

http://cthulhu.c3d2.de/~astro/gitweb/?p=socksify-ruby.git;a=tree
-------

From charleseharvey at gmail.com Fri Feb 29 15:40:01 2008
From: charleseharvey at gmail.com (Charles Harvey)
Date: Fri, 29 Feb 2008 15:40:01 -0500
Subject: [Mechanize-users] problems getting back full data from post to https://www.sss.gov/RegVer/wfVerification.aspx
Message-ID: <81894f030802291240n796750aeheb106b274876ecf@mail.gmail.com>

Why can't I get back the same full page as I do in firefox after a post?
I am trying to post data to https://www.sss.gov/RegVer/wfVerification.aspx

It is accepting the post, for both a valid record and a not found record,
but the html code I am getting back in mechanize is not as complete as the
code I am getting back in firefox view source. In firefox the page works
without javascript.

I read through the forum and tried google, but could not find an answer.
Any help would be greatly appreciated.

Best Regards-
Charles

pastie version of the same here
http://pastie.caboo.se/private/0qsxcjzwunvijds0quuonq

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac FireFox'
agent.redirect_ok = true
page = agent.get('https://www.sss.gov/RegVer/wfVerification.aspx')
sss_form = page.form('aspnetForm')
sss_form.[]=("_ctl0:ContentPlaceHolder1:tbSSAN", 555555999)
sss_form.[]=("_ctl0:ContentPlaceHolder1:tbLastName", "Harvey")
sss_form.[]=("_ctl0:ContentPlaceHolder1:tbDOB", "03251966")
page = agent.submit(sss_form, sss_form.buttons[1])

##########################################

If I do it with my real data I get a line in the returned code that says it
was a success

>> sss_form.[]=("_ctl0:ContentPlaceHolder1:tbSSAN", 510999999)
=> 510565972
>> sss_form.[]=("_ctl0:ContentPlaceHolder1:tbLastName", "Smith")
=> "Harvey"
>> sss_form.[]=("_ctl0:ContentPlaceHolder1:tbDOB", "02021966")
=> "02051966"
>> page = agent.submit(sss_form, sss_form.buttons[1])
=> #} {meta} {title "\r\n\tSelective Service System: Verification Receipt\r\n"} {iframes}

##########################################

If I submit it with a fake name and SS# I get a field as accepting the post
but returning as expected a no record found message.
>> page = agent.submit(sss_form, sss_form.buttons[1])
=> #} {meta} {title "\r\n\tSelective Service System: Registration Error\r\n"}

But with either one I do not get back the same information in my firefox
browser source namely:

******* From page source in firefox on a successful post - edited **********

Last Name: Harvey
Social Security Number: *** - ** - 5999
Date of Birth: 03/25/1966
Selective Service Number: 66-0175555-2
Date of Registration: 4/9/1984

*********************************************************************************************************

******* From page source in firefox on a post that does not have a valid record - edited **********

Sorry.
Based on the information you submitted (information listed below), a registration record cannot be found for this individual.

If you made a mistake when entering data, please try a New Search. If you entered the data correctly, there are several reasons why the registration may not be verifiable at this time. Please dial 1-847-688-3117 for further information. (2/29/2008 11:09:16 AM)

Last Name: Smith
Social Security Number: *** - ** - 5999
Date of Birth: 03/25/1966
pastie version of the same here
http://pastie.caboo.se/private/0qsxcjzwunvijds0quuonq
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/mechanize-users/attachments/20080229/cd7a4283/attachment-0001.html
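
A side note on the `sss_form.[]=(name, value)` calls in the script in this message: that is ordinary Ruby index assignment written in explicit method-call form, so it should behave the same as `sss_form[name] = value`. A minimal sketch of the equivalence follows; `FakeForm` is a hypothetical stand-in used only for illustration, not the Mechanize form class, and the field names are shortened.

```ruby
# Hypothetical stand-in for a form object; NOT the Mechanize form class.
class FakeForm
  def initialize
    @fields = {}
  end

  # Index assignment: invoked by both form[name] = value
  # and form.[]=(name, value).
  def []=(name, value)
    @fields[name] = value
  end

  # Index lookup: form[name]
  def [](name)
    @fields[name]
  end
end

form = FakeForm.new
form.[]=("tbLastName", "Harvey")  # explicit method-call spelling
form["tbDOB"] = "03251966"        # usual spelling of the same method

puts form["tbLastName"]
puts form["tbDOB"]
```

The sketch only shows that the two spellings dispatch to the same method; it says nothing about why the server response differs from what Firefox shows.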