From rdpoor at gmail.com Tue Dec 20 13:27:44 2011
From: rdpoor at gmail.com (Robert Poor)
Date: Tue, 20 Dec 2011 10:27:44 -0800
Subject: [Mechanize-users] Mechanize GETting twice without redirect?
Message-ID:

[Cross posted to Ruby on Rails Forum and Mechanize mailing list.]

I'm using Mechanize for page scraping (Ruby 1.9.2 / Rails 3.0.5 / Mechanize 2.0.1). I'm seeing a case where a single agent.get(url) generates two HTTP GETs. Why is this happening?

The response to the first GET is a 200 (no redirect) and doesn't have any meta-refresh. I don't see why Mechanize is issuing the second GET (which happens to be failing with an EOFError with Content-Length / body length mismatch).

Details: I'm using the nifty Charles web proxy debugger to monitor browser / server interactions.

=====
In the original browser + server exchange, I see:

Req: POST /login/Login HTTP/1.1
Rsp: sets two cookies + HTTP/1.1 302 Moved Temporarily =>
     https://online.nationalgridus.com/eservice_enu/

Req: GET /eservice_enu/ HTTP/1.1
Rsp: set a cookie + HTTP/1.1 200 OK

The body contains onLoad Javascript to set this.location = 'start.swe?SWECmd=Start'

Req: GET /eservice_enu/start.swe?SWECmd=Start HTTP/1.1
Rsp: sets four cookies + HTTP/1.1 200 OK

=====
In the mechanize + server exchange:

My code: page2 = agent.submit(login_form)

Req: POST /login/Login HTTP/1.1
Rsp: set two cookies + HTTP/1.1 302 Moved Temporarily =>
     https://online.nationalgridus.com/eservice_enu/

Req: GET /eservice_enu/ HTTP/1.1
Rsp: set a cookie + HTTP/1.1 200 OK

The body contains onLoad Javascript to set this.location = 'start.swe?SWECmd=Start', but Mechanize can't follow that automatically. So I do an agent.get() to emulate it:

My code: page3 = agent.get("https://online.nationalgridus.com/eservice_enu/start.swe?SWECmd=Start")

Req: GET /eservice_enu/start.swe?SWECmd=Start HTTP/1.1
Rsp: sets four cookies + HTTP/1.1 200 OK

Note that at this point both the user driven and mechanize driven interactions appear to be identical.
But Mechanize appears to generate another GET all by itself:

Req: GET /eservice_enu/start.swe?SWECmd=Start HTTP/1.1
Rsp: sets four cookies + HTTP/1.1 200 OK

... and this response throws an EOFError:

Content-Length (536) does not match response body length (524) - EOFError

=====
So: Why did Mechanize generate that last GET without me asking it to? Was the EOFError actually in the first GET and it's doing a retry? If so, how do I work around the length mismatch?

From rdpoor at gmail.com Tue Dec 20 23:41:10 2011
From: rdpoor at gmail.com (Robert Poor)
Date: Tue, 20 Dec 2011 20:41:10 -0800
Subject: [Mechanize-users] creating a Mechanize::Form from scratch...
Message-ID:

I need to POST a form that doesn't originate from a GET[*]. I know what belongs in the form. It's not clear to me how to create a Mechanize::Form from scratch -- the only way I can see to do it is GET a page that has a form on it, strip its contents and repopulate it. But is there an easier (or less surprising) way to do this?

- ff

[*] If you must know, the site I'm working with uses javascript to construct and post forms, so Mechanize doesn't have an obvious form to work with.

From drbrain at segment7.net Tue Dec 20 23:55:27 2011
From: drbrain at segment7.net (Eric Hodel)
Date: Tue, 20 Dec 2011 20:55:27 -0800
Subject: [Mechanize-users] creating a Mechanize::Form from scratch...
In-Reply-To:
References:
Message-ID: <3B51B566-D88F-4AD5-9BF0-4F2C90541CEF@segment7.net>

On Dec 20, 2011, at 8:41 PM, Robert Poor wrote:

> I need to POST a form that doesn't originate from a GET[*]. I know
> what belongs in the form. It's not clear to me how to create a
> Mechanize::Form from scratch -- the only way I can see to do it is GET
> a page that has a form on it, strip its contents and repopulate it.
> But is there an easier (or less surprising) way to do this?

You can place an HTML form in a file and load it from a file:// URL which has an action that points to the http:// URL to submit to.
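
[Editor's sketch of the file:// suggestion above. It is one possible reading of the advice, not code from the thread. The helper names write_form_file and submit_form_from_file, the example.com action URL and the field names are all hypothetical. It assumes a Unix-like path, so that "file://" plus the temp file path is a valid URL, and it assumes the agent only needs to respond to get and submit, as a Mechanize agent does.]

```ruby
require "tempfile"
require "cgi"

# Hypothetical helper (not part of Mechanize): writes a minimal HTML form
# whose action is an absolute http(s) URL, and returns [tempfile, file_url].
# The ".html" suffix is deliberate so the file is treated as an HTML page.
def write_form_file(action_url, fields)
  inputs = fields.map do |name, value|
    %(<input type="hidden" name="#{CGI.escapeHTML(name.to_s)}" ) +
      %(value="#{CGI.escapeHTML(value.to_s)}">)
  end
  html = <<~HTML
    <html><body>
    <form method="post" action="#{CGI.escapeHTML(action_url)}">
    #{inputs.join("\n")}
    </form>
    </body></html>
  HTML
  file = Tempfile.new(["form", ".html"])
  file.write(html)
  file.flush
  # Assumes a Unix-like absolute path such as /tmp/form123.html.
  [file, "file://#{file.path}"]
end

# Hypothetical helper: loads the local page through the agent, takes its
# first form, and submits it to the form's action URL.
def submit_form_from_file(agent, file_url)
  page = agent.get(file_url)
  form = page.forms.first
  agent.submit(form)
end

# Usage with a real agent (requires the mechanize gem and network access):
#   require "mechanize"
#   agent = Mechanize.new
#   file, url = write_form_file("http://example.com/submit", "user" => "me")
#   result_page = submit_form_from_file(agent, url)
```

Keep a reference to the returned Tempfile until the submit has finished, since the file is deleted once the object is garbage collected.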
From sodani at gmail.com Thu Dec 29 10:33:19 2011
From: sodani at gmail.com (shig odani)
Date: Thu, 29 Dec 2011 10:33:19 -0500
Subject: [Mechanize-users] deobfuscating javascript?
Message-ID:

Not sure if this post is quite appropriate for this list but wasn't sure where else to ask. I'm using mechanize to visit some pages that have obfuscated javascript and I'm wondering if there's some ruby or mechanize way to deobfuscate it or otherwise interact with the html elements which are being obfuscated by javascript.

For more background, please see http://www.labnol.org/software/deobfuscate-javascript/19815/