From astarr at wiredquote.com  Thu Aug  5 11:20:28 2010
From: astarr at wiredquote.com (Aaron Starr)
Date: Thu, 5 Aug 2010 08:20:28 -0700
Subject: [Mechanize-users] Redirection patch
Message-ID:

Hi, all,

I found that I had to make the following patch to Mechanize for one of the
sites I'm scraping:

    alias fetch_page_original_version fetch_page
    def fetch_page(params)
      params[:uri] = params[:uri].gsub(/^https?:/i) {|m| m.downcase } if String == params[:uri].class
      fetch_page_original_version(params)
    end

(Also, here: http://pastie.org/1077542)

The problem was that the site was returning a 302 redirect with a Location
header that looked like: httpS://www.blah-blah-blah... The weirdly
capitalized protocol was causing EOF errors, so it needed to be adjusted.

My versions:

    mechanize (1.0.0)
    ruby 1.8.7 (2008-08-11 patchlevel 72) [x86_64-linux]

Hopefully, this is helpful to someone.

Aaron

From mike at csa.net  Thu Aug  5 13:13:41 2010
From: mike at csa.net (Mike Dalessio)
Date: Thu, 5 Aug 2010 13:13:41 -0400
Subject: [Mechanize-users] Redirection patch
In-Reply-To:
References:
Message-ID:

Hi Aaron,

On Thu, Aug 5, 2010 at 11:20 AM, Aaron Starr wrote:
> Hi, all,
>
> I found that I had to make the following patch to Mechanize for one of the
> sites I'm scraping:
>
> alias fetch_page_original_version fetch_page
> def fetch_page(params)
>   params[:uri] = params[:uri].gsub(/^https?:/i) {|m| m.downcase } if
> String == params[:uri].class
>   fetch_page_original_version(params)
> end
>
> (Also, here: http://pastie.org/1077542)
>
> The problem was that the site was returning a 302 redirect with a Location
> header that looked like: httpS://www.blah-blah-blah... The weirdly
> capitalized protocol was causing EOF errors, so it needed to be adjusted.
>
> My versions:
>
> mechanize (1.0.0)
> ruby 1.8.7 (2008-08-11 patchlevel 72) [x86_64-linux]
>
> Hopefully, this is helpful to someone.
>

I've opened the following issue to record this information and hopefully
pull the fix into a future version of mechanize:

    http://github.com/tenderlove/mechanize/issues#issue/44

Thanks for reporting it!

>
> Aaron
>
>
> _______________________________________________
> Mechanize-users mailing list
> Mechanize-users at rubyforge.org
> http://rubyforge.org/mailman/listinfo/mechanize-users
>

From diego.virasoro at gmail.com  Fri Aug 13 15:59:01 2010
From: diego.virasoro at gmail.com (Diego Virasoro)
Date: Fri, 13 Aug 2010 20:59:01 +0100
Subject: [Mechanize-users] utf-8 error
In-Reply-To:
References:
Message-ID:

Hello,
I am using mechanize to scrape data from a few websites, including Digg. On
Digg from time to time I get the following error:

    ArgumentError: invalid byte sequence in UTF-8
    testrun.rb:406:in `dump': can't dump hash with default proc (TypeError)

Any idea how I could either fix it, or make mechanize ignore the error and
continue? (It happens when I use the search method, so I'd be happy if it
just kept searching the rest of the page.)

Thank you
Diego

From diego.virasoro at gmail.com  Fri Aug 13 16:00:10 2010
From: diego.virasoro at gmail.com (Diego Virasoro)
Date: Fri, 13 Aug 2010 21:00:10 +0100
Subject: [Mechanize-users] timeout error
In-Reply-To:
References:
Message-ID:

Hello,
is there a way to make Mechanize try again whenever it receives a timeout
error?

Thank you
Diego

From mike at csa.net  Mon Aug 16 08:33:59 2010
From: mike at csa.net (Mike Dalessio)
Date: Mon, 16 Aug 2010 08:33:59 -0400
Subject: [Mechanize-users] timeout error
In-Reply-To:
References:
Message-ID:

Hi Diego,

On Fri, Aug 13, 2010 at 4:00 PM, Diego Virasoro wrote:
> Hello,
> is there a way to make Mechanize try again whenever it receives a timeout
> error?

Currently Mechanize will not automatically retry.
However, you should easily be able to write an application-specific wrapper
for `get()` that catches the raised exception (it should be Timeout::Error)
and retries according to your application's specific needs.

>
> Thank you
> Diego

From james at netlagoon.com  Sat Aug 21 19:54:46 2010
From: james at netlagoon.com (James Fairbairn)
Date: Sat, 21 Aug 2010 23:54:46 +0000 (UTC)
Subject: [Mechanize-users] mechanize and eventmachine
References:
Message-ID:

> I'm planning to process a bunch of pages in parallel, and I'd love to do it
> with fibers and asynchronous web requests, something like
> http://www.igvita.com/2010/03/22/untangling-evented-code-with-ruby-fibers/.
> It looks like Mechanize is built around Net::HTTP, which AFAIK, is
> synchronous only. Is there a way of mixing Eventmachine with Mechanize, or is
> it too closely tied into Net::HTTP? I'm diving into code, but I'm wondering
> if a) anyone's tried this, or maybe b) it's just crazy talk and I should use
> threads or something, or even c) I'm just totally missing some larger point.

If you're using Ruby 1.9 (I assume so, since you're talking about fibers),
you might want to take a look at em-net-http
(http://rubygems.org/gems/em-net-http), which I made just a few days ago.

It patches Net::HTTP so that if it's running inside EM's reactor loop, it
internally uses Fibers and em-http-request to process the request in a
non-blocking fashion without any changes to calling code (aka the NeverBlock
trick).

The advantage of this approach is that it lets you continue using libraries
like Mechanize (and rest-client, weary, right_aws, etc) that depend on
Net::HTTP, while allowing you to achieve high concurrency, without any
changes to your code or the library.

Please note that it's not exhaustively tested yet, and I haven't actually
tried it with Mechanize, so YMMV. (Insert standard plea for patches, tests
and bug reports here.) :-)

Thanks,
James

From astarr at wiredquote.com  Mon Aug 23 16:50:51 2010
From: astarr at wiredquote.com (Aaron Starr)
Date: Mon, 23 Aug 2010 13:50:51 -0700
Subject: [Mechanize-users] Logging response headers in multi-threaded
 implementation of Mechanize
Message-ID:

Hi, all,

I'm using Mechanize in an environment where multiple threads are running
simultaneously. Each thread uses its own Mechanize object. This works
swimmingly, except that the threads step on each other when logging.

So, I've used the following code to ensure that each mechanize object has
its own log that it's writing to, for each transaction. (Also at
http://pastie.org/1110819).

    @mech = Mechanize.new do |mech|
      # put the log in the Mechanize object, and not in the class
      def mech.log=(val); @my_log = val; end
      def mech.log; @my_log; end
    end

    # [...]

    mech.log = Logger.new log_file_for_this_web_transaction

The problem is that I am only getting the request information and request
headers in the logs. The response headers go missing. Looking around a bit,
I find that the reason is that Mechanize logs the response headers like
this:

    if Mechanize.log
      Mechanize.log.debug #...

So, the response headers are going to that non-existent class log.

Anyone have a brilliant and insightful work-around? It would be really nice
to have an independent, self-contained, whole log for each web transaction
even when multiple threads are going at once.

My versions:
>
> mechanize (1.0.0)
> ruby 1.8.7 (2008-08-11 patchlevel 72) [x86_64-linux]
>

Thanks in advance,

Aaron
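
One possible workaround for the per-thread logging question above, offered
only as a sketch: since the response headers are written through the
class-level Mechanize.log, that class-level log could be a small proxy that
forwards each call to whichever Logger the current thread has registered.
The ThreadLocalLogger class, its `use` method, and the :mech_transaction_log
key are hypothetical names invented for this illustration and are not part
of Mechanize. The sketch also assumes that Mechanize.log accepts any object
that responds to the usual Logger severity methods, which has not been
checked against mechanize 1.0.0. The demo uses StringIO buffers in place of
real log files so it runs without the mechanize gem.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper (not part of Mechanize): a Logger-shaped proxy that
# forwards each call to the Logger registered by the calling thread.
class ThreadLocalLogger
  KEY = :mech_transaction_log

  # Register the Logger that the current thread should write to.
  def self.use(logger)
    Thread.current[KEY] = logger
  end

  # Define the usual severity methods; each one looks up the current
  # thread's Logger and silently does nothing if none was registered.
  # (Forwarding a block through define_method needs Ruby 1.9 or later.)
  [:debug, :info, :warn, :error, :fatal, :unknown].each do |level|
    define_method(level) do |*args, &block|
      target = Thread.current[KEY]
      target.send(level, *args, &block) if target
    end
  end
end

# Assumed wiring with Mechanize (commented out, needs the mechanize gem):
#   Mechanize.log = ThreadLocalLogger.new
#   ThreadLocalLogger.use(Logger.new(log_file_for_this_web_transaction))

# Stand-alone demo: one shared proxy, two threads, two separate logs.
shared_log = ThreadLocalLogger.new
buffers = { :a => StringIO.new, :b => StringIO.new }

threads = buffers.map do |name, io|
  Thread.new do
    ThreadLocalLogger.use(Logger.new(io))
    shared_log.debug("response header for #{name}")
  end
end
threads.each { |t| t.join }
```

Each thread would call ThreadLocalLogger.use at the start of a web
transaction, so both the per-object request logging and the class-level
response logging could end up in the same file. Whether every log call in
Mechanize goes through Mechanize.log in this way is an assumption that
would need checking against the library source.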