From carl at youngbloods.org Wed Oct 6 08:56:28 2010 From: carl at youngbloods.org (Carl Youngblood) Date: Wed, 6 Oct 2010 14:56:28 +0200 Subject: [Mechanize-users] Basic auth question Message-ID: Hey, I have been studying the mechanize code for basic authentication. Based on the following lines: http://github.com/tenderlove/mechanize/blob/master/lib/mechanize.rb#L215-220 http://github.com/tenderlove/mechanize/blob/master/lib/mechanize.rb#L625-643 http://github.com/tenderlove/mechanize/blob/master/lib/mechanize/chain/auth_headers.rb#L20-33 it seems that if the auth method is determined to be basic_auth, nothing really happens. All it does is set @user and @password, but these do not appear to ever be used later on in the execution. Is this intentional? Is it because pretty much anything that supports basic auth also supports Digest auth? Thanks, Carl From aaron.patterson at gmail.com Wed Oct 6 10:53:44 2010 From: aaron.patterson at gmail.com (Aaron Patterson) Date: Wed, 6 Oct 2010 08:53:44 -0600 Subject: [Mechanize-users] Basic auth question In-Reply-To: References: Message-ID: On Wed, Oct 6, 2010 at 6:56 AM, Carl Youngblood wrote: > Hey, I have been studying the mechanize code for basic authentication. > Based on the following lines: > > http://github.com/tenderlove/mechanize/blob/master/lib/mechanize.rb#L215-220 > http://github.com/tenderlove/mechanize/blob/master/lib/mechanize.rb#L625-643 > http://github.com/tenderlove/mechanize/blob/master/lib/mechanize/chain/auth_headers.rb#L20-33 > > it seems that if the auth method is determined to be basic_auth, > nothing really happens. All it does is set @user and @password, but > these do not appear to ever be used later on in the execution. Is this > intentional? Is it because pretty much anything that supports basic > auth also supports Digest auth? Unless there is a bug I'm not seeing, mechanize should pass the basic auth info to net/http. 
When it determines that it should use basic auth here: http://github.com/tenderlove/mechanize/blob/master/lib/mechanize.rb#L635 It calls the basic_auth method on net/http here: http://github.com/tenderlove/mechanize/blob/master/lib/mechanize/chain/auth_headers.rb#L22-23 Hope that helps! -- Aaron Patterson http://tenderlovemaking.com/ From carl at youngbloods.org Wed Oct 6 10:59:36 2010 From: carl at youngbloods.org (Carl Youngblood) Date: Wed, 6 Oct 2010 16:59:36 +0200 Subject: [Mechanize-users] Basic auth question In-Reply-To: References: Message-ID: Thanks, I didn't realize that the object it was working on was Net::HTTP. On Wed, Oct 6, 2010 at 4:53 PM, Aaron Patterson wrote: > On Wed, Oct 6, 2010 at 6:56 AM, Carl Youngblood wrote: >> Hey, I have been studying the mechanize code for basic authentication. >> Based on the following lines: >> >> http://github.com/tenderlove/mechanize/blob/master/lib/mechanize.rb#L215-220 >> http://github.com/tenderlove/mechanize/blob/master/lib/mechanize.rb#L625-643 >> http://github.com/tenderlove/mechanize/blob/master/lib/mechanize/chain/auth_headers.rb#L20-33 >> >> it seems that if the auth method is determined to be basic_auth, >> nothing really happens. All it does is set @user and @password, but >> these do not appear to ever be used later on in the execution. Is this >> intentional? Is it because pretty much anything that supports basic >> auth also supports Digest auth? > > Unless there is a bug I'm not seeing, mechanize should pass the basic > auth info to net/http. > > When it determines that it should use basic auth here: > > ?http://github.com/tenderlove/mechanize/blob/master/lib/mechanize.rb#L635 > > It calls the basic_auth method on net/http here: > > ?http://github.com/tenderlove/mechanize/blob/master/lib/mechanize/chain/auth_headers.rb#L22-23 > > Hope that helps! 
> > -- > Aaron Patterson > http://tenderlovemaking.com/ > _______________________________________________ > Mechanize-users mailing list > Mechanize-users at rubyforge.org > http://rubyforge.org/mailman/listinfo/mechanize-users > From rahul.thathoo at gmail.com Wed Oct 6 19:33:03 2010 From: rahul.thathoo at gmail.com (Rahul Thathoo) Date: Wed, 6 Oct 2010 16:33:03 -0700 Subject: [Mechanize-users] can you mechanize.get('http://www.sephora.com/browse/product.jhtml?id=P230234') without error? Message-ID: hey guys, i cant seem to mechanize.get(' http://www.sephora.com/browse/product.jhtml?id=P230234'), it errors out as: /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-0.9.2/lib/www/mechanize/chain/response_header_handler.rb:32:in `handle': invalid byte sequence in UTF-8 (ArgumentError) from /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-0.9.2/lib/www/mechanize/chain.rb:30:in `pass' from /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-0.9.2/lib/www/mechanize/chain/handler.rb:6:in `handle' from /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-0.9.2/lib/www/mechanize/chain/response_body_parser.rb:35:in `handle' from /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-0.9.2/lib/www/mechanize/chain.rb:30:in `pass' from /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-0.9.2/lib/www/mechanize/chain/handler.rb:6:in `handle' from /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-0.9.2/lib/www/mechanize/chain/pre_connect_hook.rb:14:in `handle' from /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-0.9.2/lib/www/mechanize/chain.rb:25:in `handle' from /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-0.9.2/lib/www/mechanize.rb:506:in `fetch_page' from /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-0.9.2/lib/www/mechanize.rb:234:in `get' from del.rb:15:in `
' and line 32 on response_header_handler.rb is: if page.is_a?(Page) && page.body =~ /Set-Cookie/n i am on ruby 1.9 and mechanize-0.9.2 obviously... any ideas here? changing page.body.include?('Set-Cookie') works... -------------- next part -------------- An HTML attachment was scrubbed... URL: From moonshiner at retn.net Wed Oct 20 06:04:17 2010 From: moonshiner at retn.net (Gazizov Andrey) Date: Wed, 20 Oct 2010 14:04:17 +0400 Subject: [Mechanize-users] HTTP POST request using Mechanize Message-ID: <4CBEBEA1.7090303@retn.net> Hello guys, I'm newbie in ruby and Mechanize. I has created a script to make a POST and GET HTTP request to Redmine application. So, everything fine with GET request and authorization. But when I try to do a POST request Mechanize gave a following error: ws-moonshiner# ruby post.rb /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:464:in `post_form': 500 => Net::HTTPInternalServerError (Mechanize::ResponseCodeError) from /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:357:in `post' from post.rb:21 Here is my script: require 'rubygems' require 'mechanize' require 'logger' # runnig a Mechanize agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari'} agent.log = Logger.new(STDOUT) # get the Redmine login form & fill it out with the username/password page = agent.get("http://127.0.0.1/login") login_form = page.form_with(:action => "/login") login_form.username = 'test' login_form.password = 'test' # submit Redmine login form page = agent.submit login_form issues_page = agent.get("http://127.0.0.1/projects/test/issues") agent.post(issues_page, { "priority_id" => "3", 'tracker_id' => "1", }) Latest message from logger: [...] I, [2010-10-20T13:56:00.340095 #16626] INFO -- : status: 500 I understand that for some reason server generates an internal error. But I have no idea how to fix it. Does anybody know how to resolve this issue? Thank you very much in advance. 
-- Best regards, Andrew From chris at kimptoc.net Wed Oct 20 11:58:09 2010 From: chris at kimptoc.net (Chris Kimpton) Date: Wed, 20 Oct 2010 16:58:09 +0100 Subject: [Mechanize-users] HTTP POST request using Mechanize In-Reply-To: <4CBEE00F.7070309@retn.net> References: <4CBEE00F.7070309@retn.net> Message-ID: Hi, That looks like an error from the webserver side - http://127.0.0.1/login - not a mechanize issue. Probably worth looking in your server logs... Regards, Chris On 20 October 2010 13:26, Gazizov Andrey wrote: > Hello guys, > > I'm newbie in Ruby and Mechanize. I has created a script to make a POST > and GET HTTP request to Redmine application. So, everything fine with > GET request and authorization. But when I try to do a POST request > Mechanize gave a following error: > > ws-moonshiner# ruby post.rb > /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:464:in > `post_form': 500 => Net::HTTPInternalServerError > (Mechanize::ResponseCodeError) > from > /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:357:in > `post' > from post.rb:21 > > Here is my script: > > require 'rubygems' > require 'mechanize' > require 'logger' > > # runnig a Mechanize > agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari'} > > agent.log = Logger.new(STDOUT) > > # get the Redmine login form & fill it out with the username/password > page = agent.get("http://127.0.0.1/login") > login_form = page.form_with(:action => "/login") > login_form.username = 'test' > login_form.password = 'test' > > # submit Redmine login form > page = agent.submit login_form > > issues_page = agent.get("http://127.0.0.1/projects/test/issues") > > agent.post(issues_page, { > "priority_id" => "3", > 'tracker_id' => "1", > }) > > Latest message from logger: > [...] > I, [2010-10-20T13:56:00.340095 #16626] INFO -- : status: 500 > > I understand that for some reason server generates an internal error. > But I have no idea how to fix it. 
Does anybody know how to resolve this > issue? > > Thank you very much in advance. > > -- > Best regards, > Andrew > _______________________________________________ > Mechanize-users mailing list > Mechanize-users at rubyforge.org > http://rubyforge.org/mailman/listinfo/mechanize-users > -------------- next part -------------- An HTML attachment was scrubbed... URL: From overco at me.com Wed Oct 20 13:12:00 2010 From: overco at me.com (overco at me.com) Date: Wed, 20 Oct 2010 13:12:00 -0400 Subject: [Mechanize-users] Scrape data from AdWords Message-ID: <99D08122-AFF4-4CD8-B2D9-86C72BE29610@me.com> http://www.skyrocketonlinemarketing.com/2010/06/20/scrape-ppc-spend-data-from-adwords-with-ruby-mechanize/ Jonathan Clarke ? Campaign Manager U365 Cell Number: +1-246-256-0770 Skype IM Chat: jonathan.clarke This message is intended only for the addressee and contains privileged and confidential information. If you have received this message in error please notify me immediately and delete the original message and destroy any copies of it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From astahl at hi5.com Wed Oct 20 14:28:11 2010 From: astahl at hi5.com (Alex Stahl) Date: Wed, 20 Oct 2010 11:28:11 -0700 Subject: [Mechanize-users] HTTP POST request using Mechanize In-Reply-To: References: <4CBEE00F.7070309@retn.net> Message-ID: <1287599291.1869.73.camel@awstahl-t61> To elaborate a little (since I found this out the hard way)... Mechanize only handles HTTP response codes 200 and 302 - others will throw an exception. So what you're seeing is, as Chris notes, an error on the server side which returns an HTTP 500, which then throws an exception, which isn't handled and causes the result you're seeing. Two things you might want to try: 1. Wrap the post in a begin/rescue block so you can handle the exception 2. 
Use a packet sniffer to analyze the difference between the post request you're sending through Mechanize and what a browser sends when you submit the form. The server is likely returning 500 due to a malformed request. On Wed, 2010-10-20 at 10:58 -0500, Chris Kimpton wrote: > Hi, > > That looks like an error from the webserver side - > http://127.0.0.1/login - not a mechanize issue. > > Probably worth looking in your server logs... > > Regards, > Chris > > > On 20 October 2010 13:26, Gazizov Andrey wrote: > > Hello guys, > > I'm newbie in Ruby and Mechanize. I has created a script to > make a POST > and GET HTTP request to Redmine application. So, everything > fine with > GET request and authorization. But when I try to do a POST > request > Mechanize gave a following error: > > ws-moonshiner# ruby post.rb > /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:464:in > `post_form': 500 => Net::HTTPInternalServerError > (Mechanize::ResponseCodeError) > from > /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:357:in > `post' > from post.rb:21 > > Here is my script: > > require 'rubygems' > require 'mechanize' > require 'logger' > > # runnig a Mechanize > agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari'} > > agent.log = Logger.new(STDOUT) > > # get the Redmine login form & fill it out with the > username/password > page = agent.get("http://127.0.0.1/login") > login_form = page.form_with(:action => "/login") > login_form.username = 'test' > login_form.password = 'test' > > # submit Redmine login form > page = agent.submit login_form > > issues_page = > agent.get("http://127.0.0.1/projects/test/issues") > > agent.post(issues_page, { > "priority_id" => "3", > 'tracker_id' => "1", > }) > > Latest message from logger: > [...] > I, [2010-10-20T13:56:00.340095 #16626] INFO -- : status: 500 > > I understand that for some reason server generates an internal > error. > But I have no idea how to fix it. 
Does anybody know how to > resolve this > issue? > > Thank you very much in advance. > > -- > Best regards, > Andrew > _______________________________________________ > Mechanize-users mailing list > Mechanize-users at rubyforge.org > http://rubyforge.org/mailman/listinfo/mechanize-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hs.shafiei at gmail.com Wed Oct 20 18:00:59 2010 From: hs.shafiei at gmail.com (Hosein Shafiei) Date: Thu, 21 Oct 2010 01:30:59 +0330 Subject: [Mechanize-users] Mechanize::ResponseCodeError: 500 Message-ID: Also, excuse me if my issue is a little bit application-specific. I need Mechanize to submit bunch of data from our server to an online reservation system. The url ishttp://hajres.iranair.com/haj/request.php" It works in my browser however, using mechanize following error appears: >> agent.get("http://hajres.iranair.com/haj/request.php") Net::HTTP::Get: /haj/request.php request-header: accept-language => en-us,en;q=0.5 request-header: accept => */* request-header: user-agent => WWW-Mechanize/1.0.0 ( http://rubyforge.org/projects/mechanize/) request-header: connection => keep-alive request-header: accept-encoding => gzip,identity request-header: cookie => PHPSESSID=pf1qrf54our96oup7ofei385n3 request-header: host => hajres.iranair.com request-header: accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7 request-header: keep-alive => 300 Read 72 bytes response-header: x-powered-by => PHP/5.2.5 response-header: expires => Thu, 19 Nov 1981 08:52:00 GMT response-header: content-type => text/html; charset=cp-1256 response-header: connection => close response-header: server => Apache/2.2.8 (Fedora) response-header: date => Wed, 20 Oct 2010 19:21:53 GMT response-header: content-length => 72 response-header: cache-control => no-store, no-cache, must-revalidate, post-check=0, pre-check=0 response-header: pragma => no-cache status: 500 Mechanize::ResponseCodeError: 500 => Net::HTTPInternalServerError from 
/Library/Ruby/Gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:259:in `get' from (irb):26 Could you realize what is wrong? p.s. I also changed agent type. -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscartheduck at gmail.com Wed Oct 20 18:57:56 2010 From: oscartheduck at gmail.com (James) Date: Wed, 20 Oct 2010 16:57:56 -0600 Subject: [Mechanize-users] Mechanize::ResponseCodeError: 500 In-Reply-To: References: Message-ID: Visiting the site in a web browser, it looks like it's not responsive right now. Which makes sense, given that you're getting an error 500. That's an http error code: 10.5.1 500 Internal Server Error The server encountered an unexpected condition which prevented it from fulfilling the request. On Wed, Oct 20, 2010 at 4:00 PM, Hosein Shafiei wrote: > Also, excuse me if my issue is a little bit application-specific. I need > Mechanize to submit bunch of data from our server to an online reservation > system. The url ishttp://hajres.iranair.com/haj/request.php" > It works in my browser however, using mechanize following error appears: > > >> agent.get("http://hajres.iranair.com/haj/request.php") > Net::HTTP::Get: /haj/request.php > request-header: accept-language => en-us,en;q=0.5 > request-header: accept => */* > request-header: user-agent => WWW-Mechanize/1.0.0 ( > http://rubyforge.org/projects/mechanize/) > request-header: connection => keep-alive > request-header: accept-encoding => gzip,identity > request-header: cookie => PHPSESSID=pf1qrf54our96oup7ofei385n3 > request-header: host => hajres.iranair.com > request-header: accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7 > request-header: keep-alive => 300 > Read 72 bytes > response-header: x-powered-by => PHP/5.2.5 > response-header: expires => Thu, 19 Nov 1981 08:52:00 GMT > response-header: content-type => text/html; charset=cp-1256 > response-header: connection => close > response-header: server => Apache/2.2.8 (Fedora) > response-header: date => Wed, 20 
Oct 2010 19:21:53 GMT > response-header: content-length => 72 > response-header: cache-control => no-store, no-cache, must-revalidate, > post-check=0, pre-check=0 > response-header: pragma => no-cache > status: 500 > Mechanize::ResponseCodeError: 500 => Net::HTTPInternalServerError > from /Library/Ruby/Gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:259:in > `get' > from (irb):26 > > Could you realize what is wrong? > p.s. I also changed agent type. > > _______________________________________________ > Mechanize-users mailing list > Mechanize-users at rubyforge.org > http://rubyforge.org/mailman/listinfo/mechanize-users > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hs.shafiei at gmail.com Thu Oct 21 11:12:24 2010 From: hs.shafiei at gmail.com (Hosein Shafiei) Date: Thu, 21 Oct 2010 18:42:24 +0330 Subject: [Mechanize-users] Mechanize::ResponseCodeError: 500 In-Reply-To: References: Message-ID: I realized It only works with IE and Chrome other browsers return 500 error. So I changed agent_alias using : >> agent.user_agent_alias = "Windows IE 6" => "Windows IE 6" again no luck ! could anyone please help me. On Thu, Oct 21, 2010 at 2:27 AM, James wrote: > Visiting the site in a web browser, it looks like it's not responsive right > now. > > Which makes sense, given that you're getting an error 500. That's an http > error code: > > 10.5.1 500 Internal Server Error > > The server encountered an unexpected condition which prevented it from > fulfilling the request. > > On Wed, Oct 20, 2010 at 4:00 PM, Hosein Shafiei wrote: > >> Also, excuse me if my issue is a little bit application-specific. I need >> Mechanize to submit bunch of data from our server to an online reservation >> system. 
The url ishttp://hajres.iranair.com/haj/request.php" >> It works in my browser however, using mechanize following error appears: >> >> >> agent.get("http://hajres.iranair.com/haj/request.php") >> Net::HTTP::Get: /haj/request.php >> request-header: accept-language => en-us,en;q=0.5 >> request-header: accept => */* >> request-header: user-agent => WWW-Mechanize/1.0.0 ( >> http://rubyforge.org/projects/mechanize/) >> request-header: connection => keep-alive >> request-header: accept-encoding => gzip,identity >> request-header: cookie => PHPSESSID=pf1qrf54our96oup7ofei385n3 >> request-header: host => hajres.iranair.com >> request-header: accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7 >> request-header: keep-alive => 300 >> Read 72 bytes >> response-header: x-powered-by => PHP/5.2.5 >> response-header: expires => Thu, 19 Nov 1981 08:52:00 GMT >> response-header: content-type => text/html; charset=cp-1256 >> response-header: connection => close >> response-header: server => Apache/2.2.8 (Fedora) >> response-header: date => Wed, 20 Oct 2010 19:21:53 GMT >> response-header: content-length => 72 >> response-header: cache-control => no-store, no-cache, must-revalidate, >> post-check=0, pre-check=0 >> response-header: pragma => no-cache >> status: 500 >> Mechanize::ResponseCodeError: 500 => Net::HTTPInternalServerError >> from /Library/Ruby/Gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:259:in >> `get' >> from (irb):26 >> >> Could you realize what is wrong? >> p.s. I also changed agent type. >> >> _______________________________________________ >> Mechanize-users mailing list >> Mechanize-users at rubyforge.org >> http://rubyforge.org/mailman/listinfo/mechanize-users >> > > > _______________________________________________ > Mechanize-users mailing list > Mechanize-users at rubyforge.org > http://rubyforge.org/mailman/listinfo/mechanize-users > -------------- next part -------------- An HTML attachment was scrubbed... 
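The "500 => Net::HTTPInternalServerError" text in the traces above pairs the numeric status with the Net::HTTP response class for that status. A minimal sketch, using only Ruby's standard library and not Mechanize itself, of how a status code maps to a response class and how to tell a server-side failure (5xx) from a client-side one (4xx). The describe_status helper is a hypothetical name introduced only for this illustration.

```ruby
require 'net/http'

# Hypothetical helper (not part of Mechanize): look up the Net::HTTP
# response class for a status code and say which side the failure is on.
def describe_status(code)
  klass = Net::HTTPResponse::CODE_TO_OBJ[code.to_s]
  return "#{code} => unknown status" if klass.nil?

  side =
    if klass <= Net::HTTPServerError then 'server-side error'
    elsif klass <= Net::HTTPClientError then 'client-side error'
    else 'not an error'
    end
  "#{code} => #{klass} (#{side})"
end

puts describe_status(500)
puts describe_status(404)
puts describe_status(200)
```

A 5xx class means the request reached the server and the server failed while handling it, so the next step is usually to compare the request (headers, cookies, parameters) with what a working browser sends.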
From amuino at gmail.com Tue Oct 26 09:18:21 2010
From: amuino at gmail.com (Abel Muiño Vizcaino)
Date: Tue, 26 Oct 2010 15:18:21 +0200
Subject: [Mechanize-users] Infinite loop on meta refresh
Message-ID: <093BBCB5-D5C7-40A9-AC67-93FC4B57A2CA@gmail.com>

Hi!

I have a problem when using Mechanize as a crawler with some pages which have a long-delayed meta refresh to the same URL.

First, some context: I want to reach "the real thing" when navigating to URLs with delays, so following all kinds of redirects (including meta refresh) is on my wishlist.

What I have found is that pages such as TechMeme have a meta refresh to the same URL with very long waits (1800 seconds). For these pages it takes too long to reach the maximum number of redirects (and there is no real value in following the redirects either).

For these situations, it is not clear what the best option is, since several factors are at play.

I would propose having a flag which avoids redirecting (and waiting) when the refresh is to the same URL. This would be off by default, allowing other use cases.

Another option that could be useful for a wider range of use cases is ignoring waits on meta refresh.

This is also summarized as an issue on GitHub: http://github.com/tenderlove/mechanize/issues/issue/67

I am very new to Mechanize and I might be abusing it :-). Or just not seeing a better way of handling this.

So, before I embark on forking & everything... feedback on this issue is welcome!
--
Abel Muiño

From mike at csa.net Tue Oct 26 11:58:52 2010
From: mike at csa.net (Mike Dalessio)
Date: Tue, 26 Oct 2010 11:58:52 -0400
Subject: [Mechanize-users] Infinite loop on meta refresh
In-Reply-To: <093BBCB5-D5C7-40A9-AC67-93FC4B57A2CA@gmail.com>
References: <093BBCB5-D5C7-40A9-AC67-93FC4B57A2CA@gmail.com>
Message-ID:

Hi!

On Tue, Oct 26, 2010 at 9:18 AM, Abel Muiño Vizcaino wrote:
> Hi!
> > I have a problem when using Mechanize as a crawler with some pages which > have long-delayed meta refresh to the same url. > > First, some context: I want to reach "the real thing" when navigating to > urls with delays... so following all kind of redirects (including meta > refresh) is on my whishlist. > > What I have found is that pages such as TechMeme have > a meta refresh to the same URL with very long waits (1800 seconds). For > these pages it takes too long to reach the maximum number of redirects (and > there is no real value in following the redirects either). > What value is there in following meta-refreshes with long waits? Why not just turn off follow_meta_refresh? I think I don't totally understand what it is you're trying to do. > > For theses situations, it is not clear what the best option is, since > several factors are at play. > > I would propose having a flag which avoids redirecting (and waiting) when > the refresh is to the same url. This would be off by default allowing other > use cases. > > Other option that could be useful for a wider range of use cases is > ignoring waits on meta refresh. > > This is summarized also as an issue on github: > http://github.com/tenderlove/mechanize/issues/issue/67 > > I am very new to Mechanize and I might be abusing it :-). Or just not > seeing a better way of handling this. > > So, before I embark in forking & everything... feedback on this issue is > welcome! > -- > Abel Mui?o > > _______________________________________________ > Mechanize-users mailing list > Mechanize-users at rubyforge.org > http://rubyforge.org/mailman/listinfo/mechanize-users > -------------- next part -------------- An HTML attachment was scrubbed... 
From amuino at gmail.com Tue Oct 26 12:54:40 2010
From: amuino at gmail.com (Abel Muiño Vizcaino)
Date: Tue, 26 Oct 2010 18:54:40 +0200
Subject: [Mechanize-users] Infinite loop on meta refresh
In-Reply-To:
References: <093BBCB5-D5C7-40A9-AC67-93FC4B57A2CA@gmail.com>
Message-ID: <69EA3C59-26DA-43F9-B306-01B4EDA75876@gmail.com>

Below:

On 26/10/2010, at 17:58, Mike Dalessio wrote:

> Hi!
>
> On Tue, Oct 26, 2010 at 9:18 AM, Abel Muiño Vizcaino wrote:
> Hi!
>
> I have a problem when using Mechanize as a crawler with some pages which have long-delayed meta refresh to the same url.
>
> First, some context: I want to reach "the real thing" when navigating to urls with delays... so following all kind of redirects (including meta refresh) is on my wishlist.
>
> What I have found is that pages such as TechMeme have a meta refresh to the same URL with very long waits (1800 seconds). For these pages it takes too long to reach the maximum number of redirects (and there is no real value in following the redirects either).
>
> What value is there in following meta-refreshes with long waits? Why not just turn off follow_meta_refresh? I think I don't totally understand what it is you're trying to do.

For my problem, I want to reach the last page in any sequence of redirects (be it HTTP or meta refresh) and do some work with that last page. Additionally, I don't know beforehand which pages I'll be crawling, so I can't enable or disable meta refresh depending on whether it is going to cause trouble.

It is great that Mechanize can get me halfway there (other engines don't even handle meta refresh), but it is not fully solving my problem.

From a more general point of view, this issue with meta refresh pointing to the same page is going to be a problem for any crawler (which follows meta refresh tags, anyway).
Of course I can use turn off follow_meta_refresh and implement my own meta refresh handling outside mechanize, but since there is some support for that, I think it would be better to enhance it than to reimplement it in my code. What I would like to do (maybe through options) is to ignore waits when redirecting to a different url and to stop following meta refresh when they point to the same url. > > > For theses situations, it is not clear what the best option is, since several factors are at play. > > I would propose having a flag which avoids redirecting (and waiting) when the refresh is to the same url. This would be off by default allowing other use cases. > > Other option that could be useful for a wider range of use cases is ignoring waits on meta refresh. > > This is summarized also as an issue on github: http://github.com/tenderlove/mechanize/issues/issue/67 > > I am very new to Mechanize and I might be abusing it :-). Or just not seeing a better way of handling this. > > So, before I embark in forking & everything... feedback on this issue is welcome! > -- > Abel Mui?o > > _______________________________________________ > Mechanize-users mailing list > Mechanize-users at rubyforge.org > http://rubyforge.org/mailman/listinfo/mechanize-users > > _______________________________________________ > Mechanize-users mailing list > Mechanize-users at rubyforge.org > http://rubyforge.org/mailman/listinfo/mechanize-users -------------- next part -------------- An HTML attachment was scrubbed... 
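The behavior requested above depends on two values carried in the content attribute of a meta refresh tag: the delay in seconds and an optional target URL (for example "1800" or "5; url=http://example.com/next"). A minimal sketch, assuming that attribute has already been extracted from the page, of splitting it into those two parts. The parse_refresh helper is a hypothetical name for illustration and is not Mechanize's own parser.

```ruby
# Hypothetical helper (not Mechanize's implementation): split the content
# attribute of a meta refresh tag into [delay_in_seconds, target_url_or_nil].
def parse_refresh(content)
  # The delay comes first; an optional "url=..." part follows a ';' or ','.
  delay, rest = content.to_s.split(/[;,]/, 2)
  url = rest.to_s[/\A\s*url\s*=\s*(.*?)\s*\z/i, 1]
  # Some pages wrap the URL in quotes; strip one leading and trailing quote.
  url = url.to_s.gsub(/\A['"]|['"]\z/, '')
  [delay.to_f, url.empty? ? nil : url]
end

p parse_refresh('1800')                            # delay only, no target URL
p parse_refresh('5; url=http://example.com/next')  # delay plus target URL
```

With the delay and the target separated, a crawler can choose to skip the wait and to treat a missing target as "nothing to follow".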
URL: From mike at csa.net Tue Oct 26 15:40:10 2010 From: mike at csa.net (Mike Dalessio) Date: Tue, 26 Oct 2010 15:40:10 -0400 Subject: [Mechanize-users] Infinite loop on meta refresh In-Reply-To: <69EA3C59-26DA-43F9-B306-01B4EDA75876@gmail.com> References: <093BBCB5-D5C7-40A9-AC67-93FC4B57A2CA@gmail.com> <69EA3C59-26DA-43F9-B306-01B4EDA75876@gmail.com> Message-ID: On Tue, Oct 26, 2010 at 12:54 PM, Abel Mui?o Vizcaino wrote: > Below: > > El 26/10/2010, a las 17:58, Mike Dalessio escribi?: > > Hi! > > On Tue, Oct 26, 2010 at 9:18 AM, Abel Mui?o Vizcaino wrote: > >> Hi! >> >> I have a problem when using Mechanize as a crawler with some pages which >> have long-delayed meta refresh to the same url. >> >> First, some context: I want to reach "the real thing" when navigating to >> urls with delays... so following all kind of redirects (including meta >> refresh) is on my whishlist. >> >> What I have found is that pages such as TechMeme have >> a meta refresh to the same URL with very long waits (1800 seconds). For >> these pages it takes too long to reach the maximum number of redirects (and >> there is no real value in following the redirects either). >> > > What value is there in following meta-refreshes with long waits? Why not > just turn off follow_meta_refresh? I think I don't totally understand what > it is you're trying to do. > > > For my problem, I want to reach the last page in any sequence of redirects > (be it http or meta refresh) and do some work with that last page. > Additionally, I don't know the page I'll be crawling beforehand so I can't > enable/disable meta refresh depending on wether it is going to cause trouble > or not. > > It is great that Mechanize can get me halfway there (other engines don't > even handle meta refresh) but it is not fully solving my problem. > > From a more general point of view, this issue with meta refresh pointing to > the same page is going to be a problem for any crawler (which follows meta > refresh tags, anyway). 
>
> Of course I can use turn off follow_meta_refresh and implement my own meta
> refresh handling outside mechanize, but since there is some support for
> that, I think it would be better to enhance it than to reimplement it in my
> code.
>
> What I would like to do (maybe through options) is to ignore waits when
> redirecting to a different url and to stop following meta refresh when they
> point to the same url.
>

Ok, let me rephrase what I think you're asking for, and you tell me if I'm correct.

You'd like to ignore tags like:

You'd like to follow tags like:

That is, if content contains a URL to be followed, you'd like to follow it, but otherwise, not.

Is that correct?

From amuino at gmail.com Tue Oct 26 16:11:42 2010
From: amuino at gmail.com (Abel Muiño Vizcaino)
Date: Tue, 26 Oct 2010 22:11:42 +0200
Subject: [Mechanize-users] Infinite loop on meta refresh
In-Reply-To:
References: <093BBCB5-D5C7-40A9-AC67-93FC4B57A2CA@gmail.com> <69EA3C59-26DA-43F9-B306-01B4EDA75876@gmail.com>
Message-ID: <3DD24306-ED42-41BF-B0DA-9D4BF829EE90@gmail.com>

Trimmed the email for readability.

On 26/10/2010, at 21:40, Mike Dalessio wrote:

> Ok, let me rephrase what I think you're asking for, and you tell me if I'm correct.
>
> You'd like to ignore tags like:
>
> You'd like to follow tags like:
>
> That is, if content contains a URL to be followed, you'd like to follow it, but otherwise, not.
>
> Is that correct?

Yes, that's correct, with an extra: if the URL is the current URL, don't follow it.

Example:

mech.get("http://sample.net/")

Should not follow (or sleep on) tags like:

--
Abel Muiño
From mike at csa.net Tue Oct 26 16:25:25 2010
From: mike at csa.net (Mike Dalessio)
Date: Tue, 26 Oct 2010 16:25:25 -0400
Subject: [Mechanize-users] Infinite loop on meta refresh
In-Reply-To: <3DD24306-ED42-41BF-B0DA-9D4BF829EE90@gmail.com>
References: <093BBCB5-D5C7-40A9-AC67-93FC4B57A2CA@gmail.com> <69EA3C59-26DA-43F9-B306-01B4EDA75876@gmail.com> <3DD24306-ED42-41BF-B0DA-9D4BF829EE90@gmail.com>
Message-ID:

On Tue, Oct 26, 2010 at 4:11 PM, Abel Muiño Vizcaino wrote:

> Trimmed the email for readability.
>
> On 26/10/2010, at 21:40, Mike Dalessio wrote:
>
> Ok, let me rephrase what I think you're asking for, and you tell me if I'm
> correct.
>
> You'd like to ignore tags like:
>
> You'd like to follow tags like:
>
> That is, if content contains a URL to be followed, you'd like to follow it,
> but otherwise, not.
>
> Is that correct?
>
>
>
> Yes, that's correct, with an extra: if the url is the current url, don't
> follow it.
> Example:
>
> mech.get("http://sample.net/")
>
>
> Should not follow (or sleep on) tags like:
>
>
>
> --
> Abel Muiño
>

This seems like reasonable behavior, which could be made the Mechanize default behavior.

Does anybody have objections to changing Mechanize's behavior to *not* follow meta refreshes that are either for the current page URI or contain no URI?

From john at 7fff.com Tue Oct 26 18:05:18 2010
From: john at 7fff.com (John Norman)
Date: Tue, 26 Oct 2010 17:05:18 -0500
Subject: [Mechanize-users] Infinite loop on meta refresh
In-Reply-To:
References: <093BBCB5-D5C7-40A9-AC67-93FC4B57A2CA@gmail.com> <69EA3C59-26DA-43F9-B306-01B4EDA75876@gmail.com> <3DD24306-ED42-41BF-B0DA-9D4BF829EE90@gmail.com>
Message-ID:

No objection here . . .
but:

The only reason I can imagine to keep requesting such URLs is for a web site that (for whatever reason) counts the refreshes and does something based on that count, or otherwise depends on a state change provoked by the refreshes.

If the URL is for the current page URI, it almost seems worthy of an exception.

On Tue, Oct 26, 2010 at 3:25 PM, Mike Dalessio wrote:

>
>
> On Tue, Oct 26, 2010 at 4:11 PM, Abel Muiño Vizcaino wrote:
>
>> Trimmed the email for readability.
>>
>> On 26/10/2010, at 21:40, Mike Dalessio wrote:
>>
>> Ok, let me rephrase what I think you're asking for, and you tell me if I'm
>> correct.
>>
>> You'd like to ignore tags like:
>>
>> You'd like to follow tags like:
>>
>> That is, if content contains a URL to be followed, you'd like to follow
>> it, but otherwise, not.
>>
>> Is that correct?
>>
>>
>>
>> Yes, that's correct, with an extra: if the url is the current url, don't
>> follow it.
>> Example:
>>
>> mech.get("http://sample.net/")
>>
>>
>> Should not follow (or sleep on) tags like:
>>
>>
>>
>> --
>> Abel Muiño
>>
>
> This seems like reasonable behavior, which could be made the Mechanize
> default behavior.
>
> Does anybody have objections to changing Mechanize's behavior to *not*
> follow meta refreshes that are either for the current page URI or contain no
> URI?
>
>
> _______________________________________________
> Mechanize-users mailing list
> Mechanize-users at rubyforge.org
> http://rubyforge.org/mailman/listinfo/mechanize-users
>
From amuino at gmail.com Wed Oct 27 10:50:27 2010
From: amuino at gmail.com (Abel Muiño Vizcaino)
Date: Wed, 27 Oct 2010 16:50:27 +0200
Subject: [Mechanize-users] Infinite loop on meta refresh
In-Reply-To:
References: <093BBCB5-D5C7-40A9-AC67-93FC4B57A2CA@gmail.com> <69EA3C59-26DA-43F9-B306-01B4EDA75876@gmail.com> <3DD24306-ED42-41BF-B0DA-9D4BF829EE90@gmail.com>
Message-ID:

On 27/10/2010, at 00:05, John Norman wrote:

> No objection here . . . but:
>
> The only reason I can imagine to keep requesting such URLs is for a web site that (for whatever reason) counts the refreshes and does something based on that count, or otherwise depends on a state change provoked by the refreshes.

As a developer, I have used this "refresh to self" strategy to wait for a background job (and maybe display some progress information too). So, in order to keep Mechanize useful for end-to-end testing of webapps, this scenario should be supported.

> If the URL is for the current page URI, it almost seems worthy of an exception.

I'm not sure about the exception. Some kind of information would be nice (especially if Mech apparently "hangs" when following a ). A feature to avoid or limit waiting would also help, but that's probably a different topic.

Just allowing an option to ignore or follow the meta redirection to the same page would be nice.

>
> On Tue, Oct 26, 2010 at 3:25 PM, Mike Dalessio wrote:
>
>
> On Tue, Oct 26, 2010 at 4:11 PM, Abel Muiño Vizcaino wrote:
> Trimmed the email for readability.
>
> On 26/10/2010, at 21:40, Mike Dalessio wrote:
>
>> Ok, let me rephrase what I think you're asking for, and you tell me if I'm correct.
>>
>> You'd like to ignore tags like:
>>
>> You'd like to follow tags like:
>>
>> That is, if content contains a URL to be followed, you'd like to follow it, but otherwise, not.
>>
>> Is that correct?
>
>
> Yes, that's correct, with an extra: if the url is the current url, don't follow it.
> Example:
> mech.get("http://sample.net/")
>
> Should not follow (or sleep on) tags like:
>
> --
> Abel Muiño
>
> This seems like reasonable behavior, which could be made the Mechanize default behavior.
>
> Does anybody have objections to changing Mechanize's behavior to *not* follow meta refreshes that are either for the current page URI or contain no URI?
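The option requested in this last message (ignore or follow refreshes to the same page, plus some way to limit waiting) could look roughly like the sketch below. Everything here is hypothetical: `RefreshPolicy`, `follow_self` and `max_wait` are illustrative names and are not Mechanize settings. The sketch refuses an over-long wait outright instead of shortening it, which is only one of the possible readings of "avoid or limit waiting".

```ruby
# Hypothetical policy object; the option names are assumptions made for
# this example, not part of any library.
class RefreshPolicy
  def initialize(follow_self: false, max_wait: nil)
    @follow_self = follow_self   # follow refreshes that point at the current page?
    @max_wait = max_wait         # longest acceptable wait in seconds, nil = no limit
  end

  # Returns the number of seconds to sleep before following the refresh,
  # or nil when the refresh should not be followed at all.
  def wait_for(delay:, same_page:)
    return 0 unless same_page                     # different URL: follow at once, ignore the wait
    return nil unless @follow_self                # crawler case: stop on refresh-to-self
    return nil if @max_wait && delay > @max_wait  # too long to be worth polling

    delay
  end
end
```

A crawler would use the defaults (`RefreshPolicy.new`), so an 1800-second refresh to the same page is dropped. An end-to-end test that polls a background job could use `RefreshPolicy.new(follow_self: true, max_wait: 10)` and keep following short refreshes to the same page.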