From wflanagan at gmail.com Wed Jan 3 19:20:10 2007 From: wflanagan at gmail.com (William Flanagan) Date: Wed, 03 Jan 2007 19:20:10 -0500 Subject: [Mechanize-users] Help accessing http headers? Message-ID: Hi, I'm using Mechanize, and I've developed a lot of code around it. I'd like to be able to check the Etag header during a get to see if the page has changed, as well as some other http header information. Can I do that without hacking Mechanize myself? Does anyone have any examples of how to do this? William From aaron_patterson at speakeasy.net Wed Jan 3 21:35:48 2007 From: aaron_patterson at speakeasy.net (Aaron Patterson) Date: Wed, 3 Jan 2007 18:35:48 -0800 Subject: [Mechanize-users] Help accessing http headers? In-Reply-To: References: Message-ID: <20070104023548.GA5051@eviladmins.lan> On Wed, Jan 03, 2007 at 07:20:10PM -0500, William Flanagan wrote: > Hi, > > I'm using Mechanize, and I've developed a lot of code around it. I'd like > to be able to check the Etag header during a get to see if the page has > changed, as well as some other http header information. Can I do that > without hacking Mechanize myself? > > Does anyone have any examples of how to do this? Sure. You can access the response from the page object. Here's an example: page = WWW::Mechanize.new().get('http://www.google.com/') page.header.each_header do |k,v| puts "#{k} #{v}" end Hope that helps! -- Aaron Patterson http://tenderlovemaking.com/ From wflanagan at gmail.com Wed Jan 3 20:53:07 2007 From: wflanagan at gmail.com (William Flanagan) Date: Wed, 03 Jan 2007 20:53:07 -0500 Subject: [Mechanize-users] Help accessing http headers? In-Reply-To: <20070104023548.GA5051@eviladmins.lan> Message-ID: Aaron, thanks. Just checked it and that did the trick! Love the blog name by the way. :-) William On 1/3/07 9:35 PM, "Aaron Patterson" wrote: > > Sure. You can access the response from the page object. 
Here's an > example: > > page = WWW::Mechanize.new().get('http://www.google.com/') > > page.header.each_header do |k,v| > puts "#{k} #{v}" > end > > Hope that helps! From dsisnero at gmail.com Thu Jan 4 12:22:16 2007 From: dsisnero at gmail.com (Dominic Sisneros) Date: Thu, 4 Jan 2007 10:22:16 -0700 Subject: [Mechanize-users] Help accessing http headers? In-Reply-To: <20070104023548.GA5051@eviladmins.lan> References: <20070104023548.GA5051@eviladmins.lan> Message-ID: Is there any way to have a get option that uses the etag or not-modified header to not got a file if it hasn't changed. This will cut down on bandwidth usage if WWW::Mechanize.new().get('http://www/some_huge_file_infrequently_changed",:etag => etag, :updated => cached_time) do |page| else puts "not modified" end On 1/3/07, Aaron Patterson wrote: > > On Wed, Jan 03, 2007 at 07:20:10PM -0500, William Flanagan wrote: > > Hi, > > > > I'm using Mechanize, and I've developed a lot of code around it. I'd > like > > to be able to check the Etag header during a get to see if the page has > > changed, as well as some other http header information. Can I do that > > without hacking Mechanize myself? > > > > Does anyone have any examples of how to do this? > > Sure. You can access the response from the page object. Here's an > example: > > page = WWW::Mechanize.new().get('http://www.google.com/') > > page.header.each_header do |k,v| > puts "#{k} #{v}" > end > > Hope that helps! > > -- > Aaron Patterson > http://tenderlovemaking.com/ > _______________________________________________ > Mechanize-users mailing list > Mechanize-users at rubyforge.org > http://rubyforge.org/mailman/listinfo/mechanize-users > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://rubyforge.org/pipermail/mechanize-users/attachments/20070104/4458476a/attachment.html

From wflanagan at gmail.com Thu Jan 4 14:08:49 2007
From: wflanagan at gmail.com (William Flanagan)
Date: Thu, 04 Jan 2007 14:08:49 -0500
Subject: [Mechanize-users] Help accessing http headers?
In-Reply-To:
Message-ID:

That's a good question. I wanted to read the tag, so that I could do the analysis myself. To conserve bandwidth, you'd have to put the tag and post it in the initial get. I don't know how to do this.

Does anyone have sample code to do this, for the sake of the "Internet's" completeness?

William

On 1/4/07 12:22 PM, "Dominic Sisneros" wrote:

> Is there any way to have a get option that uses the etag or not-modified
> header to not got a file if it hasn't changed.
>
> This will cut down on bandwidth usage
>
> if WWW::Mechanize.new().get(' http://www/some_huge_file_infrequently_changed
> ",:etag => etag, :updated =>
> cached_time) do |page|
> else
> puts "not modified"
> end
>>
>> Sure. You can access the response from the page object. Here's an
>> example:
>>
>> page = WWW::Mechanize.new().get('http://www.google.com/'
>> )
>>
>> page.header.each_header do |k,v|
>> puts "#{k} #{v}"
>> end
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/mechanize-users/attachments/20070104/30109d94/attachment-0001.html

From aaron_patterson at speakeasy.net Thu Jan 4 15:56:25 2007
From: aaron_patterson at speakeasy.net (Aaron Patterson)
Date: Thu, 4 Jan 2007 12:56:25 -0800
Subject: [Mechanize-users] Help accessing http headers?
In-Reply-To:
References: <20070104023548.GA5051@eviladmins.lan>
Message-ID: <20070104205625.GA15253@eviladmins.lan>

On Thu, Jan 04, 2007 at 10:22:16AM -0700, Dominic Sisneros wrote:
> Is there any way to have a get option that uses the etag or not-modified
> header to not got a file if it hasn't changed.
>
> This will cut down on bandwidth usage
>
> if
> WWW::Mechanize.new().get('http://www/some_huge_file_infrequently_changed",:etag
> => etag, :updated => cached_time) do |page|
> else
> puts "not modified"
> end

It is possible, but not very easy right now. You can subclass mechanize, and implement the "set_headers" method to add an If-Modified-Since header. Then you'd probably have to write a pluggable parser that deals with the response code by looking up the cached page.

I was thinking of building this into mechanize, but I didn't know if anyone wanted/needed it. How important is this to people?

--
Aaron Patterson
http://tenderlovemaking.com/

From trot.thunder at gmail.com Fri Jan 12 05:39:41 2007
From: trot.thunder at gmail.com (takumi iino)
Date: Fri, 12 Jan 2007 19:39:41 +0900
Subject: [Mechanize-users] why dose to_absolute_uri use URI.escape?
Message-ID:

Hello.

This code aborts with Mechanize 0.6.4.

----------------------------
# sample.rb
require "rubygems"
require "mechanize"
agent = WWW::Mechanize.new
agent.user_agent_alias='Windows Mozilla'
# top page of wikipedia for japanese
agent.get("http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8")
-----------------------------

> ruby sample.rb
ruby sample.rb
C:/opt/ruby-1.8/lib/ruby/1.8/uri/common.rb:432:in `split': bad URI(is not URI?): http://ja.wikipedia.org/wiki/??????????? (URI::InvalidURIError)
    from C:/opt/ruby-1.8/lib/ruby/1.8/uri/common.rb:481:in `parse'
    from C:/opt/ruby-1.8/lib/ruby/gems/1.8/gems/mechanize-0.6.4/lib/mechanize.rb:272:in `to_absolute_uri'
    from C:/opt/ruby-1.8/lib/ruby/gems/1.8/gems/mechanize-0.6.4/lib/mechanize.rb:141:in `get'
    from sample.rb:6

to_absolute_uri in mechanize.rb

    url = URI.parse( URI.unescape(Util.html_unescape(url.to_s.strip)).gsub(/ /, '%20') ) unless url.is_a? URI

This code cannot handle escaped multibyte characters.

Why URI.unescape( "uri" ).gsub(/ /, '%20') ?

I guess URI.unescape( "uri" ).gsub(/ /, '%20') is not needed.

    url = URI.parse( Util.html_unescape(url.to_s.strip) ) unless url.is_a? URI

---------
takumi

From wflanagan at gmail.com Fri Jan 12 17:01:28 2007
From: wflanagan at gmail.com (William Flanagan)
Date: Fri, 12 Jan 2007 17:01:28 -0500
Subject: [Mechanize-users] Single method call to retrieve the entire page in HTML?
Message-ID:

All,

Another easy question. In Hpricot, on a doc that I am using, I can do a .to_html method and retrieve the entire page. However, this doesn't seem to work in Mechanize.

My goal is to get the text of the page and put it into a database to make it searchable with ferret (using the acts_as_ferret plugin in Rails). Does anyone have a good suggestion short of iterating over the entire document and grabbing individual texts?

Thanks,

William

From aaron_patterson at speakeasy.net Fri Jan 12 18:56:44 2007
From: aaron_patterson at speakeasy.net (Aaron Patterson)
Date: Fri, 12 Jan 2007 15:56:44 -0800
Subject: [Mechanize-users] Single method call to retrieve the entire page in HTML?
In-Reply-To:
References:
Message-ID: <20070112235644.GA15940@eviladmins.lan>

Hi William,

On Fri, Jan 12, 2007 at 05:01:28PM -0500, William Flanagan wrote:
> All,
>
> Another easy question. In Hpricot, on a doc that I am using, I can do a
> .to_html method and retrieve the entire page. However, this doesn't seem to
> work in Mechanize.

You can get the html in a page by calling "body" on the page object. For example:

    mech = WWW::Mechanize.new
    page = mech.get('http://tenderlovemaking.com/')
    puts page.body

Mechanize uses Hpricot to parse the html. If there is functionality on Hpricot that you would like to use, you can get a hold of the parser from the page object by calling the "root" method:

    puts page.root.class

>
> My goal is to get the text of the page and put it into a database to make it
> searchable with ferret (using the acts_as_ferret plugin in Rails). Does
> anyone have a good suggestion short of iterating over the entire document
> and grabbing individual texts?
> > Thanks, > > William Hope that helps! -- Aaron Patterson http://tenderlovemaking.com/ From barjunk at attglobal.net Sat Jan 27 01:57:45 2007 From: barjunk at attglobal.net (barsalou) Date: Fri, 26 Jan 2007 21:57:45 -0900 Subject: [Mechanize-users] Getting elements from a web page Message-ID: <20070126215745.vi2dlu0aasws400c@lcgalaska.com> I am new to Mechanize and was wondering if there was a built-in method to get the elements that are on the page that are not part of a form. A couple of examples would be my banking site lists my entries and I want them to go into an array so that I can handle them. Or another site I use, does some categorization for me and I would like to manipulate it and present it differently to a user. I looked through some of the maillists and found something that Paul Lutus wrote that I should be able to use: array = data.scan(%r{

([^<]+?)

}) This piece of code will find all the paragraph tags that have an image associated with them. It's clear to me that Paul understands regular expressions well....unfortunately that is not me. I just wondered, with as easy Mechanize has been to use with forms and such, it seemed like there would be something I could use that would help me accomplish my task. While I'm hoping there is a method from within Mechanize, I'll start working on my regular expressions. BTW, if I wanted to create some documentation for Mechanize, how would I submit it? Mike B. ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From schapht at gmail.com Sat Jan 27 12:05:11 2007 From: schapht at gmail.com (Mat Schaffer) Date: Sat, 27 Jan 2007 12:05:11 -0500 Subject: [Mechanize-users] Getting elements from a web page In-Reply-To: <20070126215745.vi2dlu0aasws400c@lcgalaska.com> References: <20070126215745.vi2dlu0aasws400c@lcgalaska.com> Message-ID: <1BB1EF11-8700-40BD-8840-6E9901A7FD29@gmail.com> On Jan 27, 2007, at 1:57 AM, barsalou wrote: > I am new to Mechanize and was wondering if there was a built-in method > to get the elements that are on the page that are not part of a form. > > A couple of examples would be my banking site lists my entries and I > want them to go into an array so that I can handle them. > > array = data.scan(%r{

([^<]+?)

}) This piece of code > will find all the paragraph tags that have an image associated with > them. > > BTW, if I wanted to create some documentation for Mechanize, how would > I submit it? I'm sure if you wanted to create more of a manual, just email it to this list and Aaron would probably be happy to have the help. But first, Mechanize has decent API documentation. You may not know how to get at the API docs though. Just run 'gem_server' on your local machine. Then browse to http://localhost:8808/. You'll see an [rdoc] link for Mechanize. Then just go to the WWW::Mechanize page for an overview of the package. This is pretty standard fare for most gems. Sadly there's nothing on the web that steers people to them. Anyway. Searching in mechanize is powered by hpricot. So anything that works in hpricot will also work on a mechanize Page. Sadly I don't know a real easy way to do your example. But I'd do something like this: page.search('p').find_all { |p| p.search('img') } There might be something easier. But say you were interested in all the img's that exist inside a table with id 'body'. That'd be: page.search('table#body img') Which is usually just the sort of thing I'm looking for. Anyway, check out: http://code.whytheluckystiff.net/doc/hpricot/ Which has more info about Hpricot (which is the magic behind WWW::Mechanize::Page) Hope that helps! -Mat From barjunk at attglobal.net Sat Jan 27 16:19:34 2007 From: barjunk at attglobal.net (barsalou) Date: Sat, 27 Jan 2007 12:19:34 -0900 Subject: [Mechanize-users] Getting elements from a web page In-Reply-To: <1BB1EF11-8700-40BD-8840-6E9901A7FD29@gmail.com> References: <20070126215745.vi2dlu0aasws400c@lcgalaska.com> <1BB1EF11-8700-40BD-8840-6E9901A7FD29@gmail.com> Message-ID: <20070127121934.uajf7xe96og48wk8@lcgalaska.com> Quoting Mat Schaffer : > I'm sure if you wanted to create more of a manual, just email it to > this list and Aaron would probably be happy to have the help. 
> > But first, Mechanize has decent API documentation. You may not know > how to get at the API docs though. Just run 'gem_server' on your > local machine. Then browse to http://localhost:8808/. You'll see an > [rdoc] link for Mechanize. Then just go to the WWW::Mechanize page > for an overview of the package. This is pretty standard fare for > most gems. Sadly there's nothing on the web that steers people to them. > > Anyway. Searching in mechanize is powered by hpricot. So anything > that works in hpricot will also work on a mechanize Page. > > Sadly I don't know a real easy way to do your example. But I'd do > something like this: > > page.search('p').find_all { |p| p.search('img') } > > There might be something easier. But say you were interested in all > the img's that exist inside a table with id 'body'. That'd be: > > page.search('table#body img') > > Which is usually just the sort of thing I'm looking for. > > Anyway, check out: > http://code.whytheluckystiff.net/doc/hpricot/ > I have found the API docs, but for a newbie who doesn't know anything about Hpricot and various ways to deal with web pages, I think more examples will be helpful. Thanks for the hints...I'll check them out and report back. Ruby and Mechanize(which includes Hpricot) makes working with HTML almost fun! :) Mike B. ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. 
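
A small caveat on the find_all snippet quoted above: in Ruby an empty collection is still truthy, so a block that simply returns the result of a nested search keeps every element, whether or not anything matched. The sketch below shows this with plain hashes and arrays standing in for Hpricot elements. The sample data, the :imgs key and both helper names are invented for illustration, and nothing here calls Hpricot itself.

```ruby
# Each hash stands in for a <p> element; :imgs plays the role of the
# (possibly empty) result of searching inside that element for images.
PARAGRAPHS = [
  { text: "with image",    imgs: ["logo.png"] },
  { text: "without image", imgs: [] },
].freeze

# The block returns the array itself. An empty array is truthy,
# so every paragraph is kept.
def keep_by_search_result(paragraphs)
  paragraphs.find_all { |para| para[:imgs] }
end

# The block tests for emptiness, so only paragraphs that actually
# contain an image are kept.
def keep_with_images(paragraphs)
  paragraphs.find_all { |para| !para[:imgs].empty? }
end

puts keep_by_search_result(PARAGRAPHS).length
puts keep_with_images(PARAGRAPHS).map { |para| para[:text] }.inspect
```

On a real page the equivalent would presumably be to check .empty? on the result of the inner search rather than returning it directly.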
From aaron_patterson at speakeasy.net Sun Jan 28 14:15:06 2007 From: aaron_patterson at speakeasy.net (Aaron Patterson) Date: Sun, 28 Jan 2007 11:15:06 -0800 Subject: [Mechanize-users] Getting elements from a web page In-Reply-To: <1BB1EF11-8700-40BD-8840-6E9901A7FD29@gmail.com> References: <20070126215745.vi2dlu0aasws400c@lcgalaska.com> <1BB1EF11-8700-40BD-8840-6E9901A7FD29@gmail.com> Message-ID: <20070128191506.GA10164@eviladmins.lan> On Sat, Jan 27, 2007 at 12:05:11PM -0500, Mat Schaffer wrote: > On Jan 27, 2007, at 1:57 AM, barsalou wrote: > > I am new to Mechanize and was wondering if there was a built-in method > > to get the elements that are on the page that are not part of a form. > > > > A couple of examples would be my banking site lists my entries and I > > want them to go into an array so that I can handle them. > > > > array = data.scan(%r{

([^<]+?)

}) This piece of code > > will find all the paragraph tags that have an image associated with > > them. > > > > BTW, if I wanted to create some documentation for Mechanize, how would > > I submit it? > > I'm sure if you wanted to create more of a manual, just email it to > this list and Aaron would probably be happy to have the help. Yes, I always welcome new documentation. Poor documentation really annoys me, so if something is missing or isn't clear, please let me know. > > But first, Mechanize has decent API documentation. You may not know > how to get at the API docs though. Just run 'gem_server' on your > local machine. Then browse to http://localhost:8808/. You'll see an > [rdoc] link for Mechanize. Then just go to the WWW::Mechanize page > for an overview of the package. This is pretty standard fare for > most gems. Sadly there's nothing on the web that steers people to them. Thank you! Also, you can find the documentation on the rubyforge website (although I think it is down right now): http://mechanize.rubyforge.org/ --Aaron -- Aaron Patterson http://tenderlovemaking.com/ From mbarsalou at lcgalaska.com Mon Jan 29 23:07:16 2007 From: mbarsalou at lcgalaska.com (Mike) Date: Mon, 29 Jan 2007 19:07:16 -0900 Subject: [Mechanize-users] Getting elements from a web page In-Reply-To: <1BB1EF11-8700-40BD-8840-6E9901A7FD29@gmail.com> References: <20070126215745.vi2dlu0aasws400c@lcgalaska.com> <1BB1EF11-8700-40BD-8840-6E9901A7FD29@gmail.com> Message-ID: <20070129190716.qtq9s19m8soc0g4c@lcgalaska.com> Just wanted to provide some feedback. Quoting Mat Schaffer : > Sadly I don't know a real easy way to do your example. But I'd do > something like this: > > page.search('p').find_all { |p| p.search('img') } This worked great..there was a lot more for me to learn and still struggling with how to organize this stuff in my head. Hopefully my examples below will shed some light on what more I need to learn. 
> > Anyway, check out: > http://code.whytheluckystiff.net/doc/hpricot/ This was helpful as well...especially if you first go to the README link. Also there is a reference to JQuery, which was also helpful. I realize that all the documentation is there and duplication of that documentation is a waste, but I believe more examples could help newer users get acclimated. However, Mechanize is the schizzle! (can I say that here :) ) The page that this code is for has two tables and the second table contains two rows of data with two data items for every "entry". Here is what I ended up doing: # more initialization code above this page = agent.submit(form) # divide the page into tables tables = page.search("table") # now break up the table into rows. rows = tables[1].search("tr") # the tested urls are stored in the testedurls array testedurls = rows.search("td:nth-child(0)") # the results from the tests are stored in urlresults urlresults = rows.search("td:nth-child(1)") i=1 while i < (testedurls.length + 1) i += 1 answer ="" unless urlresults[i].nil? then tmp,answer = urlresults[i].split(':') end if answer == " " then puts "The url: #{testedurls[i-1]} is not currently categorized" end end I know there are ways I can optimize the above code, but thought it better to provide the feedback. Thanks for giving me direction. Mike B. ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program.