From selltrino at gmail.com Tue Apr 29 16:23:26 2008
From: selltrino at gmail.com (Sell Trino)
Date: Tue, 29 Apr 2008 13:23:26 -0700
Subject: [Mechanize-users] Intercepting an onClick file download
Message-ID:

Hi,

I'm having some trouble downloading a .csv file from a particular
website. The file isn't available at a plain URL; you need to click on
a link in order to get the file sent. I don't know how to get mechanize
to correctly identify that.

Here is the link to the file I'm trying to retrieve:

Export to CVS

Here is my code (partial):

    agent = WWW::Mechanize.new { |a| a.log = Logger.new("mech.log") }
    agent.keep_alive = false
    agent.read_timeout = 60 # the page would time out sometimes
    url = "https://website.com/page.php4"
    page = agent.get(url)
    page.links.text(/Export to CVS/).each { |link|
      file_page = agent.click(link)
      file_page.save_as('output.csv')
      return
    }

What I get in output.csv is just the original page, not the .csv file.
If someone could please help me understand how I can nab the file
contents instead, I'd greatly appreciate it. (I actually want to
eventually parse the csv within the code, not just save it.)

Thanks!

From mat.schaffer at gmail.com Tue Apr 29 16:34:36 2008
From: mat.schaffer at gmail.com (Mat Schaffer)
Date: Tue, 29 Apr 2008 16:34:36 -0400
Subject: [Mechanize-users] Intercepting an onClick file download
In-Reply-To:
References:
Message-ID: <15C73D4E-4B5C-44B4-9D98-736C53043B35@gmail.com>

Mechanize would need JavaScript support to make this work, which I'm
pretty sure it doesn't have. Maybe Aaron has some trick up his sleeve
though, I dunno.

What I usually do in these cases is manually trace the JavaScript (in
this case the dataExport function) using Firebug and the Firefox Web
Developer toolbar. Once I get a handle on what the JavaScript is doing,
I replicate that in Ruby to build the appropriate URL and finally just
use mechanize to make the GET request.

Good luck with your project!
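
A minimal sketch of that last step, assuming the traced JavaScript
boils down to setting a couple of form fields before submitting. The
field names `export` and `exportType` are the ones reported later in
this thread; `export_url` is a hypothetical helper written for
illustration, not part of Mechanize.

```ruby
require 'uri'

# Hypothetical helper: append the fields the JavaScript handler would
# have set as a query string on the page URL.
def export_url(base, params)
  uri = URI.parse(base)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

url = export_url('https://website.com/page.php4',
                 'export' => 'Y', 'exportType' => 'csv')
puts url

# With a Mechanize agent in hand, the request itself would then be:
#   page = agent.get(url)
```

Whether a GET is accepted depends on the server; if the form really
must be POSTed, the same parameters would go into the request body
instead.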
-Mat

On Apr 29, 2008, at 4:23 PM, Sell Trino wrote:
> Hi,
>
> I'm having some trouble downloading a .csv file from a particular
> website. The file isn't part of a url, you need to click on a link in
> order to get the file sent. I don't know how to get mechanize to
> correctly identify that.
>
> Here is the link to the file I'm trying to retrieve:
>
>
> src="/img/buttons/bu_csv.gif" width="37" height="17" style="border:
> none;" alt="Export to CVS">
>
>
> Here is my code (partial):
>
> agent = WWW::Mechanize.new { |a| a.log = Logger.new("mech.log") }
> agent.keep_alive = false
> agent.read_timeout = 60 # the page would timeout sometimes
> url = "https://website.com/page.php4"
> page = agent.get(url)
> page.links.text(/Export to CVS/).each { |link|
>   file_page = agent.click(link)
>   file_page.save_as('output.csv')
>   return
> }
>
> What I get in output.csv is just the original page, not the .csv file.
> If someone could please help me understand how I can nab the file
> contents instead, I'd greatly appreciate it. (I actually want to
> eventually parse the csv within the code, not just save it)
>
> Thanks!
> _______________________________________________
> Mechanize-users mailing list
> Mechanize-users at rubyforge.org
> http://rubyforge.org/mailman/listinfo/mechanize-users

From whitethunder922 at yahoo.com Tue Apr 29 16:41:46 2008
From: whitethunder922 at yahoo.com (Matt White)
Date: Tue, 29 Apr 2008 13:41:46 -0700 (PDT)
Subject: [Mechanize-users] Intercepting an onClick file download
Message-ID: <303798.66481.qm@web53301.mail.re2.yahoo.com>

Sell,

As Mechanize doesn't interpret Javascript, you will need to dissect the
function "dataExport". If you need help with that, paste the source for
that function here and perhaps we can help more. Or if the website is
publicly accessible, let us know the URL and we can take a look at it.

Matt White

From riddochc at gmail.com Tue Apr 29 17:06:16 2008
From: riddochc at gmail.com (Chris Riddoch)
Date: Tue, 29 Apr 2008 15:06:16 -0600
Subject: [Mechanize-users] Intercepting an onClick file download
In-Reply-To:
References:
Message-ID: <6efbd9b70804291406y35725dc3p36e64c77533c04ee@mail.gmail.com>

This sort of question is clearly frequent enough to warrant
documenting. Expect a patch from me soon for this...

-- 
epistemological humility
Chris Riddoch

From selltrino at gmail.com Tue Apr 29 18:29:28 2008
From: selltrino at gmail.com (Sell Trino)
Date: Tue, 29 Apr 2008 15:29:28 -0700
Subject: [Mechanize-users] Intercepting an onClick file download
In-Reply-To: <6efbd9b70804291406y35725dc3p36e64c77533c04ee@mail.gmail.com>
References: <6efbd9b70804291406y35725dc3p36e64c77533c04ee@mail.gmail.com>
Message-ID:

Thanks guys for the feedback. I understand now the issue about this
being JavaScript.

I installed Firebug and the Firefox Web Developer toolbar (which look
very helpful, btw) and got a full dump of the JavaScript code (the
dataExport script wasn't shown on a dump of the page source until this
tool dug it out). Here is the code for it:

    /**
     * Initiate Export, type is either csv or xls
     **/
    function dataExport( type ) {

        document.getElementById('export').value = 'Y';
        document.getElementById('exportType').value = type;
        document.getElementById('selectform').method = 'POST';

        switchTarget('self');

        document.getElementById('export').value = '';
        document.getElementById('exportType').value = '';

    } // dataExport

I then made my url
"https://website.com/page.php4?export=Y&exportType=csv" and did a get
on that and it worked! (Apparently their server doesn't require it to
be a POST...) Thanks everyone for the help!

One last thing though: when I get the page, page.body.class is String.
I had set up the CSVParser via:

    require 'mechanize'
    require 'csv'

    # Pluggable parser that keeps the parsed rows alongside the file.
    class CSVParser < WWW::Mechanize::File
      attr_reader :csv
      def initialize(uri=nil, response=nil, body=nil, code=nil)
        super(uri, response, body, code)
        @csv = CSV.parse(body)
      end
    end
    agent = WWW::Mechanize.new
    agent.pluggable_parser.csv = CSVParser

And it doesn't seem to autorecognize the file as CSV.
I think it's because the content encoding is gzip, as per the log file:

    response-header: vary => User-Agent,Accept-Encoding
    response-header: cache-control => must-revalidate, post-check=0,pre-check=0
    response-header: connection => close
    response-header: x-cache => MISS from 284720
    response-header: expires => 0
    response-header: content-type => application/octetstream
    response-header: date => Tue, 29 Apr 2008 22:23:15 GMT
    response-header: content-encoding => gzip
    response-header: content-disposition => attachment; filename=file.csv
    response-header: server => Apache
    response-header: content-length => 5837
    response-header: pragma => public
    gunzip body

Not like it's a big deal to just call CSV.parse(page.body), but I'm
wondering if I'm right about why it didn't recognize and parse this
automatically as .csv.

Thanks!

From aaron.patterson at gmail.com Tue Apr 29 19:05:36 2008
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Tue, 29 Apr 2008 16:05:36 -0700
Subject: [Mechanize-users] Intercepting an onClick file download
In-Reply-To:
References: <6efbd9b70804291406y35725dc3p36e64c77533c04ee@mail.gmail.com>
Message-ID: <6959e1680804291605y2e4acaao96f537e2d1463e89@mail.gmail.com>

On Tue, Apr 29, 2008 at 3:29 PM, Sell Trino wrote:
> Thanks guys for the feedback. I understand now the issue about this
> being javascript.
>
> [...]
>
> And it doesn't seem to autorecognize the file as CSV.
> I think it's because the content encoding is gzip, as per the log
> file:
>
> response-header: vary => User-Agent,Accept-Encoding
> response-header: cache-control => must-revalidate, post-check=0,pre-check=0
> response-header: connection => close
> response-header: x-cache => MISS from 284720
> response-header: expires => 0
> response-header: content-type => application/octetstream
> response-header: date => Tue, 29 Apr 2008 22:23:15 GMT
> response-header: content-encoding => gzip
> response-header: content-disposition => attachment; filename=file.csv
> response-header: server => Apache
> response-header: content-length => 5837
> response-header: pragma => public
> gunzip body
>
> Not like it's a big deal to just CSV.parse(page.body), but just
> wondering if I'm right in why it didn't recognize and parse this
> automatically as .csv

Mechanize uses the content-type header to determine which parser to
use. The response header indicated 'application/octetstream', which
doesn't really give any hints as to the type of data you are
receiving.

-- 
Aaron Patterson
http://tenderlovemaking.com/

From aaron.patterson at gmail.com Tue Apr 29 19:08:46 2008
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Tue, 29 Apr 2008 16:08:46 -0700
Subject: [Mechanize-users] Intercepting an onClick file download
In-Reply-To: <15C73D4E-4B5C-44B4-9D98-736C53043B35@gmail.com>
References: <15C73D4E-4B5C-44B4-9D98-736C53043B35@gmail.com>
Message-ID: <6959e1680804291608l7694996fifb954e9cdff53c43@mail.gmail.com>

On Tue, Apr 29, 2008 at 1:34 PM, Mat Schaffer wrote:
> Mechanize would need javascript support to make this work. Which I'm pretty
> sure it doesn't have. Maybe Aaron has some trick up his sleeve though, I
> dunno.

Not yet. I'm working on it though..... See these:

http://tenderlovemaking.com/2008/04/23/take-it-to-the-limit-one-more-time/
http://github.com/jbarnette/johnson/tree/master

Unfortunately this project isn't my day job. ;-)

-- 
Aaron Patterson
http://tenderlovemaking.com/
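
On the content-type point earlier in the thread, here is a rough,
stdlib-only sketch of why a body served as 'application/octetstream'
comes back as a plain String while the same bytes served as 'text/csv'
would be parsed. `PARSERS` and `parse_response` are hypothetical names
invented for this illustration; this models the idea of dispatching on
the content-type header and is not Mechanize's actual implementation.

```ruby
require 'csv'

# Hypothetical registry: media type => callable that parses the body.
PARSERS = {
  'text/csv'                    => ->(body) { CSV.parse(body) },
  'text/comma-separated-values' => ->(body) { CSV.parse(body) }
}
# Any media type without a registered parser keeps the raw String.
PARSERS.default = ->(body) { body }

# Pick a parser from the Content-Type header alone, ignoring any
# parameters such as "; charset=utf-8".
def parse_response(content_type, body, parsers = PARSERS)
  media_type = content_type.to_s.split(';').first.to_s.strip.downcase
  parsers[media_type].call(body)
end

body = "a,b\n1,2\n"
p parse_response('application/octetstream', body).class
p parse_response('text/csv', body)
```

Note that content-encoding (gzip) plays no part in the lookup; the log
above shows the body being gunzipped before any parser sees it. With
Mechanize itself, the pluggable parser can, as far as I know, also be
keyed by an explicit content-type string, for example
agent.pluggable_parser['application/octetstream'] = CSVParser, though
that is worth checking against the documentation for your version.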