From gabe at mudbugmedia.com Thu Dec 6 11:04:45 2007
From: gabe at mudbugmedia.com (Gabe Martin-Dempesy)
Date: Thu, 6 Dec 2007 10:04:45 -0600
Subject: [Mechanize-users] Too many open files leads to timeout exceptions in Mechanize/Net::HTTP?
Message-ID: <874121C1-F7F6-42B0-A6C1-EE0277BD16FE@mudbugmedia.com>

I'm experiencing an issue where my get/submit calls to the Mechanize
agent are leading to timeout exceptions when the ruby script has too
many open file descriptors ( > 1000). However, I'm not seeing anything
about an overstep violation sent anywhere to syslog, and no error
message sent to stdout.

The process has the large amount of open files because it's executed by
PHP from an Apache install with > 500 VHosts (with two logs for each
vhost), and all of the open FD's get inherited into the ruby process (as
reported by lsof). I can get the script operating properly by commenting
out several of the logs in my Apache install, but I'd prefer to get to
the heart of what wall Ruby/Mechanize/Net::HTTP is running into (and not
logging/reporting).

Since the cause of this is too many open files, I've tried the
following:

* Setting the hard and soft ulimits in /etc/limits and
  /etc/security/limits.conf (however if there was an overstep,
  grsecurity should have logged this to the syslog/dmesg)
* # cat /proc/sys/fs/file-max
  205126
* Recompiled ruby (ruby 1.8.6 (2007-09-24 patchlevel 111) [i686-linux])
  after updating the ulimits (this was something recommended in a
  similar situation with Squid proxy, although I didn't see anything
  mentioned about FD_SETSIZE or similar)

What I'm imagining is that the default FD_SETSIZE defined in the linux
headers is what's getting hit, although I've heard that modifying those
headers to increase the value can lead to instability in some services.

I'm hoping that someone can tell me:

* Which limit the process is hitting, and how to extend that limit
* Is this potentially a bug in Ruby that should go upstream?
* If there's any sane way to discard all the inherited open FD's that
  it inherits
* (and a long-shot non-ruby PHP question) if there's any way I can have
  PHP execute the ruby script and have it not inherit the FD's

== References ==

Sample driver of the .php script that executes the ruby script:

> <?php
> $id = 108; // hard-coded test
> $command =
>     'ruby ' .
>     escapeshellarg('/www/CLIENT/htdocs/include/script/nysif_scrape/run.rb') . ' ' .
>     escapeshellarg($id) .
>     ' >/dev/null &'; // /dev/null redir is needed to keep program in the background
>
> system($command);
> ?>

Here's the backtrace on the exception that gets thrown on the timeout:

> /usr/lib/ruby/1.8/timeout.rb:54:in `rbuf_fill'
> /usr/lib/ruby/1.8/timeout.rb:56:in `timeout'
> /usr/lib/ruby/1.8/timeout.rb:76:in `timeout'
> /usr/lib/ruby/1.8/net/protocol.rb:132:in `rbuf_fill'
> /usr/lib/ruby/1.8/net/protocol.rb:116:in `readuntil'
> /usr/lib/ruby/1.8/net/protocol.rb:126:in `readline'
> /usr/lib/ruby/1.8/net/http.rb:2029:in `read_status_line'
> /usr/lib/ruby/1.8/net/http.rb:2018:in `read_new'
> /usr/lib/ruby/1.8/net/http.rb:1059:in `request'
> /usr/lib/ruby/gems/1.8/gems/mechanize-0.6.10/lib/mechanize.rb:514:in `fetch_page'
> /usr/lib/ruby/gems/1.8/gems/mechanize-0.6.10/lib/mechanize.rb:185:in `get'
> /www/CLIENT/htdocs/include/script/nysif_scrape/lib/AuthedAgent.rb:18:in `initialize'
> /usr/lib/ruby/1.8/singleton.rb:95:in `new'
> /usr/lib/ruby/1.8/singleton.rb:95:in `instance'
> www/CLIENT/htdocs/include/script/nysif_scrape/run.rb:20

For reference, here's my AuthedAgent class definition. It's a Singleton
providing access to an Agent that has already logged into the Login form
on the site, and $import is just an ActiveRecord row to allow for ajax-y
progress bar updates on the web-side. It times out on either the 'get'
or the 'submit' method calls.

> class AuthedAgent < WWW::Mechanize
>   include Singleton
>
>   def initialize
>     super
>
>     $import.status = 'connecting';
>     $import.save!
>     login_page = get($config[:urls][:login])
>     raise SiteChangedException, "Login page is 404" if (!login_page.title.nil? && login_page.title.include?("The page cannot be found"))
>
>     $import.status = 'authenticating';
>     $import.save!
>     form = login_page.form('frmLogin0')
>     raise SiteChangedException, "Could not locate form 'frmLogin0' on login page" if form.nil?
>     form.fields.name("LOGIN").value = $config[:username]
>     form.fields.name("PWD").value = $config[:password]
>     post_login = submit(form)
>     raise SiteChangedException, "Username and password do not appear valid. Check credentials, or possible site change." if !authenticated?
>   end
>
>   def authenticated?
>     cookie = cookies.find { |c| c.name == "LOGIN" }
>     raise SiteChangedException, "'LOGIN' cookie is no longer being set" if cookie.nil?
>     cookie.value.include? "TAG3=" + $config[:username]
>   end
> end
>

Gabe Martin-Dempesy
Mudbug Media
504-212-2161 // 504-581-INFO x201

From aaron at tenderlovemaking.com Sun Dec 9 22:06:05 2007
From: aaron at tenderlovemaking.com (Aaron Patterson)
Date: Sun, 9 Dec 2007 19:06:05 -0800
Subject: [Mechanize-users] Road to 0.7.0
Message-ID: <20071210030605.GA13439@mac-mini.lan>

Hey everyone,

I've been refactoring Mechanize for an 0.7.0 release. Basically I'm
trying to clean the code up and there are a few features that I think
are unnecessary, but I would like to ask people first.

1) REXML as a parser.

I want to remove support for REXML. I don't use it. Hpricot seems to
do everything I need.

2) 1.8.2 thru 1.8.4 support

I've got a bunch of monkey patches for 1.8.2 thru 1.8.4. I'd like to
remove these because I think most people are on 1.8.5 or up.

3) Page#watch_for_set

I am going to remove this method. It made sense when REXML was the main
parser, since REXML was so slow. I think that Hpricot is fast enough
that this method is not so useful.

I'm going to make 0.7.0 lazily build up form and link objects, which
should give everyone a slight speed increase but makes watch_for_set
obsolete (sort of).

I'm changing around the class names to be better organized, but they
should all have the same methods. Also, if there are any feature
requests, let me know!

-- 
Aaron Patterson
http://tenderlovemaking.com/

From barjunk at attglobal.net Mon Dec 10 15:32:08 2007
From: barjunk at attglobal.net (barsalou)
Date: Mon, 10 Dec 2007 11:32:08 -0900
Subject: [Mechanize-users] Road to 0.7.0
In-Reply-To: <20071210030605.GA13439@mac-mini.lan>
References: <20071210030605.GA13439@mac-mini.lan>
Message-ID: <20071210113208.o81fisiqtcwk0484@lcgalaska.com>

Quoting Aaron Patterson :

> Hey everyone,
>
> I've been refactoring Mechanize for an 0.7.0 release. Basically I'm
> trying to clean the code up and there are a few features that I think
> are unnecessary, but I would like to ask people first.
>
> 1) REXML as a parser.
>
> I want to remove support for REXML. I don't use it. Hpricot seems to
> do everything I need.
>
> 2) 1.8.2 thru 1.8.4 support
>
> I've got a bunch of monkey patches for 1.8.2 thru 1.8.4. I'd like to
> remove these because I think most people are on 1.8.5 or up.

I'm only one person, but I'm still on 1.8.4 (this assumes you are
talking about Ruby 1.8.4). Doesn't seem likely that we would be
upgrading to 1.8.5 anytime soon.

I guess I better get with the program! :)

>
> 3) Page#watch_for_set
>
> I am going to remove this method. It made sense when REXML was the main
> parser, since REXML was so slow. I think that Hpricot is fast enough
> that this method is not so useful.
>
> I'm going to make 0.7.0 lazily build up form and link objects, which
> should give everyone a slight speed increase but makes watch_for_set
> obsolete (sort of).
>
> I'm changing around the class names to be better organized, but they
> should all have the same methods. Also, if there are any feature
> requests, let me know!

I'm interested in the possibility of interacting with Java scripts...not
sure if that is even possible though.

I thought someone had mentioned it in the past, but since you were
asking... :)

Mike B.

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

From aaron at tenderlovemaking.com Mon Dec 10 19:51:08 2007
From: aaron at tenderlovemaking.com (Aaron Patterson)
Date: Mon, 10 Dec 2007 16:51:08 -0800
Subject: [Mechanize-users] Road to 0.7.0
In-Reply-To: <20071210113208.o81fisiqtcwk0484@lcgalaska.com>
References: <20071210030605.GA13439@mac-mini.lan>
    <20071210113208.o81fisiqtcwk0484@lcgalaska.com>
Message-ID: <20071211005108.GA25749@mac-mini.lan>

On Mon, Dec 10, 2007 at 11:32:08AM -0900, barsalou wrote:
> Quoting Aaron Patterson :
>
>> Hey everyone,
>>
>> I've been refactoring Mechanize for an 0.7.0 release. Basically I'm
>> trying to clean the code up and there are a few features that I think
>> are unnecessary, but I would like to ask people first.
>>
>> 1) REXML as a parser.
>>
>> I want to remove support for REXML. I don't use it. Hpricot seems to
>> do everything I need.
>>
>> 2) 1.8.2 thru 1.8.4 support
>>
>> I've got a bunch of monkey patches for 1.8.2 thru 1.8.4. I'd like to
>> remove these because I think most people are on 1.8.5 or up.
>
> I'm only one person, but I'm still on 1.8.4 (this assumes you are talking
> about Ruby 1.8.4). Doesn't seem likely that we would be upgrading to 1.8.5
> anytime soon.
>
> I guess I better get with the program! :)

Okay, I'll leave my ruby 1.8.4 monkey patches in. I'm going to add
deprecation warnings though.

>
>>
>> 3) Page#watch_for_set
>>
>> I am going to remove this method. It made sense when REXML was the main
>> parser, since REXML was so slow. I think that Hpricot is fast enough
>> that this method is not so useful.
>>
>> I'm going to make 0.7.0 lazily build up form and link objects, which
>> should give everyone a slight speed increase but makes watch_for_set
>> obsolete (sort of).
>>
>> I'm changing around the class names to be better organized, but they
>> should all have the same methods. Also, if there are any feature
>> requests, let me know!
>
> I'm interested in the possibility of interacting with Java scripts...not
> sure if that is even possible though.
>
> I thought someone had mentioned it in the past, but since you were
> asking... :)

Yes. I'm still working on it. Mechanize 0.7.0 is a midpoint. I'm
refactoring it in preparation for 1.0.0. I'm planning on 1.0.0 to have
javascript support, but it will have some (maybe many!) api
incompatibilities.

I want JS support too, so I'm working on it!

-- 
Aaron Patterson
http://tenderlovemaking.com/

From barjunk at attglobal.net Tue Dec 11 01:17:29 2007
From: barjunk at attglobal.net (barsalou)
Date: Mon, 10 Dec 2007 21:17:29 -0900
Subject: [Mechanize-users] Road to 0.7.0
In-Reply-To: <20071211005108.GA25749@mac-mini.lan>
References: <20071210030605.GA13439@mac-mini.lan>
    <20071210113208.o81fisiqtcwk0484@lcgalaska.com>
    <20071211005108.GA25749@mac-mini.lan>
Message-ID: <20071210211729.0u7k9ux1wo8sosgs@lcgalaska.com>

Quoting Aaron Patterson :

> On Mon, Dec 10, 2007 at 11:32:08AM -0900, barsalou wrote:
>> Quoting Aaron Patterson :
>>
> Okay, I'll leave my ruby 1.8.4 monkey patches in. I'm going to add
> deprecation warnings though.
>

Thanks. I'll work on powers that be to move to 1.8.6.

>>
>
> Yes. I'm still working on it. Mechanize 0.7.0 is a midpoint. I'm
> refactoring it in preparation for 1.0.0. I'm planning on 1.0.0 to have
> javascript support, but it will have some (maybe many!) api
> incompatibilities.
>
> I want JS support too, so I'm working on it!

Mechanize is a good library. Thanks for creating it.

Mike B.
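
The "lazily build up form and link objects" idea from the 0.7.0 announcement can be illustrated with a small Ruby sketch. This is not Mechanize's implementation: the LazyPage class, its scans counter, and the regex-based extraction are hypothetical stand-ins chosen to keep the example self-contained (the thread says Mechanize parses with Hpricot). It only shows the general pattern of deferring work until first access and caching the result.

```ruby
# Hypothetical stand-in for a fetched page; not a Mechanize class.
class LazyPage
  attr_reader :scans

  def initialize(body)
    @body = body
    @scans = 0 # how many times the body has actually been scanned
  end

  # Link targets are extracted on first access and cached afterwards.
  def links
    @links ||= scan(/<a\s[^>]*href="([^"]*)"/i)
  end

  # Form names are extracted independently, also on first access.
  def form_names
    @form_names ||= scan(/<form\s[^>]*name="([^"]*)"/i)
  end

  private

  # Naive regex extraction, for illustration only.
  def scan(pattern)
    @scans += 1
    @body.scan(pattern).flatten
  end
end
```

A script that only follows links never pays for form extraction, which is presumably where the "slight speed increase" mentioned in the announcement would come from.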
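
On the first thread's question about discarding inherited file descriptors, here is a minimal Ruby sketch. It assumes a Linux host where /proc/self/fd lists the process's open descriptors. The helper names open_fd_numbers and close_inherited_fds are hypothetical, not part of Ruby or Mechanize. Whether this removes the timeouts depends on the FD_SETSIZE guess being right, which the thread does not confirm.

```ruby
# Hypothetical helper: descriptor numbers currently open in this process.
# Assumes Linux, where /proc/self/fd has one entry per open descriptor.
def open_fd_numbers(dir = "/proc/self/fd")
  Dir.entries(dir).grep(/\A\d+\z/).map { |name| name.to_i }.sort
end

# Hypothetical helper: close every descriptor in +fds+ that is not in +keep+.
# Returns the descriptor numbers that were actually closed.
def close_inherited_fds(keep = [0, 1, 2], fds = open_fd_numbers)
  closed = []
  (fds - keep).each do |fd|
    begin
      IO.for_fd(fd).close
      closed << fd
    rescue Errno::EBADF, ArgumentError
      # EBADF: already closed (for example, the descriptor that was used
      # to read the directory listing itself).
      # ArgumentError: newer Ruby versions refuse to wrap descriptors
      # that the interpreter reserves for its own use.
    end
  end
  closed
end

# Intended use, once, at the top of the script before any sockets exist:
#   close_inherited_fds
```

Called that way, only stdin, stdout, and stderr stay open, so sockets opened afterwards get low descriptor numbers again instead of landing above the inherited Apache log handles.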