From samuel.kadolph at jadedpixel.com Wed Jul 18 18:52:50 2012
From: samuel.kadolph at jadedpixel.com (Samuel Kadolph)
Date: Wed, 18 Jul 2012 14:52:50 -0400
Subject: Unicorn is killing our rainbows workers
Message-ID:

Hey rainbows-talk,

We have 40 servers that each run rainbows with 2 workers with 100
threads using ThreadPool. We're having an issue where unicorn is
killing the worker process. We use ThreadTimeout (set to 70 seconds)
and originally had the unicorn timeout set to 150 seconds and we're
seeing unicorn eventually killing each worker. So we bumped the
timeout to 300 seconds and it took about 5 minutes but we started
seeing unicorn starting to kill workers again. You can see our stderr
log file (timeout at 300s) at
https://gist.github.com/9ec96922e55a59753997. Any insight into why
unicorn is killing our ThreadPool workers would help us greatly. If
you require additional info I would be happy to provide it.

Samuel Kadolph
samuel.kadolph at shopify.com
16134043579

From jlewis73 at jhu.edu Wed Jul 18 19:20:22 2012
From: jlewis73 at jhu.edu (Jason Lewis)
Date: Wed, 18 Jul 2012 19:20:22 +0000
Subject: Unicorn is killing our rainbows workers
In-Reply-To:
Message-ID:

Sorry to add unproductive chatter, but this is the best subject line I've
ever seen on a tech mailing list. :-)

Jason

On 2012-07-18 14:52 , "Samuel Kadolph" wrote:

>Hey rainbows-talk,
>
>We have 40 servers that each run rainbows with 2 workers with 100
>threads using ThreadPool. We're having an issue where unicorn is
>killing the worker process. We use ThreadTimeout (set to 70 seconds)
>and originally had the unicorn timeout set to 150 seconds and we're
>seeing unicorn eventually killing each worker. So we bumped the
>timeout to 300 seconds and it took about 5 minutes but we started
>seeing unicorn starting to kill workers again. You can see our stderr
>log file (timeout at 300s) at
>https://gist.github.com/9ec96922e55a59753997.
>Any insight into why
>unicorn is killing our ThreadPool workers would help us greatly. If
>you require additional info I would be happy to provide it.
>
>Samuel Kadolph
>samuel.kadolph at shopify.com
>16134043579
>_______________________________________________
>Rainbows! mailing list - rainbows-talk at rubyforge.org
>http://rubyforge.org/mailman/listinfo/rainbows-talk
>Do not quote signatures (like this one) or top post when replying

From normalperson at yhbt.net Wed Jul 18 21:52:22 2012
From: normalperson at yhbt.net (Eric Wong)
Date: Wed, 18 Jul 2012 21:52:22 +0000
Subject: Unicorn is killing our rainbows workers
In-Reply-To:
References:
Message-ID: <20120718215222.GA11539@dcvr.yhbt.net>

Samuel Kadolph wrote:
> Hey rainbows-talk,
>
> We have 40 servers that each run rainbows with 2 workers with 100
> threads using ThreadPool. We're having an issue where unicorn is
> killing the worker process. We use ThreadTimeout (set to 70 seconds)
> and originally had the unicorn timeout set to 150 seconds and we're
> seeing unicorn eventually killing each worker. So we bumped the
> timeout to 300 seconds and it took about 5 minutes but we started
> seeing unicorn starting to kill workers again. You can see our stderr
> log file (timeout at 300s) at
> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> unicorn is killing our ThreadPool workers would help us greatly. If
> you require additional info I would be happy to provide it.

Which Ruby version/patchlevel are you using? 1.8 and 1.9 have vastly
different thread implementations and workarounds to deal with.

What C extensions are you using?

ThreadTimeout might also be conflicting with some libraries you use and
causing deadlocks. Also, ThreadTimeout might not be a good idea with
many common libraries which:

1) use the stdlib Timeout internally
2) rely on ensure clauses firing

ThreadTimeout turns out to be difficult to use correctly with existing
code, so it may not be appropriate for you.
Your app should use
localized timeouts as much as possible (using timeout mechanisms built
into libraries you use).

Also, please don't use private gist (especially when posting to public
mailing list), it requires a github account to clone from and I'll
never require (nor encourage :P) needing any website account
for contributing to Rainbows!, just an email address.

From samuel.kadolph at shopify.com Wed Jul 18 23:06:07 2012
From: samuel.kadolph at shopify.com (Samuel Kadolph)
Date: Wed, 18 Jul 2012 19:06:07 -0400
Subject: Unicorn is killing our rainbows workers
In-Reply-To: <20120718215222.GA11539@dcvr.yhbt.net>
References: <20120718215222.GA11539@dcvr.yhbt.net>
Message-ID:

On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote:
> Samuel Kadolph wrote:
>> Hey rainbows-talk,
>>
>> We have 40 servers that each run rainbows with 2 workers with 100
>> threads using ThreadPool. We're having an issue where unicorn is
>> killing the worker process. We use ThreadTimeout (set to 70 seconds)
>> and originally had the unicorn timeout set to 150 seconds and we're
>> seeing unicorn eventually killing each worker. So we bumped the
>> timeout to 300 seconds and it took about 5 minutes but we started
>> seeing unicorn starting to kill workers again. You can see our stderr
>> log file (timeout at 300s) at
>> https://gist.github.com/9ec96922e55a59753997. Any insight into why
>> unicorn is killing our ThreadPool workers would help us greatly. If
>> you require additional info I would be happy to provide it.
>
> Which Ruby version/patchlevel are you using? 1.8 and 1.9 have vastly
> different thread implementations and workarounds to deal with.
>
> What C extensions are you using?
>
> ThreadTimeout might also be conflicting with some libraries you use and
> causing deadlocks.
> Also, ThreadTimeout might not be a good idea with
> many common libraries which:
>
> 1) use the stdlib Timeout internally
> 2) rely on ensure clauses firing
>
> ThreadTimeout turns out to be difficult to use correctly with existing
> code, so it may not be appropriate for you. Your app should use
> localized timeouts as much as possible (using timeout mechanisms built
> into libraries you use).
>
> Also, please don't use private gist (especially when posting to public
> mailing list), it requires a github account to clone from and I'll
> never require (nor encourage :P) needing any website account
> for contributing to Rainbows!, just an email address.

We're running ruby 1.9.3-p125 with the performance patches at
https://gist.github.com/1688857. I listed the gems we use and which
ones have C extensions at https://gist.github.com/3139226.

We'll try running without the ThreadTimeout. We don't think we're
having deadlock issues because our stress tests do not time out but
they do 502 when the rainbows worker gets killed during a request.

From normalperson at yhbt.net Thu Jul 19 00:26:41 2012
From: normalperson at yhbt.net (Eric Wong)
Date: Thu, 19 Jul 2012 00:26:41 +0000
Subject: Unicorn is killing our rainbows workers
In-Reply-To:
References: <20120718215222.GA11539@dcvr.yhbt.net>
Message-ID: <20120719002641.GA17210@dcvr.yhbt.net>

Samuel Kadolph wrote:
> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote:
> > Samuel Kadolph wrote:
> >> Hey rainbows-talk,
> >>
> >> We have 40 servers that each run rainbows with 2 workers with 100
> >> threads using ThreadPool. We're having an issue where unicorn is
> >> killing the worker process. We use ThreadTimeout (set to 70 seconds)
> >> and originally had the unicorn timeout set to 150 seconds and we're
> >> seeing unicorn eventually killing each worker. So we bumped the
> >> timeout to 300 seconds and it took about 5 minutes but we started
> >> seeing unicorn starting to kill workers again.
> >> You can see our stderr
> >> log file (timeout at 300s) at
> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> >> unicorn is killing our ThreadPool workers would help us greatly. If
> >> you require additional info I would be happy to provide it.

Also, are you using "preload_app true" ?

I'm a bit curious how these messages are happening, too:
D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after suspend/hibernation

Can you tell (from Rails logs) if the to-be-killed workers are still
processing requests/responses the 300s before when the unicorn timeout
hits it? AFAIK, Rails logs the PID of each worker processing the
request.

Also, what in your app takes 150s, or even 70s? I'm curious why the
timeouts are so high. I wonder if there are bugs with unicorn/rainbows
with huge timeout values, too...

If anything, I'd lower the unicorn timeout to something low (maybe
5-10s) since that detects hard lockups at the VM level. Individual
requests in Rainbows! _are_ allowed to take longer than the unicorn
timeout.

Can you reproduce this in a simulation environment or only with real
traffic? If possible, can you setup an instance with a single worker
process and get an strace ("strace -f") of all the threads when this
happens?

> We're running ruby 1.9.3-p125 with the performance patches at
> https://gist.github.com/1688857.

Can you reproduce this with an unpatched 1.9.3-p194? I'm not too
familiar with the performance patches, but I'd like to reduce the amount
of less-common/tested code to isolate the issue.

> I listed the gems we use and which
> ones have C extensions at https://gist.github.com/3139226.

Fortunately, I'm familiar with nearly all of these C gems.

Newer versions of mysql2 should avoid potential issues with
ThreadTimeout/Timeout (or anything that hits Thread#kill).
I think
mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare
(but possibly related to your issue) bug.

Unrelated to your current issue, I strongly suggest Ruby 1.9.3-p194,
previous versions had a nasty GC memory corruption bug triggered
by Nokogiri (ref: https://github.com/tenderlove/nokogiri/issues/616)

I also have no idea why mongrel is in there :x

> We'll try running without the ThreadTimeout. We don't think we're
> having deadlock issues because our stress tests do not time out but
> they do 502 when the rainbows worker gets killed during a request.

OK. I'm starting to believe ThreadTimeout isn't good for the majority
of applications out there, and perhaps the only way is to have support
for this tightly coupled with the VM. Even then, "ensure" clauses would
still be tricky/ugly to deal with... So maybe forcing developers to use
app/library-level timeouts for everything they do is the only way.

From samuel.kadolph at shopify.com Thu Jul 19 14:29:02 2012
From: samuel.kadolph at shopify.com (Samuel Kadolph)
Date: Thu, 19 Jul 2012 10:29:02 -0400
Subject: Unicorn is killing our rainbows workers
In-Reply-To: <20120719002641.GA17210@dcvr.yhbt.net>
References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net>
Message-ID:

On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong wrote:
> Samuel Kadolph wrote:
>> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote:
>> > Samuel Kadolph wrote:
>> >> Hey rainbows-talk,
>> >>
>> >> We have 40 servers that each run rainbows with 2 workers with 100
>> >> threads using ThreadPool. We're having an issue where unicorn is
>> >> killing the worker process. We use ThreadTimeout (set to 70 seconds)
>> >> and originally had the unicorn timeout set to 150 seconds and we're
>> >> seeing unicorn eventually killing each worker. So we bumped the
>> >> timeout to 300 seconds and it took about 5 minutes but we started
>> >> seeing unicorn starting to kill workers again.
>> >> You can see our stderr
>> >> log file (timeout at 300s) at
>> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
>> >> unicorn is killing our ThreadPool workers would help us greatly. If
>> >> you require additional info I would be happy to provide it.
>
> Also, are you using "preload_app true" ?

Yes we are using preload_app true.

> I'm a bit curious how these messages are happening, too:
> D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
> suspend/hibernation

They are strange. My current hunch is the killing and that message are
symptoms of the same issue. Since it always follows a killing.

> Can you tell (from Rails logs) if the to-be-killed workers are still
> processing requests/responses the 300s before when the unicorn timeout
> hits it? AFAIK, Rails logs the PID of each worker processing the
> request.

Rails doesn't log the PID, but it would seem that after upgrading to
mysql2 0.2.18 it is no longer killing workers that are busy with
requests.

> Also, what in your app takes 150s, or even 70s? I'm curious why the
> timeouts are so high. I wonder if there are bugs with unicorn/rainbows
> with huge timeout values, too...
>
> If anything, I'd lower the unicorn timeout to something low (maybe
> 5-10s) since that detects hard lockups at the VM level. Individual
> requests in Rainbows! _are_ allowed to take longer than the unicorn
> timeout.

We lowered the unicorn timeout to 5 seconds, but that did not
stop the killings, though they seem to be happening less often. I have
some of our stderr logs after setting the timeout to 5 seconds at
https://gist.github.com/3144250.

> Can you reproduce this in a simulation environment or only with real
> traffic? If possible, can you setup an instance with a single worker
> process and get an strace ("strace -f") of all the threads when this
> happens?

We haven't been able to reproduce it locally.
We have a staging
environment for this app so I will see if I can use it and try to
replicate it.

>> We're running ruby 1.9.3-p125 with the performance patches at
>> https://gist.github.com/1688857.
>
> Can you reproduce this with an unpatched 1.9.3-p194? I'm not too
> familiar with the performance patches, but I'd like to reduce the amount
> of less-common/tested code to isolate the issue.

We cannot try p194 right now because one of our ops is on a trip but
once he's back I'm sure we'll try that and let you know.

>> I listed the gems we use and which
>> ones have C extensions at https://gist.github.com/3139226.
>
> Fortunately, I'm familiar with nearly all of these C gems.
>
> Newer versions of mysql2 should avoid potential issues with
> ThreadTimeout/Timeout (or anything that hits Thread#kill). I think
> mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare
> (but possibly related to your issue) bug.

Upgrading mysql2 seems to have stopped unicorn from killing workers
that are currently busy. We were stress testing it last night and
after we upgraded to 0.2.18 we had no more 502s from the app but this
could be a coincidence since the killings are still happening.

> Unrelated to your current issue, I strongly suggest Ruby 1.9.3-p194,
> previous versions had a nasty GC memory corruption bug triggered
> by Nokogiri (ref: https://github.com/tenderlove/nokogiri/issues/616)
>
> I also have no idea why mongrel is in there :x

I forgot to only show bundle for production.

>> We'll try running without the ThreadTimeout. We don't think we're
>> having deadlock issues because our stress tests do not time out but
>> they do 502 when the rainbows worker gets killed during a request.
>
> OK. I'm starting to believe ThreadTimeout isn't good for the majority
> of applications out there, and perhaps the only way is to have support
> for this tightly coupled with the VM. Even then, "ensure" clauses would
> still be tricky/ugly to deal with...
> So maybe forcing developers to use
> app/library-level timeouts for everything they do is the only way.

Our ops guys say we had this problem before we were using ThreadTimeout.

From normalperson at yhbt.net Thu Jul 19 20:14:03 2012
From: normalperson at yhbt.net (Eric Wong)
Date: Thu, 19 Jul 2012 13:14:03 -0700
Subject: [PATCH] thread_timeout: document additional caveats
Message-ID: <20120719201403.GA11657@dcvr.yhbt.net>

Again, for the one thousandth time, timing out threads is very
tricky business :<
---
  Pushed to "master" of git://bogomips.org/rainbows and updated
  http://rainbows.rubyforge.org/Rainbows/ThreadTimeout.html

 lib/rainbows/thread_timeout.rb | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/lib/rainbows/thread_timeout.rb b/lib/rainbows/thread_timeout.rb
index 4f62aba..8348272 100644
--- a/lib/rainbows/thread_timeout.rb
+++ b/lib/rainbows/thread_timeout.rb
@@ -44,6 +44,15 @@
 # does not expose a monotonic clock for users, so don't change
 # the system time while this is running. All servers should be
 # running ntpd anyways.
+#
+# "ensure" clauses may not fire properly or be interrupted during
+# execution, so do not mix this module with code which relies on "ensure".
+# (This is also true for the "Timeout" module in the Ruby standard library)
+#
+# "recursive locking" ThreadError exceptions may occur if
+# ThreadTimeout fires while a Mutex is locked (because "ensure"
+# clauses may not fire properly).
+
 class Rainbows::ThreadTimeout
 
   # :stopdoc:
-- 
Eric Wong

From normalperson at yhbt.net Thu Jul 19 20:16:33 2012
From: normalperson at yhbt.net (Eric Wong)
Date: Thu, 19 Jul 2012 20:16:33 +0000
Subject: Unicorn is killing our rainbows workers
In-Reply-To:
References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net>
Message-ID: <20120719201633.GA8203@dcvr.yhbt.net>

Samuel Kadolph wrote:
> On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong wrote:
> > Samuel Kadolph wrote:
> >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote:
> >> > Samuel Kadolph wrote:
> >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> >> >> unicorn is killing our ThreadPool workers would help us greatly. If
> >> >> you require additional info I would be happy to provide it.
> >
> > Also, are you using "preload_app true" ?
>
> Yes we are using preload_app true.
>
> > I'm a bit curious how these messages are happening, too:
> > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
> > suspend/hibernation
>
> They are strange. My current hunch is the killing and that message are
> symptoms of the same issue. Since it always follows a killing.

I wonder if there's some background thread one of your gems spawns on
load that causes the master to stall. I'm not seeing how else unicorn
could think it was in suspend/hibernation.

> > Can you tell (from Rails logs) if the to-be-killed workers are still
> > processing requests/responses the 300s before when the unicorn timeout
> > hits it? AFAIK, Rails logs the PID of each worker processing the
> > request.
>
> Rails doesn't log the PID, but it would seem that after upgrading to
> mysql2 0.2.18 it is no longer killing workers that are busy with
> requests.

Oops, I think I've been spoiled into thinking the Hodel3000CompliantLogger
is the default Rails logger :)

> > If anything, I'd lower the unicorn timeout to something low (maybe
> > 5-10s) since that detects hard lockups at the VM level.
> > Individual
> > requests in Rainbows! _are_ allowed to take longer than the unicorn
> > timeout.
>
> We lowered the unicorn timeout to 5 seconds, but that did not
> stop the killings, though they seem to be happening less often. I have
> some of our stderr logs after setting the timeout to 5 seconds at
> https://gist.github.com/3144250.

Thanks for trying that!

> > Newer versions of mysql2 should avoid potential issues with
> > ThreadTimeout/Timeout (or anything that hits Thread#kill). I think
> > mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare
> > (but possibly related to your issue) bug.
>
> Upgrading mysql2 seems to have stopped unicorn from killing workers
> that are currently busy. We were stress testing it last night and
> after we upgraded to 0.2.18 we had no more 502s from the app but this
> could be a coincidence since the killings are still happening.

Alright, good to know 0.2.18 solved your problems. Btw, have you
noticed any general connectivity issues to your MySQL server?
There were quite a few bugfixes from 0.2.6..0.2.18, though.

Anyways, I'm happy your problem seems to be fixed with the mysql2
upgrade :)

> Our ops guys say we had this problem before we were using ThreadTimeout.

OK. That's somewhat reassuring to know (especially since the culprit
seems to be an old mysql2 gem). I've had other users (privately) report
issues with recursive locking because of ensure clauses (e.g.
Mutex#synchronize) that I forgot to document.
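
The "ensure" caveat discussed in this thread can be seen with the stdlib
Timeout module alone, which the thread_timeout documentation patch says
shares the limitation. The following is only a minimal sketch, not code
from Rainbows! or from anyone's app: the 0.1s budget and the 0.5s sleep
are made-up values standing in for a request timeout and a slow cleanup
step.

```ruby
require 'timeout'

cleanup_finished = false

begin
  # Hypothetical 0.1s budget. The "cleanup" in the ensure clause is
  # deliberately slower than that, so the timeout fires while the
  # ensure clause is still running.
  Timeout.timeout(0.1) do
    begin
      # the request's real work would go here
    ensure
      sleep 0.5                # stand-in for slow cleanup (e.g. releasing a lock)
      cleanup_finished = true  # skipped when the timeout interrupts the sleep
    end
  end
rescue Timeout::Error
  # the caller only sees an ordinary timeout error
end

puts "cleanup finished? #{cleanup_finished}"
```

The ensure clause starts but is cut off partway, which is the kind of
half-finished cleanup that could leave a Mutex locked. Library-level
timeouts avoid this because the library decides where it is safe to give
up.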
From samuel.kadolph at shopify.com Thu Jul 19 20:57:31 2012
From: samuel.kadolph at shopify.com (Samuel Kadolph)
Date: Thu, 19 Jul 2012 16:57:31 -0400
Subject: Unicorn is killing our rainbows workers
In-Reply-To: <20120719201633.GA8203@dcvr.yhbt.net>
References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net> <20120719201633.GA8203@dcvr.yhbt.net>
Message-ID:

On Thu, Jul 19, 2012 at 4:16 PM, Eric Wong wrote:
>
> Samuel Kadolph wrote:
> > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong wrote:
> > > Samuel Kadolph wrote:
> > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote:
> > >> > Samuel Kadolph wrote:
> > >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> > >> >> unicorn is killing our ThreadPool workers would help us greatly. If
> > >> >> you require additional info I would be happy to provide it.
> > >
> > > Also, are you using "preload_app true" ?
> >
> > Yes we are using preload_app true.
> >
> > > I'm a bit curious how these messages are happening, too:
> > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
> > > suspend/hibernation
> >
> > They are strange. My current hunch is the killing and that message are
> > symptoms of the same issue. Since it always follows a killing.
>
> I wonder if there's some background thread one of your gems spawns on
> load that causes the master to stall. I'm not seeing how else unicorn
> could think it was in suspend/hibernation.
>
> > > Can you tell (from Rails logs) if the to-be-killed workers are still
> > > processing requests/responses the 300s before when the unicorn timeout
> > > hits it? AFAIK, Rails logs the PID of each worker processing the
> > > request.
> >
> > Rails doesn't log the PID, but it would seem that after upgrading to
> > mysql2 0.2.18 it is no longer killing workers that are busy with
> > requests.
>
> Oops, I think I've been spoiled into thinking the Hodel3000CompliantLogger
> is the default Rails logger :)
>
> > > If anything, I'd lower the unicorn timeout to something low (maybe
> > > 5-10s) since that detects hard lockups at the VM level. Individual
> > > requests in Rainbows! _are_ allowed to take longer than the unicorn
> > > timeout.
> >
> > We lowered the unicorn timeout to 5 seconds, but that did not
> > stop the killings, though they seem to be happening less often. I have
> > some of our stderr logs after setting the timeout to 5 seconds at
> > https://gist.github.com/3144250.
>
> Thanks for trying that!
>
> > > Newer versions of mysql2 should avoid potential issues with
> > > ThreadTimeout/Timeout (or anything that hits Thread#kill). I think
> > > mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare
> > > (but possibly related to your issue) bug.
> >
> > Upgrading mysql2 seems to have stopped unicorn from killing workers
> > that are currently busy. We were stress testing it last night and
> > after we upgraded to 0.2.18 we had no more 502s from the app but this
> > could be a coincidence since the killings are still happening.
>
> Alright, good to know 0.2.18 solved your problems. Btw, have you
> noticed any general connectivity issues to your MySQL server?
> There were quite a few bugfixes from 0.2.6..0.2.18, though.
>
> Anyways, I'm happy your problem seems to be fixed with the mysql2
> upgrade :)

Unfortunately that didn't fix the problem. We had a large sale today
and had 2 502s. We're going to try p194 next week and I'll let you
know if that fixes it.

> > Our ops guys say we had this problem before we were using ThreadTimeout.
>
> OK. That's somewhat reassuring to know (especially since the culprit
> seems to be an old mysql2 gem). I've had other users (privately) report
> issues with recursive locking because of ensure clauses (e.g.
> Mutex#synchronize) that I forgot to document.
We're going to try going without ThreadTimeout again to make sure
that's not the issue.

From normalperson at yhbt.net Thu Jul 19 21:31:25 2012
From: normalperson at yhbt.net (Eric Wong)
Date: Thu, 19 Jul 2012 14:31:25 -0700
Subject: Unicorn is killing our rainbows workers
In-Reply-To:
References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net> <20120719201633.GA8203@dcvr.yhbt.net>
Message-ID: <20120719213125.GA17708@dcvr.yhbt.net>

Samuel Kadolph wrote:
> On Thu, Jul 19, 2012 at 4:16 PM, Eric Wong wrote:
> > Samuel Kadolph wrote:
> > > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong wrote:
> > > > Samuel Kadolph wrote:
> > > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote:
> > > >> > Samuel Kadolph wrote:
> > > >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> > > >> >> unicorn is killing our ThreadPool workers would help us greatly. If
> > > >> >> you require additional info I would be happy to provide it.
> > > >
> > > > Also, are you using "preload_app true" ?
> > >
> > > Yes we are using preload_app true.
> > >
> > > > I'm a bit curious how these messages are happening, too:
> > > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
> > > > suspend/hibernation
> > >
> > > They are strange. My current hunch is the killing and that message are
> > > symptoms of the same issue. Since it always follows a killing.
> >
> > I wonder if there's some background thread one of your gems spawns on
> > load that causes the master to stall. I'm not seeing how else unicorn
> > could think it was in suspend/hibernation.

> > Anyways, I'm happy your problem seems to be fixed with the mysql2
> > upgrade :)
>
> Unfortunately that didn't fix the problem. We had a large sale today
> and had 2 502s. We're going to try p194 next week and I'll let you
> know if that fixes it.

Are you seeing the same errors as before in stderr for those?

Can you also try disabling preload_app?
But before disabling preload_app, can you also check a few things on
a running master?

* "lsof -p "

  To see if there's odd connections the master is making.

* Assuming you're on Linux, can you also check for any other threads
  the master might be running (and possibly stuck on)?

    ls /proc//task/

  The output should be 2 directories:

    /
    /

  If you have a 3rd entry, you can confirm something in your app or one of
  your gems is spawning a background thread which could be throwing
  the master off...

> > > Our ops guys say we had this problem before we were using ThreadTimeout.
> >
> > OK. That's somewhat reassuring to know (especially since the culprit
> > seems to be an old mysql2 gem). I've had other users (privately) report
> > issues with recursive locking because of ensure clauses (e.g.
> > Mutex#synchronize) that I forgot to document.
>
> We're going to try going without ThreadTimeout again to make sure
> that's not the issue.

Alright.

Btw, I also suggest any Rails/application-level logs include the PID and
timestamp of the request. This way you can see and correlate the worker
killing the request to when/if the Rails app stopped processing
requests.

From samuel.kadolph at shopify.com Fri Jul 20 00:23:35 2012
From: samuel.kadolph at shopify.com (Samuel Kadolph)
Date: Thu, 19 Jul 2012 20:23:35 -0400
Subject: Unicorn is killing our rainbows workers
In-Reply-To: <20120719213125.GA17708@dcvr.yhbt.net>
References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net> <20120719201633.GA8203@dcvr.yhbt.net> <20120719213125.GA17708@dcvr.yhbt.net>
Message-ID:

On Thu, Jul 19, 2012 at 5:31 PM, Eric Wong wrote:
> Samuel Kadolph wrote:
>> On Thu, Jul 19, 2012 at 4:16 PM, Eric Wong wrote:
>> > Samuel Kadolph wrote:
>> > > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong wrote:
>> > > > Samuel Kadolph wrote:
>> > > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote:
>> > > >> > Samuel Kadolph wrote:
>> > > >> >> https://gist.github.com/9ec96922e55a59753997.
>> > > >> >> Any insight into why
>> > > >> >> unicorn is killing our ThreadPool workers would help us greatly. If
>> > > >> >> you require additional info I would be happy to provide it.
>> > > >
>> > > > Also, are you using "preload_app true" ?
>> > >
>> > > Yes we are using preload_app true.
>> > >
>> > > > I'm a bit curious how these messages are happening, too:
>> > > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
>> > > > suspend/hibernation
>> > >
>> > > They are strange. My current hunch is the killing and that message are
>> > > symptoms of the same issue. Since it always follows a killing.
>> >
>> > I wonder if there's some background thread one of your gems spawns on
>> > load that causes the master to stall. I'm not seeing how else unicorn
>> > could think it was in suspend/hibernation.
>
>> > Anyways, I'm happy your problem seems to be fixed with the mysql2
>> > upgrade :)
>>
>> Unfortunately that didn't fix the problem. We had a large sale today
>> and had 2 502s. We're going to try p194 next week and I'll let you
>> know if that fixes it.
>
> Are you seeing the same errors as before in stderr for those?

Yeah, we get the same killing, reaping and suspend/hibernation
messages with the 5 second timeout. Upgrading mysql2 seemed to have
prevented any 502s during our stress tests but that was not the case.

> Can you also try disabling preload_app?
>
> But before disabling preload_app, can you also check a few things on
> a running master?
>
> * "lsof -p "
>
> To see if there's odd connections the master is making.
>
> * Assuming you're on Linux, can you also check for any other threads
> the master might be running (and possibly stuck on)?
>
> ls /proc//task/
>
> The output should be 2 directories:
>
> /
> /
>
> If you have a 3rd entry, you can confirm something in your app or one of
> your gems is spawning a background thread which could be throwing
> the master off...

I'll see if we can try this tomorrow but it will probably be on Monday.
>> > > Our ops guys say we had this problem before we were using ThreadTimeout.
>> >
>> > OK. That's somewhat reassuring to know (especially since the culprit
>> > seems to be an old mysql2 gem). I've had other users (privately) report
>> > issues with recursive locking because of ensure clauses (e.g.
>> > Mutex#synchronize) that I forgot to document.
>>
>> We're going to try going without ThreadTimeout again to make sure
>> that's not the issue.
>
> Alright.
>
> Btw, I also suggest any Rails/application-level logs include the PID and
> timestamp of the request. This way you can see and correlate the worker
> killing the request to when/if the Rails app stopped processing
> requests.

We found that one of our servers was actually out of the ELB pool so it
wasn't getting pinged constantly and it does not have any killing
messages (other than deploys, which also had the suspend/hibernation
messages). We'll have more time free next week to dig further into this.

From normalperson at yhbt.net Fri Jul 20 02:40:17 2012
From: normalperson at yhbt.net (Eric Wong)
Date: Fri, 20 Jul 2012 02:40:17 +0000
Subject: [PATCH] thread_timeout: document additional caveats
In-Reply-To: <20120719201403.GA11657@dcvr.yhbt.net>
References: <20120719201403.GA11657@dcvr.yhbt.net>
Message-ID: <20120720024017.GA6512@dcvr.yhbt.net>

Eric Wong wrote:
> Again, for the one thousandth time, timing out threads is very
> tricky business :<

On a related note, it looks like Thread.control_interrupt landed
in Ruby trunk (http://svn.ruby-lang.org/repos/ruby/trunk r36470)

Perhaps the future for interrupting long-running threads is less
bleak.

+/*
+ * call-seq:
+ *   Thread.control_interrupt(hash) { ... } -> result of the block
+ *
+ * Thread.control_interrupt controls interrupt timing.
+ *
+ * _interrupt_ means asynchronous event and corresponding procedure
+ * by Thread#raise, Thread#kill, signal trap (not supported yet)
+ * and main thread termination (if main thread terminates, then all
+ * other thread will be killed).
+ *
+ * _hash_ has pairs of ExceptionClass and TimingSymbol. TimingSymbol
+ * is one of them:
+ * - :immediate   Invoke interrupt immediately.
+ * - :on_blocking Invoke interrupt while _BlockingOperation_.
+ * - :never       Never invoke interrupt.
+ *
+ * _BlockingOperation_ means that the operation will block the calling thread,
+ * such as read and write. On CRuby implementation, _BlockingOperation_ is
+ * operation executed without GVL.
+ *
+ * Masked interrupts are delayed until they are enabled.
+ * This method is similar to sigprocmask(3).
+ *
+ * TODO (DOC): control_interrupt is stacked.
+ * TODO (DOC): check ancestors.
+ * TODO (DOC): to prevent all interrupt, {Object => :never} works.
+ *
+ * NOTE: Asynchronous interrupts are difficult to use.
+ *       If you need to communicate between threads,
+ *       please consider to use another way such as Queue.
+ *       Or use them with deep understanding about this method.
+ *
+ *
+ *   # example: Guard from Thread#raise
+ *   th = Thread.new do
+ *     Thread.control_interrupt(RuntimeError => :never) {
+ *       begin
+ *         # Thread#raise doesn't interrupt here.
+ *         # You can write resource allocation code safely.
+ *         Thread.control_interrupt(RuntimeError => :immediate) {
+ *           # ...
+ *           # It is possible to be interrupted by Thread#raise.
+ *         }
+ *       ensure
+ *         # Thread#raise doesn't interrupt here.
+ *         # You can write resource deallocation code safely.
+ *       end
+ *     }
+ *   end
+ *   Thread.pass
+ *   # ...
+ * th.raise "stop" + * + * # example: Guard from TimeoutError + * require 'timeout' + * Thread.control_interrupt(TimeoutError => :never) { + * timeout(10){ + * # TimeoutError doesn't occur here + * Thread.control_interrupt(TimeoutError => :on_blocking) { + * # possible to be killed by TimeoutError + * # while blocking operation + * } + * # TimeoutError doesn't occur here + * } + * } + * + * # example: Stack control settings + * Thread.control_interrupt(FooError => :never) { + * Thread.control_interrupt(BarError => :never) { + * # FooError and BarError are prohibited. + * } + * } + * + * # example: check ancestors + * Thread.control_interrupt(Exception => :never) { + * # all exceptions inherited from Exception are prohibited. + * } + * + */ + +/* + * call-seq: + * Thread.check_interrupt() -> nil + * + * Check queued interrupts. + * + * If there are queued interrupts, process respective procedures. + * + * This method can be defined as the following Ruby code: + * + * def Thread.check_interrupt + * Thread.control_interrupt(Object => :immediate) { + * Thread.pass + * } + * end + * + * Examples: + * + * th = Thread.new{ + * Thread.control_interrupt(RuntimeError => :on_blocking){ + * while true + * ... + * # reach safe point to invoke interrupt + * Thread.check_interrupt + * ... + * end + * } + * } + * ... + * th.raise # stop thread + * + * NOTE: This example can be described by the another code. + * You need to keep to avoid asynchronous interrupts. + * + * flag = true + * th = Thread.new{ + * Thread.control_interrupt(RuntimeError => :on_blocking){ + * while true + * ... + * # reach safe point to invoke interrupt + * break if flag == false + * ... + * end + * } + * } + * ... + * flag = false # stop thread + */ From ivmaykov at gmail.com Mon Jul 23 23:34:03 2012 From: ivmaykov at gmail.com (Ilya Maykov) Date: Mon, 23 Jul 2012 16:34:03 -0700 Subject: Rainbows! + EventMachine + Sinatra::Synchrony == pegged CPU when idle? 
In-Reply-To: <20120619175409.GA27303@dcvr.yhbt.net> References: <20120619175409.GA27303@dcvr.yhbt.net> Message-ID: Hi Eric, Sorry for the delayed response. I've added inline answers to your questions. We've since resolved this issue by disabling keepalive in our rainbows config. So, this probably had to do with the keepalive implementation either in Rainbows itself or in the base Unicorn code. Answers to your other questions are inlined below. -- Ilya On Tue, Jun 19, 2012 at 10:54 AM, Eric Wong wrote: > Ilya Maykov wrote: >> Hi all, >> >> We're using Rainbows + EventMachine + Sinatra::Synchrony to run a >> fleet of RESTful web servers backed by a Cassandra cluster. We are >> using the EventMachineTransport to talk to Cassandra with an >> EM::Synchrony::ConnectionPool in each rainbows worker. We have a Storm >> cluster pushing a large stream of real-time data into the Rainbows >> fleet using HTTP PUT requests. We're running into some very strange >> performance issues and need help figuring out what's going on. > > I'm not at all familiar with Storm nor Cassandra. How big are the HTTP > PUT requests Rainbows! is getting? Is Storm pipelining HTTP requests by > any chance? That may not do well with the EM portion of Rainbows! The PUT requests are small - probably about 100 bytes of JSON payload per request (not counting HTTP overhead, headers, etc). Pipelining is disabled on the clients, though keep-alive is not. > >> Basically, when load is low, everything looks good. When we crank up >> the load, all of a sudden the CPU gets pegged, request latencies go >> waaaay up, and requests start timing out. Once this state is reached, >> the high CPU usage (4 rainbows worker processes at ~50% each on a >> 2-core machine = nearly full load) remains even if we completely shut >> off all incoming traffic. 
> >> Taking a look with strace -p, it looks like >> the rainbows processes are writing ascii NUL characters to file >> descriptor 7 (which is a FIFO) as fast as the kernel will let them. My >> guess is that the worker is trying to communicate with the rainbows >> master process via the FIFO. > > No, rainbows doesn't have code to send "\0" to the master. I > don't think EM does, either, maybe some other library you're > using... > No idea ... I think EventMachine does start up a background thread, not sure if that would affect anything in a bad way. > Which version of Ruby is this? Try adding "-f" to follow threads > for a worker. > This is with Ruby 1.9.2-p290, installed by using rbenv-installer. >> Not sure what is triggering this >> behavior, but would like to know if anyone else has ever seen >> something like this. This thread sounded like it could've been a >> similar issue, but died out without any conclusion: >> http://rubyforge.org/pipermail/rainbows-talk/2012-April/000345.html >> >> Some details about the setup: >> >> 6-node cassandra cluster >> 3 nodes running rainbows web servers >> 4 rainbows workers per node >> max of 50 cassandra connections per rainbows worker >> rainbows.conf has: >> >> Rainbows! do >> use :EventMachine >> worker_connections 50 >> keepalive_requests 1000 >> keepalive_timeout 10 >> end > > Do you have preload_app set to true anywhere? (Try leaving it as false > (the default)) > No, we are not using preload_app. > Can you also try "keepalive_timeout 0" to disable keepalive? (EM > handles it internally, but I'm not sure how well) > This turned out to be the problem. Disabling keepalive got rid of the CPU pegging. Surprisingly, it also made our average latency drop from about 50 ms to about 20 ms per request, even though every request now has to negotiate a connection handshake. So, we're just going to keep it disabled for now. 
The bug may be inside the keepalive code in Rainbows or Unicorn (not familiar with the codebase so not sure where that code lives). >> So, each rainbows node can handle 4 * 50 = 200 simultaneous connections >> >> 12 Storm worker processes writing to the rainbows web servers >> each Storm worker has max of 10 connections open to each of the 3 rainbows nodes >> >> So, each rainbows node has 12 * 10 = 120 incoming connections from Storm. >> >> Have been playing around with the numbers, the bug (assuming it is a >> bug) seems to be easier to trigger when I increase the number of >> incoming connections (from Storm workers), even if they are a lot less >> than the rainbows servers can take (60-70% of the max connections is >> usually enough). The bug is also easier to trigger when we increase >> the volume of data we're pushing through Storm - hundreds or thousands >> of requests per minute, no bug - hundreds of thousands of requests per >> minute, yes bug. Cassandra is not the issue, it can easily take the >> write load we're generating and is basically idle. >> >> Any help in figuring this out would be greatly appreciated. Thanks, > > Try my suggestions above. > > I would also search your libs/gems for what's writing "\0" since I don't > think it's Rainbows!... From normalperson at yhbt.net Tue Jul 24 00:21:40 2012 From: normalperson at yhbt.net (Eric Wong) Date: Mon, 23 Jul 2012 17:21:40 -0700 Subject: Rainbows! + EventMachine + Sinatra::Synchrony == pegged CPU when idle? In-Reply-To: References: <20120619175409.GA27303@dcvr.yhbt.net> Message-ID: <20120724002140.GA25177@dcvr.yhbt.net> Ilya Maykov wrote: > Hi Eric, > > Sorry for the delayed response. I've added inline answers to your > questions.
We've since resolved this issue by disabling keepalive in > our rainbows config. So, this probably had to do with the keepalive > implementation either in Rainbows itself or in the base Unicorn code. > Answers to your other questions are inlined below. Thank you very much for the follow-up (I wish more folks would do this :) If Rainbows! is using EventMachine, it'll use the EM.set_comm_inactivity_timeout method in EventMachine. > On Tue, Jun 19, 2012 at 10:54 AM, Eric Wong wrote: > > Ilya Maykov wrote: > > Can you also try "keepalive_timeout 0" to disable keepalive? (EM > > handles it internally, but I'm not sure how well) > > > > This turned out to be the problem. Disabling keepalive got rid of the > CPU pegging. Surprisingly, it also made our average latency drop from > about 50 ms to about 20 ms per request, even though every request now > has to negotiate a connection handshake. So, we're just going to keep > it disabled for now. The bug may be inside the keepalive code in > Rainbows or Unicorn (not familiar with the codebase so not sure where > that code lives). Are you setting Content-Length or "Transfer-Encoding: chunked" in responses? Rack::ContentLength or Rack::Chunked middleware might need to be loaded if your framework doesn't already include it. Lack of these headers may confuse clients, even... Where did you measure the 50 -> 20ms latency drop, from the client? About the latency drop: Was :tcp_cork enabled in your listen directive? I wonder if there's some bad interaction with :tcp_cork + EM which might explain the latency (but not the CPU usage). Disabling keepalive would force data out immediately and avoid any bad effects of :tcp_cork. Can you also try listen() with :tcp_defer_accept => 0? That might help if you're accept()-ing many connections at once.
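For reference, a minimal sketch of a config file with both suggestions applied. The Rainbows! block mirrors the settings quoted earlier in this thread; the file name and the listen port (8080) are assumptions for illustration only and are not taken from Ilya's setup:

```ruby
# rainbows.conf.rb (sketch; file name and port 8080 are hypothetical)
Rainbows! do
  use :EventMachine
  worker_connections 50
  keepalive_timeout 0 # 0 disables keepalive, as suggested above
end

# :tcp_defer_accept => 0 as suggested above (Linux-only socket option)
listen 8080, :tcp_defer_accept => 0
```

If the latency difference disappears with this listen setting, that would point at the accept path rather than at keepalive itself.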
Anyways, the TCP connection handshake is very fast on LANs/localhost, but hurts on high-latency connections (I seem to remember mainstream web browsers double simultaneous requests to compensate for lack of keepalive). Which version of EM are you using? The keepalive implementation for EM+Rainbows! is entirely handled by EM. From normalperson at yhbt.net Thu Jul 26 23:48:45 2012 From: normalperson at yhbt.net (Eric Wong) Date: Thu, 26 Jul 2012 23:48:45 +0000 Subject: Unicorn is killing our rainbows workers In-Reply-To: References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net> <20120719201633.GA8203@dcvr.yhbt.net> <20120719213125.GA17708@dcvr.yhbt.net> Message-ID: <20120726234845.GA29453@dcvr.yhbt.net> Samuel Kadolph wrote: > We'll have more time free next week to dig further into this. Hi Samuel, any update on this? From samuel.kadolph at shopify.com Fri Jul 27 00:00:46 2012 From: samuel.kadolph at shopify.com (Samuel Kadolph) Date: Thu, 26 Jul 2012 20:00:46 -0400 Subject: Unicorn is killing our rainbows workers In-Reply-To: <20120726234845.GA29453@dcvr.yhbt.net> References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net> <20120719201633.GA8203@dcvr.yhbt.net> <20120719213125.GA17708@dcvr.yhbt.net> <20120726234845.GA29453@dcvr.yhbt.net> Message-ID: On Thu, Jul 26, 2012 at 7:48 PM, Eric Wong wrote: > Samuel Kadolph wrote: >> We'll have more time free next week to dig further into this. > > Hi Samuel, any update on this? Our ops guys have been busy so I don't have the output from lsof but it didn't look like it was spawning any extra threads or opening any unexplainable connections. But I think we should have been checking the worker processes and not the master, right? Haven't tried disabling preload_app yet but we have tried ruby-1.9.3-p194 and that did not resolve the issue. We've also upgraded to rails 3.2 and that also did not resolve the issue. 
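As an aside for anyone following along: the earlier suggestion in this thread to put the PID and a timestamp on every application log line could look roughly like the sketch below. It only uses the stdlib Logger; the helper name pid_logger and the sample message are made up for illustration, and wiring it into a Rails app's logger is left out.

```ruby
require "logger"
require "time" # for Time#iso8601

# Hypothetical helper: returns a Logger whose lines carry a UTC timestamp
# and the current process ID, so application log lines can be lined up
# against the master's "killing worker" messages.
def pid_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, msg|
    "#{time.getutc.iso8601(3)} pid=#{Process.pid} #{severity} #{msg}\n"
  end
  logger
end

log = pid_logger($stderr)
log.info("GET /orders started") # sample message, not from the real app
```

Process.pid is evaluated when each line is written, so a logger created before fork still reports the worker's own PID afterwards.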
From normalperson at yhbt.net Fri Jul 27 00:11:25 2012 From: normalperson at yhbt.net (Eric Wong) Date: Fri, 27 Jul 2012 00:11:25 +0000 Subject: Unicorn is killing our rainbows workers In-Reply-To: References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net> <20120719201633.GA8203@dcvr.yhbt.net> <20120719213125.GA17708@dcvr.yhbt.net> <20120726234845.GA29453@dcvr.yhbt.net> Message-ID: <20120727001125.GA30957@dcvr.yhbt.net> Samuel Kadolph wrote: > On Thu, Jul 26, 2012 at 7:48 PM, Eric Wong wrote: > > Samuel Kadolph wrote: > >> We'll have more time free next week to dig further into this. > > > > Hi Samuel, any update on this? > > Our ops guys have been busy so I don't have the output from lsof but > it didn't look like it was spawning any extra threads or opening any > unexplainable connections. But I think we should have been checking > the worker processes and not the master, right? Definitely check the master, too. It's the master that seems to believe it's suspended, so that makes me believe something is wrong with the master (and this is likely due to preload_app). > Haven't tried disabling preload_app yet but we have tried From samuel.kadolph at shopify.com Fri Jul 27 20:01:08 2012 From: samuel.kadolph at shopify.com (Samuel Kadolph) Date: Fri, 27 Jul 2012 16:01:08 -0400 Subject: Unicorn is killing our rainbows workers In-Reply-To: <20120727001125.GA30957@dcvr.yhbt.net> References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net> <20120719201633.GA8203@dcvr.yhbt.net> <20120719213125.GA17708@dcvr.yhbt.net> <20120726234845.GA29453@dcvr.yhbt.net> <20120727001125.GA30957@dcvr.yhbt.net> Message-ID: On Thu, Jul 26, 2012 at 8:11 PM, Eric Wong wrote: > Samuel Kadolph wrote: >> On Thu, Jul 26, 2012 at 7:48 PM, Eric Wong wrote: >> > Samuel Kadolph wrote: >> >> We'll have more time free next week to dig further into this. >> > >> > Hi Samuel, any update on this? 
>> >> Our ops guys have been busy so I don't have the output from lsof but >> it didn't look like it was spawning any extra threads or opening any >> unexplainable connections. But I think we should have been checking >> the worker processes and not the master, right? > > Definitely check the master, too. It's the master that seems to > believe it's suspended, so that makes me believe something is wrong > with the master (and this is likely due to preload_app). > >> Haven't tried disabling preload_app yet but we have tried I've got the output of lsof and ls at https://gist.github.com/3190171. From normalperson at yhbt.net Fri Jul 27 20:40:40 2012 From: normalperson at yhbt.net (Eric Wong) Date: Fri, 27 Jul 2012 20:40:40 +0000 Subject: Unicorn is killing our rainbows workers In-Reply-To: References: <20120719002641.GA17210@dcvr.yhbt.net> <20120719201633.GA8203@dcvr.yhbt.net> <20120719213125.GA17708@dcvr.yhbt.net> <20120726234845.GA29453@dcvr.yhbt.net> <20120727001125.GA30957@dcvr.yhbt.net> Message-ID: <20120727204040.GA2192@dcvr.yhbt.net> Samuel Kadolph wrote: > On Thu, Jul 26, 2012 at 8:11 PM, Eric Wong wrote: > >> Our ops guys have been busy so I don't have the output from lsof but > >> it didn't look like it was spawning any extra threads or opening any > >> unexplainable connections. But I think we should have been checking > >> the worker processes and not the master, right? > > > > Definitely check the master, too. It's the master that seems to > > believe it's suspended, so that makes me believe something is wrong > > with the master (and this is likely due to preload_app). > > > >> Haven't tried disabling preload_app yet but we have tried > > I've got the output of lsof and ls at https://gist.github.com/3190171. Thanks, that's the output for the master? I don't see anything obviously wrong. 
I seem to recall the Ruby library responsible for the following log file also spawns its own background thread, but your "ls" only shows 2 tasks (instead of 3): > ruby 26564 root 9w REG 202,1 51221 529742 APP_PATH/shared/log/newrelic_agent.log > $ ls /proc/26564/task/ > 26564 27052 (While the Ruby code for the module responsible for that log file is technically "open", it's not Free, so I'm not comfortable looking at that code). From samuel.kadolph at shopify.com Tue Jul 31 14:09:08 2012 From: samuel.kadolph at shopify.com (Samuel Kadolph) Date: Tue, 31 Jul 2012 10:09:08 -0400 Subject: Unicorn is killing our rainbows workers In-Reply-To: <20120727204040.GA2192@dcvr.yhbt.net> References: <20120719002641.GA17210@dcvr.yhbt.net> <20120719201633.GA8203@dcvr.yhbt.net> <20120719213125.GA17708@dcvr.yhbt.net> <20120726234845.GA29453@dcvr.yhbt.net> <20120727001125.GA30957@dcvr.yhbt.net> <20120727204040.GA2192@dcvr.yhbt.net> Message-ID: On Fri, Jul 27, 2012 at 4:40 PM, Eric Wong wrote: > Samuel Kadolph wrote: >> On Thu, Jul 26, 2012 at 8:11 PM, Eric Wong wrote: >> >> Our ops guys have been busy so I don't have the output from lsof but >> >> it didn't look like it was spawning any extra threads or opening any >> >> unexplainable connections. But I think we should have been checking >> >> the worker processes and not the master, right? >> > >> > Definitely check the master, too. It's the master that seems to >> > believe it's suspended, so that makes me believe something is wrong >> > with the master (and this is likely due to preload_app). >> > >> >> Haven't tried disabling preload_app yet but we have tried >> >> I've got the output of lsof and ls at https://gist.github.com/3190171. > > Thanks, that's the output for the master? I don't see anything > obviously wrong. 
> > I seem to recall the Ruby library responsible for the following log file > also spawns its own background thread, but your "ls" only shows 2 tasks > (instead of 3): > >> ruby 26564 root 9w REG 202,1 51221 529742 APP_PATH/shared/log/newrelic_agent.log > >> $ ls /proc/26564/task/ >> 26564 27052 > > (While the Ruby code for the module responsible for that log file is > technically "open", it's not Free, so I'm not comfortable looking at > that code). So 2 updates: yes that lsof output is from the master process and using preload_app false solves the issue. No more killings and the suspend/hibernation messages stopped as well. We lost newrelic data so we're going to try putting preload_app back to true and removing the newrelic gem. From normalperson at yhbt.net Tue Jul 31 20:28:19 2012 From: normalperson at yhbt.net (Eric Wong) Date: Tue, 31 Jul 2012 13:28:19 -0700 Subject: Unicorn is killing our rainbows workers In-Reply-To: References: <20120719201633.GA8203@dcvr.yhbt.net> <20120719213125.GA17708@dcvr.yhbt.net> <20120726234845.GA29453@dcvr.yhbt.net> <20120727001125.GA30957@dcvr.yhbt.net> <20120727204040.GA2192@dcvr.yhbt.net> Message-ID: <20120731202819.GA6417@dcvr.yhbt.net> Samuel Kadolph wrote: > So 2 updates: yes that lsof output is from the master process and > using preload_app false solves the issue. No more killings and the > suspend/hibernation messages stopped as well. We lost newrelic data so > we're going to try putting preload_app back to true and removing the > newrelic gem. Thank you for the updates and reporting the resolution! Hopefully all goes well with other gems.
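For anyone hitting the same symptoms, the change described above amounts to something like the following in the config file. This is only a sketch: the worker count, ThreadPool size and 300 second timeout are the figures mentioned earlier in this thread, and nothing else about the real production config is known here.

```ruby
# Sketch of the relevant config lines (not the actual production file)
worker_processes 2
timeout 300
preload_app false # the default; each worker loads the app after fork

Rainbows! do
  use :ThreadPool
  worker_connections 100 # 100 threads per worker with ThreadPool
end
```

With preload_app false, anything a gem starts while the app loads (such as a background thread) is started inside each worker rather than in the master.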