Unicorn is killing our rainbows workers

Samuel Kadolph samuel.kadolph at shopify.com
Fri Jul 20 00:23:35 UTC 2012


On Thu, Jul 19, 2012 at 5:31 PM, Eric Wong <normalperson at yhbt.net> wrote:
> Samuel Kadolph <samuel.kadolph at shopify.com> wrote:
>> On Thu, Jul 19, 2012 at 4:16 PM, Eric Wong <normalperson at yhbt.net> wrote:
>> > Samuel Kadolph <samuel.kadolph at shopify.com> wrote:
>> > > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong <normalperson at yhbt.net> wrote:
>> > > > Samuel Kadolph <samuel.kadolph at shopify.com> wrote:
>> > > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong <normalperson at yhbt.net> wrote:
>> > > >> > Samuel Kadolph <samuel.kadolph at jadedpixel.com> wrote:
>> > > >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
>> > > >> >> unicorn is killing our ThreadPool workers would help us greatly. If
>> > > >> >> you require additional info I would be happy to provide it.
>> > > >
>> > > > Also, are you using "preload_app true" ?
>> > >
>> > > Yes we are using preload_app true.
>> > >
>> > > > I'm a bit curious how these messages are happening, too:
>> > > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
>> > > > suspend/hibernation
>> > >
>> > > They are strange. My current hunch is the killing and that message are
>> > > symptoms of the same issue. Since it always follows a killing.
>> >
>> > I wonder if there's some background thread one of your gems spawns on
>> > load that causes the master to stall.  I'm not seeing how else unicorn
>> > could think it was in suspend/hibernation.
>
>> > Anyways, I'm happy your problem seems to be fixed with the mysql2
>> > upgrade :)
>>
>> Unfortunately that didn't fix the problem. We had a large sale today
>> and had 2 502s. We're going to try p194 on next week and I'll let you
>> know if that fixes it.
>
> Are you seeing the same errors as before in stderr for those?

Yeah, we get the same killing, reaping and suspend/hibernation
messages with the 5 second timeout. Upgrading mysql2 seemed to have
prevented any 502s during our stress tests but we that was no the
case.

> Can you also try disabling preload_app?
>
> But before disabling preload_app, you can also check a few things on
> a running master?
>
> * "lsof -p <pid_of_master>"
>
>   To see if there's odd connections the master is making.
>
> * Assuming you're on Linux, can you also check for any other threads
>   the master might be running (and possibly stuck on)?
>
>     ls /proc/<pid_of_master>/task/
>
>   The output should be 2 directories:
>
>     <pid_of_master>/
>     <tid_of_timer_thread>/
>
>   If you have a 3rd entry, you can confirm something in your app one of
>   your gems is spawning a background thread which could be throwing
>   the master off...

I'll see if we can try this tomorrow but it will probably be on Monday.

>> > > Our ops guys say we had this problem before we were using ThreadTimeout.
>> >
>> > OK.  That's somewhat reassuring to know (especially since the culprit
>> > seems to be an old mysql2 gem).  I've had other users (privately) report
>> > issues with recursive locking because of ensure clauses (e.g.
>> > Mutex#synchronize) that I forgot to document.
>>
>> We're going to try going without ThreadTimeout again to make sure
>> that's not the issue.
>
> Alright.
>
> Btw, I also suggest any Rails/application-level logs include the PID and
> timestamp of the request.  This way you can see and correlate the worker
> killing the request to when/if the Rails app stopped processing
> requests.

We found that one of our servers was actually out of the ELB pool so
it wasn't getting pinged constantly and it does not have any killing
messages (other than deploys, which also had the suspend/hibernation
messages). We'll have more time free next week to dig further into
this.


More information about the rainbows-talk mailing list