Unicorn is killing our rainbows workers

Samuel Kadolph samuel.kadolph at shopify.com
Thu Jul 19 20:57:31 UTC 2012


On Thu, Jul 19, 2012 at 4:16 PM, Eric Wong <normalperson at yhbt.net> wrote:
>
> Samuel Kadolph <samuel.kadolph at shopify.com> wrote:
> > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong <normalperson at yhbt.net> wrote:
> > > Samuel Kadolph <samuel.kadolph at shopify.com> wrote:
> > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong <normalperson at yhbt.net> wrote:
> > >> > Samuel Kadolph <samuel.kadolph at jadedpixel.com> wrote:
> > >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> > >> >> unicorn is killing our ThreadPool workers would help us greatly. If
> > >> >> you require additional info I would be happy to provide it.
> > >
> > > Also, are you using "preload_app true" ?
> >
> > Yes we are using preload_app true.
> >
> > > I'm a bit curious how these messages are happening, too:
> > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
> > > suspend/hibernation
> >
> > They are strange. My current hunch is the killing and that message are
> > symptoms of the same issue. Since it always follows a killing.
>
> I wonder if there's some background thread one of your gems spawns on
> load that causes the master to stall.  I'm not seeing how else unicorn
> could think it was in suspend/hibernation.
>
> > > Can you tell (from Rails logs) if the to-be-killed workers are still
> > > processing requests/responses the 300s before when the unicorn timeout
> > > hits it?  AFAIK, Rails logs the PID of each worker processing the
> > > request.
> >
> > rails doesn't log the pid but it would seem that after upgrading to
> > mysql 0.2.18 it is no longer killing workers that are busy with
> > requests.
>
> Oops, I think I've been spoiled into thinking the Hodel3000CompliantLogger
> is the default Rails logger :)
>
> > > If anything, I'd lower the unicorn timeout to something low (maybe
> > > 5-10s) since that detects hard lockups at the VM level.  Individual
> > > requests in Rainbows! _are_ allowed to take longer than the unicorn
> > > timeout.
> >
> > We lowered the unicorn timeout to 5 seconds and but that did not
> > change the killings but they seem to be happening less often. I have
> > some of our stderr logs after setting the timeout to 5 seconds at
> > https://gist.github.com/3144250.
>
> Thanks for trying that!
>
> > > Newer versions of mysql2 should avoid potential issues with
> > > ThreadTimeout/Timeout (or anything that hits Thread#kill).  I think
> > > mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare
> > > (but possibly related to your issue) bug,
> >
> > Upgrading mysql2 seems to have stopped unicorn from killing workers
> > that are currently busy. We were stress testing it last night and
> > after we upgraded to 0.2.18 we had no more 502s from the app but this
> > could be a coincidence since the killings are still happen.
>
> Alright, good to know 0.2.18 solved your problems.  Btw, have you
> noticed any general connectivity issues to your MySQL server?
> There were quite a few bugfixes from 0.2.6..0.2.18, though.
>
> Anyways, I'm happy your problem seems to be fixed with the mysql2
> upgrade :)

Unfortunately that didn't fix the problem. We had a large sale today
and had 2 502s. We're going to try p194 on next week and I'll let you
know if that fixes it.

> > Our ops guys say we had this problem before we were using ThreadTimeout.
>
> OK.  That's somewhat reassuring to know (especially since the culprit
> seems to be an old mysql2 gem).  I've had other users (privately) report
> issues with recursive locking because of ensure clauses (e.g.
> Mutex#synchronize) that I forgot to document.

We're going to try going without ThreadTimeout again to make sure
that's not the issue.


More information about the rainbows-talk mailing list