More unexplained timeouts
nick at auger.net
Sun Sep 29 20:13:49 UTC 2013
We're still suffering from unexplained workers timing out. We recently upgraded to the latest unicorn, 4.6.3 (while still on REE 1.8.7), in the hope that it would solve our issues. Unfortunately, this seemed to exacerbate the problem, with timeouts happening more frequently, but that could be related to greater precision in timeouts in newer versions of unicorn. (In our unicorn 3.6.2, a timeout set to 120s might not actually time out until 180s or more, thus allowing a bit more time for Ruby to finish whatever it was choking on.)
We dropped the timeout down to 65s (to make sure it was triggered) and then tried to add more logging (per http://permalink.gmane.org/gmane.comp.lang.ruby.unicorn.general/1269). The START/FINISH approach confirms it's not an issue with our application code, i.e.:
HH:MM:SS- S/F[PID]- /PATH
15:21:01- START-25904- /pathA
15:21:01- FINISH-25904- /pathA
15:21:01- START-25904- /pathB
15:21:01- FINISH-25904- /pathB
15:21:01- START-25904- /pathC
15:21:01- FINISH-25904- /pathC
worker=11 PID:25904 timeout (66s > 65s), killing
reaped #<Process::Status: pid=25904,signaled(SIGKILL=9)> worker=11
For each START we always get a corresponding FINISH and then the worker is killed. Additionally, our nginx logs confirm that this last request was sent back to the client. No 'upstream' errors in our nginx log, either.
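For anyone wanting to reproduce this kind of logging, here is a minimal sketch of a Rack middleware that writes lines in the format shown above. It is an illustration only, not necessarily what the linked post suggests or what we run; the class name StartFinishLogger and the io argument are hypothetical.

```ruby
# Hypothetical Rack middleware (name and arguments are assumptions): logs a
# START line before the app runs and a FINISH line once it returns, tagged
# with the worker PID and the request path.
class StartFinishLogger
  def initialize(app, io = $stderr)
    @app = app
    @io = io
  end

  def call(env)
    log('START', env)
    @app.call(env)          # the app's response is what call returns
  ensure
    log('FINISH', env)      # runs even if the app raises
  end

  private

  def log(marker, env)
    stamp = Time.now.strftime('%H:%M:%S')
    @io.puts "#{stamp}- #{marker}-#{Process.pid}- #{env['PATH_INFO']}"
    @io.flush
  end
end
```

It would be enabled with `use StartFinishLogger` in config.ru. Note that FINISH here only means the application returned its response to the server, not that the body reached the client, which is why the nginx logs are still worth cross-checking.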
When we tried the Thread sleep approach, nothing actually appeared in the logs. I imagine this means that Ruby itself or some C extension is misbehaving.
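As a rough sketch of what a Thread sleep watchdog can look like (one possible reading of the approach; the linked post may differ), the middleware below starts a background thread that sleeps for a limit and logs if the request is still running when it wakes. The class name SlowRequestWatchdog, the 30s default and the io argument are hypothetical. On Ruby 1.8, threads are green threads, so a C call that blocks without yielding to the interpreter also stops the watchdog thread from ever running, which would be consistent with seeing nothing in the logs.

```ruby
# Hypothetical watchdog middleware (name and defaults are assumptions): a
# background thread sleeps for `limit` seconds and, if the request is still
# running when it wakes, logs a warning plus a backtrace where available.
class SlowRequestWatchdog
  def initialize(app, limit = 30, io = $stderr)
    @app = app
    @limit = limit
    @io = io
  end

  def call(env)
    request_thread = Thread.current
    watchdog = Thread.new do
      sleep @limit
      @io.puts "#{Process.pid} still running #{env['PATH_INFO']} after #{@limit}s"
      # Thread#backtrace does not exist on Ruby 1.8, so check first.
      if request_thread.respond_to?(:backtrace)
        @io.puts((request_thread.backtrace || []).join("\n"))
      end
      @io.flush
    end
    @app.call(env)
  ensure
    watchdog.kill if watchdog   # request finished in time; cancel the timer
  end
end
```

The limit should sit well below the unicorn timeout (for example 30s against our 65s) so the watchdog has a chance to log before the master kills the worker.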
Unfortunately, it's been impossible for us to recreate this in development.
REE 1.8.7 2011.12
16 unicorn workers on 8 cores
No swap activity, no peaks in load
Again, thanks for all your help!