A barrage of unexplained timeouts

nick at auger.net nick at auger.net
Tue Aug 20 20:03:30 UTC 2013


"Eric Wong" <normalperson at yhbt.net> said:

> nick at auger.net wrote:
>> "Eric Wong" <normalperson at yhbt.net> said:
>> >
>> > Do you have any other requests in your logs which could be taking
>> > a long time and hogging workers, but not high enough to trigger the
>> > unicorn kill timeout.
> 
>> I don't *think* so.  Most requests finish <300ms.  We do have some
>> more intensive code-paths, but they're administrative and called much
>> less frequently.  Most of these pages complete in <3seconds.
>>
>> For requests that made it to rails logging, the LAST processed request
>> before the worker timed-out all completed very quickly (and no real
>> pattern in terms of which page may be triggering it.)
> 
> This is really strange.  This was only really bad for a 7s period?

It was a 7 minute period.  All of the workers would become busy and exceed their >120s timeout.  Master would kill and re-spawn them, they'd start to respond to a handful of requests (anywhere from 5-50) after which they'd become "busy" again, and get force killed by master.  This pattern happened 3 times over a 7 minute period.

> Has it happened again?

No

> Anything else going on with the system at that time?  Swapping, particularly...

No swap activity or high load.  Our munin graphs indicate a peak of web/app server disk latency around that time, although our graphs show many other similar peaks, without incident.

> And if you're inside a VM, maybe your neighbors were hogging things.
> Large PUT/POST requests which require filesystem I/O are particularly
> sensitive to this.

We can't blame virtualization as we're on dedicated hardware.

>> > Is this with Unix or TCP sockets?  If it's over a LAN, maybe there's
>> > still a bad switch/port/cable somewhere (that happens often to me).
>>
>> TCP sockets, with nginx and unicorn running on the same box.
> 
> OK, that probably rules out a bunch of problems.
> Just to be thorough, anything interesting in dmesg or syslogs?

Nothing interesting in /var/log/messages.  dmesg (no timestamps) includes lines like:

TCP: Treason uncloaked! Peer ###.###.###.###:63554/80 shrinks window 1149440448:1149447132. Repaired.

>> > With Unix sockets, I don't recall encountering recent problems under
>> > Linux.  Which OS are you running?
>>
>> Stock RHEL 5, kernel 2.6.18.
> 
> RHEL 5.0 or 5.x?  I can't remember /that/ far back to 5.0 (I don't think
> I even tried it until 5.2), but don't recall anything being obviously
> broken in those...

We're on RHEL 5.6.

Thanks,

-Nick



More information about the mongrel-unicorn mailing list