A barrage of unexplained timeouts
nick at auger.net
nick at auger.net
Tue Aug 20 20:03:30 UTC 2013
"Eric Wong" <normalperson at yhbt.net> said:
> nick at auger.net wrote:
>> "Eric Wong" <normalperson at yhbt.net> said:
>> > Do you have any other requests in your logs which could be taking
>> > a long time and hogging workers, but not high enough to trigger the
>> > unicorn kill timeout.
>> I don't *think* so. Most requests finish <300ms. We do have some
>> more intensive code-paths, but they're administrative and called much
>> less frequently. Most of these pages complete in <3seconds.
>> For requests that made it to rails logging, the LAST processed request
>> before the worker timed-out all completed very quickly (and no real
>> pattern in terms of which page may be triggering it.)
> This is really strange. This was only really bad for a 7s period?
It was a 7 minute period. All of the workers would become busy and exceed their >120s timeout. Master would kill and re-spawn them, they'd start to respond to a handful of requests (anywhere from 5-50) after which they'd become "busy" again, and get force killed by master. This pattern happened 3 times over a 7 minute period.
> Has it happened again?
> Anything else going on with the system at that time? Swapping, particularly...
No swap activity or high load. Our munin graphs indicate a peak of web/app server disk latency around that time, although our graphs show many other similar peaks, without incident.
> And if you're inside a VM, maybe your neighbors were hogging things.
> Large PUT/POST requests which require filesystem I/O are particularly
> sensitive to this.
We can't blame virtualization as we're on dedicated hardware.
>> > Is this with Unix or TCP sockets? If it's over a LAN, maybe there's
>> > still a bad switch/port/cable somewhere (that happens often to me).
>> TCP sockets, with nginx and unicorn running on the same box.
> OK, that probably rules out a bunch of problems.
> Just to be thorough, anything interesting in dmesg or syslogs?
Nothing interesting in /var/log/messages. dmesg (no timestamps) includes lines like:
TCP: Treason uncloaked! Peer ###.###.###.###:63554/80 shrinks window 1149440448:1149447132. Repaired.
>> > With Unix sockets, I don't recall encountering recent problems under
>> > Linux. Which OS are you running?
>> Stock RHEL 5, kernel 2.6.18.
> RHEL 5.0 or 5.x? I can't remember /that/ far back to 5.0 (I don't think
> I even tried it until 5.2), but don't recall anything being obviously
> broken in those...
We're on RHEL 5.6.
More information about the mongrel-unicorn