A barrage of unexplained timeouts
nick at auger.net
Tue Aug 20 21:19:05 UTC 2013
"Eric Wong" <normalperson at yhbt.net> said:
> nick at auger.net wrote:
>> "Eric Wong" <normalperson at yhbt.net> said:
>> > This is really strange. This was only really bad for a 7s period?
>> It was a 7 minute period. All of the workers would become busy and
>> exceed their >120s timeout. Master would kill and re-spawn them,
>> they'd start to respond to a handful of requests (anywhere from 5-50)
>> after which they'd become "busy" again, and get force killed by
>> master. This pattern happened 3 times over a 7 minute period.
>> > Has it happened again?
>> > Anything else going on with the system at that time? Swapping,
>> > particularly...
>> No swap activity or high load. Our munin graphs indicate a peak of
>> web/app server disk latency around that time, although our graphs show
>> many other similar peaks, without incident.
> I'm stumped :<
I was afraid you'd say that :(.
> Do you have any background threads running that could be hanging the
> workers? This is Ruby 1.8, after all, so there's more likely to be
> some blocking call hanging the entire process. AFAIK, some monitoring
> software runs a background thread in the unicorn worker and maybe the
> OpenSSL extension doesn't work as well if it encountered network
> problems under Ruby 1.8
We don't explicitly create any threads in our Rails code. We do communicate with backgroundrb worker processes, although none of today's strangeness involved any routes that would hit backgroundrb workers.
Is there any instrumentation that I could add that might help with debugging in the future? ($request_time and $upstream_response_time are now in my nginx logs.) We have noticed these "unexplainable timeouts" before, but typically for a single worker. If there's some debugging that could be added, I might be able to track the problem down during these one-off events.
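For reference, the nginx logging change looks roughly like the sketch below. The format name "timed" and the log path are placeholders, not our exact config; the two timing variables are the standard nginx ones.

```nginx
# Sketch only: "timed" and the access_log path are hypothetical.
# log_format belongs in the http {} block.
log_format timed '$remote_addr - $remote_user [$time_local] "$request" '
                 '$status $body_bytes_sent '
                 'rt=$request_time urt=$upstream_response_time';

# $request_time: total time nginx spent on the request, in seconds.
# $upstream_response_time: time spent waiting on the upstream (unicorn).
access_log /var/log/nginx/access.log timed;
```

A large gap between rt and urt would point at the client or nginx side; when the two are close, the time is being spent in the unicorn worker.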
I absolutely appreciate all your help!