Sending ABRT to timeout-errant process before KILL
Eric Wong
normalperson at yhbt.net
Thu Sep 8 15:13:52 EDT 2011
"J. Austin Hughey" <jhughey at engineyard.com> wrote:
> The general idea is that I'd like to have some way to "warn" the
> application when it's about to be killed. I've patched
> murder_lazy_workers to send an abort signal via kill_worker, sleep for
> 5 seconds, then look and see if the process still exists by using
> Process.getpgid. If so, it sends the original kill command, and if
> not, a rescue block kicks in to prevent the raised error from
> Process.getpgid from making things explode.
The problem with anything other than SIGKILL (or SIGSTOP) is that it
assumes the Ruby VM is working and in a good state.
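For reference, the escalation described in the quoted patch can be sketched outside Unicorn roughly as below. This is a minimal sketch, assuming the target is a direct child process: `warn_then_kill` and its `grace` parameter are hypothetical names, not Unicorn's `murder_lazy_workers` or `kill_worker`. It polls `Process.waitpid` with `WNOHANG` instead of `Process.getpgid`, since an exited child that has not been reaped yet can still be visible to `Process.getpgid`. The ABRT step only helps if the child's Ruby VM is healthy enough to run a trap handler, which is the caveat above.

```ruby
# Hypothetical helper for illustration only; not part of Unicorn.
# Sends SIGABRT as a warning, waits up to `grace` seconds for the child
# to exit, then falls back to SIGKILL.
def warn_then_kill(pid, grace)
  Process.kill(:ABRT, pid)
  deadline = Time.now + grace
  while Time.now < deadline
    # waitpid with WNOHANG returns nil while the child is still running
    return :exited_after_abrt if Process.waitpid(pid, Process::WNOHANG)
    sleep 0.05
  end
  Process.kill(:KILL, pid) # cannot be trapped or ignored
  Process.waitpid(pid)     # reap the child so no zombie is left behind
  :killed
end
```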
> I've created a simulation app, built on Rails 3.0, that uses a generic
> "posts" controller to simulate a long-running request. Instead of
> just throwing a straight-up sleep 65 in there, I have it running
> through sleeping 1 second on a decrementing counter, and doing that 65
> times. The reason is because, assuming I've read the code correctly,
> even with my "skip sleeping workers" commented line below, it'll skip
> over the process, thus rendering my simulation of a long-running
> process invalid. However, clarification on this point is certainly
> welcome. You can see the app here:
> https://github.com/jaustinhughey/unicorn_test/blob/master/app/controllers/posts_controller.rb
(purely for educational purposes, since I'll point you towards another
approach I believe is better)
>   Signal.trap(:ABRT) do
>     # Write some stuff to the Rails log
>     logger.info "Caught Unicorn kill exception!"
If this is the logger that ships with Ruby, it locks a Mutex, so it'll
deadlock if another SIGABRT is received while logging the above
statement (a very small window, admittedly).
>     # Do a controlled disconnect from ActiveRecord
>     ActiveRecord::Base.connection.disconnect!
Likewise, if AR needs to lock internal structures before disconnecting,
it also must be reentrant. Ruby's normal Mutex implementation is not
reentrant-safe.
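A small sketch of the non-reentrancy, without any signals involved: the same thread tries to take a Mutex it already holds, which is the situation a handler would be in if it interrupted code that was holding the logger's lock. `log_from_handler` is a hypothetical stand-in, not Logger or ActiveRecord code. In this same-thread case Ruby detects the recursive lock and raises ThreadError; inside a real trap handler the outcome depends on the Ruby version, so treat this as an illustration of the hazard only.

```ruby
require "thread"

lock = Mutex.new

# Hypothetical stand-in for a signal handler that logs while the
# interrupted code already holds the same lock.
def log_from_handler(lock)
  lock.synchronize { :logged }
end

outcome = lock.synchronize do
  begin
    log_from_handler(lock) # second lock attempt by the same thread
  rescue ThreadError => e
    e.class                # Mutex is not reentrant, so this path is taken
  end
end
```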
> So it looks like Worker 1 is hitting a strange/false timeout of
> 1315467289 seconds, which isn't really possible as it wasn't even
> running 1315467289 seconds prior to that (which equates to roughly 41
> years ago if my math is right).
You're getting this because you removed the following line:
    0 == tick and next # skip workers that are sleeping
Here "sleeping" means the worker hasn't accepted a client connection yet,
not that it is sleeping while processing a client request. I'll clarify
that in the code.
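To see where the 1315467289 comes from, here is a rough re-creation of the arithmetic. It assumes the tick is stored as seconds since the epoch and is 0 for a worker that has not accepted a connection yet, which is what the number in the log suggests; `overdue_seconds` and `skip_idle` are hypothetical names, not Unicorn's code.

```ruby
# Hypothetical re-creation of the elapsed-time check; not Unicorn's code.
# Returns nil for an idle worker when the guard is in place, otherwise
# the number of seconds since the worker's last tick.
def overdue_seconds(now, tick, skip_idle = true)
  return nil if skip_idle && tick == 0
  now - tick
end

now = 1315467289 # the "timeout" value reported in the log

with_guard    = overdue_seconds(now, 0)        # idle worker is skipped
without_guard = overdue_seconds(now, 0, false) # now - 0 is the epoch time itself

# Roughly how many years the bogus value represents
years = without_guard / (365.25 * 24 * 3600)
```

Without the guard, an idle worker's elapsed time is simply the current epoch time, which is why it looks like a timeout of about 41 years.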
> Needless to say, I'm a bit stumped at this point, and would sincerely
> appreciate another point of view on this. Am I going about this all
> wrong? Is there a better approach I should consider? And if I'm on
> the right track, how can I get this to work regardless of how many
> Unicorn workers are running?
Since it's an application error, it can be handled in middleware. You
can try something like the Rainbows::ThreadTimeout middleware. It's
currently Rainbows!-specific, but can easily be made to work with
Unicorn.
    git clone git://bogomips.org/rainbows
    cat rainbows/lib/rainbows/thread_timeout.rb
This is conceptually similar to "timeout" in the Ruby standard library,
but does not allow nesting.
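To show the general shape of a timeout done in middleware, here is a minimal sketch built on the standard library's Timeout. It is not the Rainbows::ThreadTimeout implementation (read the file above for that); the class name `RequestTimeout`, the `seconds` argument, and the 503 response are all assumptions made for illustration.

```ruby
require "timeout"

# Hypothetical Rack-style middleware for illustration only; this uses
# stdlib Timeout and is NOT how Rainbows::ThreadTimeout is implemented.
class RequestTimeout
  def initialize(app, seconds)
    @app = app
    @seconds = seconds
  end

  def call(env)
    # Abort the wrapped application call if it runs longer than @seconds
    Timeout.timeout(@seconds) { @app.call(env) }
  rescue Timeout::Error
    # The application gets to respond (and log, clean up, etc.) itself
    [503, { "Content-Type" => "text/plain" }, ["request timed out\n"]]
  end
end
```

In a config.ru this would presumably be wired in with something like `use RequestTimeout, 30`, with the value kept below the Unicorn timeout so the middleware fires before the master resorts to SIGKILL.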
I'll try to clarify more later today if you have questions; I'm in a
bit of a rush right now.
--
Eric Wong