Problem with binding UNIX listeners before checking PID
normalperson at yhbt.net
Mon Oct 4 00:17:13 EDT 2010
Jordan Ritter <jpr5 at darkridge.com> wrote:
> I have lately been frustrated by the following use case:
> 1. Run nginx/unicorn in production, listening on a UNIX socket
> with a defined pid file. Things run good.
> 2. Someone pushes code, unicorn restarts just fine, workers are
> all up and running.
> 3. But someone is suspicious, or maybe they forget which
> box they're logged into, so they invoke unicorn manually.
> Same directory, same settings.
> 4. It looks like the pid file check kicked in, because unicorn
> refuses to boot - hey, it's already running, bugger off. great.
> 5. BUT, this happened *after* the listener processing: the
> manually-invoked unicorn unlinks the real unicorn master's unix
> listener, so it's left dead in the water and everybody loses.
> unicorn master doesn't know its listener is actually gone (but lsof shows
> open unix socket fd, netstat shows unix socket still present, so cursory
> investigation is misleading), but nginx keeps spewing ECONNREFUSEDs
> because the unix socket it's hitting belongs to that accidental unicorn
> instance that already decided not to stick around.
> I think this is effectively about a behavioral difference in
> Unicorn::SocketHelper#bind_listen around the handling of UNIX vs. TCP
> sockets (this doesn't happen with TCP sockets because there's no
> unlink/disconnect step), and the fact that HttpServer#start evaluates
> the listener config before the PID path/config.
> Now I see comments in and around HttpServer#initialize talking about races
> wrt binding to the listener and whatnot, and being newish to the codebase
> I admit I haven't yet fully absorbed all the considerations at play.
> But I think it's fair to say that killing the listener(s) (in the UNIX
> socket case) before discovering you shouldn't have run in the first place
> (from the PID file) qualifies as buggy/bad/broken behavior.
Thanks for the detailed bug report. I knew from experience with other
daemons that lingering UNIX sockets caused trouble for some users, but
I failed to take into account the case where a user mistakenly starts
the process twice.
Yes, getting pid file writing/ordering "right" is very tricky.
> I might suggest simply swapping their processing order in #start, but
> given the complexity of in-place restarts and other race considerations,
> I have doubts solving this would be that easy.
That wouldn't work if pid files weren't in use at all.
> Any thoughts/ideas?
A simpler check would be to use connect(2) (but not make any HTTP request)
to see if the socket is alive. Patch coming.
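For illustration only, here is a rough sketch of what such a liveness
check could look like in Ruby. It is not the patch itself: the helper
name unix_socket_alive? is hypothetical, and the choice of which errors
count as "dead" is an assumption.

```ruby
require 'socket'

# Hypothetical helper (not part of Unicorn): returns true if some process
# is accepting connections on the UNIX socket at +path+.
# Only connect(2) is performed; no HTTP request is sent.
def unix_socket_alive?(path)
  UNIXSocket.new(path).close
  true
rescue Errno::ECONNREFUSED, Errno::ENOENT
  # ECONNREFUSED: the socket file exists but nobody is listening (stale).
  # ENOENT: no socket file at that path.
  false
end
```

A caller would unlink and rebind the path only when this returns false.
There is still a window between the check and the bind, so this narrows
the problem rather than eliminating it.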
 - I don't believe there actually is a way to always be right,
just less bad/broken than the alternatives.