I have a complex vlad deploy script that worked fine on our test pods but sometimes failed when I ran it on our production
site, which has 30 hosts. I have not pinpointed the problem, but the more `run` calls there are, the worse it gets.
The original error was "Too many open files". I added `/usr/sbin/lsof -p $$ | wc -l` to see how many open
files there were, and I got up to 800, which is far more than there should be. Our IT department doubled `ulimit -n`,
but then I started getting random errors. The deploy is simply using too many resources.
My workaround was to reduce the number of `run` calls by adding commands to an array and running them all at once
under `set -e`.
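
The batching workaround can be sketched roughly as follows. This is a minimal illustration, not my actual script: `batch_script` is a hypothetical helper name and the paths are made up. The point is that the queued commands collapse into one shell script, so there is one `run` call per host instead of one per command.

```ruby
# Hypothetical helper: join queued shell commands into a single script.
# "set -e" makes the remote shell stop at the first failing command.
def batch_script(commands)
  (["set -e"] + commands).join("\n")
end

# Queue commands instead of calling run for each one.
# These paths are illustrative only.
queued = []
queued << "cd /var/www/app/current"
queued << "cp config/settings.yml.example config/settings.yml"

script = batch_script(queued)
puts script

# Inside a vlad remote_task, the whole batch would then be sent once:
#   run script
```

With `set -e` the batch still fails as a unit when any command fails, so error handling stays close to what separate `run` calls gave.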
I do not know what pushes it over the edge, but one possibility is that I do my own conditional update/restart based
on the error code of a remote diff. That requires catching the exception from run -- unless there is a better way.
I will also open a feature request on that.
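
For reference, the conditional update pattern looks roughly like the sketch below. It assumes only that `run` raises some exception when the remote command exits non-zero; `RemoteCommandFailed` is a stand-in for that exception class, and `remote_changed?` and the `runner` argument are hypothetical names used so the example can be read on its own.

```ruby
# Stand-in for whatever exception run raises on a non-zero exit status
# (an assumption; substitute the real class in an actual task).
class RemoteCommandFailed < StandardError; end

# Returns true when the remote diff exits non-zero, false otherwise.
# runner is any callable that executes the command and raises
# RemoteCommandFailed on a non-zero exit status.
# Caveat: diff exits 1 for "files differ" and 2 for "trouble";
# this sketch treats both as "changed".
def remote_changed?(runner, old_path, new_path)
  runner.call("diff -q #{old_path} #{new_path}")
  false
rescue RemoteCommandFailed
  true
end

# Inside a vlad task the runner would wrap run, for example:
#   runner = lambda { |cmd| run cmd }
#   restart_needed = remote_changed?(runner, old_conf, new_conf)
```

Each check costs one extra `run` per host, which is part of why I suspect this pattern in the resource problem.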