[Mongrel] random cpu spikes, EBADF errors

Zachary Powell zach at plugthegap.co.uk
Mon Oct 29 16:09:17 EDT 2007


Hi All,
A follow-up to the CPU/EBADF issue I was having with LSWS:


http://www.litespeedtech.com/support/forum/showthread.php?t=1012&goto=newpost
Here is the message that has just been posted:
***************
The problem is on the Mongrel side. As shown in the strace output, file handle 5
is the reverse-proxy connection from LSWS to Mongrel. Mongrel read the
request, then closed the connection immediately without sending anything
back, then tried to close it again, which returned EBADF because the file
descriptor had already been closed.

When Mongrel is working correctly, it should send the reply back to LSWS
before closing the socket.

The root cause of the problem is on the Mongrel side; however, LSWS should
fail the request after a few retries. We will implement that in our 3.3
release.
***************
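
To make that concrete, here is a minimal standalone Ruby sketch (my own
illustration, not Mongrel's actual code) of the pattern the strace showed:
once the descriptor has been closed, any further syscall on that fd number
comes back with EBADF.

require 'socket'

# Stand-ins for the LSWS-to-Mongrel proxy connection ("fd 5" in the trace).
server = TCPServer.new('127.0.0.1', 0)
client = TCPSocket.new('127.0.0.1', server.addr[1])
conn   = server.accept

stale = IO.for_fd(conn.fileno)   # a second handle on the same descriptor number
conn.close                       # first close(fd) = 0, and no reply was written

begin
  stale.syswrite("HTTP/1.1 200 OK\r\n\r\n")   # any further use of the stale fd...
rescue Errno::EBADF => e
  puts "#{e.class}: #{e.message}"             # ...fails, just like close(5) = -1 EBADF
end

When it's behaving, the order is the other way around: write the response on
the socket first, then close it exactly once.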


Zach

On 10/24/07, Zachary Powell <zach at plugthegap.co.uk> wrote:
>
> Hi, a short follow-up. I ran
>
> lsof -p 27787 | grep mongrel  -c
> => 64 files open
>
>
> while it was spiking, so it doesn't look like a max-open-files issue. Looking
> through the list, the only thing of note was
>
>
> mongrel_r 27787 rails    5w  unknown
>  /proc/27787/fd/5 (readlink: No such file or directory)
>
> 5 was likely the descriptor getting EBADF (it's usually 4-6). I didn't
> catch which file it was before the problem ended (running lsof was very
> slow, taking about 20 seconds while the problem was occurring), but previously
> when I've checked it has always been there and accessible from the web. Also,
> the directory in question is full of files, so it hasn't been swept recently
> (so it couldn't have been cache expiry during a read or anything weird like that).
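>
> For reference, here's a quick Ruby sketch (my own, nothing from Mongrel's
> internals) that does a similar fd listing straight from /proc, assuming
> Linux's layout and taking the mongrel pid as its argument (run it as the
> mongrel user or root). A descriptor that disappears mid-listing, like fd 5
> seemed to above, shows up as a broken readlink:
>
> pid = ARGV[0] or abort "usage: ruby list_fds.rb <mongrel pid>"
> Dir.glob("/proc/#{pid}/fd/*").sort.each do |entry|
>   target = begin
>     File.readlink(entry)    # where the descriptor points: file, socket, pipe...
>   rescue SystemCallError => e
>     "(gone: #{e.class})"    # the entry vanished between the glob and the readlink
>   end
>   puts "#{File.basename(entry)} -> #{target}"
> end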
>
>
> Zach
>
>
>
>
> On 10/24/07, Zachary Powell <zach at plugthegap.co.uk> wrote:
> >
> > Hi Guys,
> >
> > I'm using the latest LiteSpeed 3.2.4 (just upgraded), Mongrel 1.0.1, and
> > Ruby 1.8.6, running Red Hat Enterprise Linux ES 4. We have
> >
> >
> > Web/Mysql/Mail server:
> >
> > # RAID Configuration: RAID 1 (73GBx2)
> >
> > # HP Memory: 2 GB HP RAM
> >
> > # HP DL385 G2 Processor: Dual Socket Dual Core Opteron 2214 2.2 GHz
> >
> >
> > (load runs between 0.10 and 0.20, mostly MySQL, but spikes when the issue
> > occurs, with LiteSpeed taking ~30% CPU)
> >
> >
> > and App Server:
> >
> >
> > # RAID Configuration: RAID 1 (146GBx2)
> >
> > # HP Memory: 4 GB HP RAM
> >
> > # HP DL385 G2 Processor: Dual Socket Dual Core Opteron 2214 2.2 GHz
> >
> >
> > (load is usually 0.20 - 0.60, with legitimate spikes up to 1.8 for backend
> > processes. It spikes to 2-4 when the problem happens, depending on how many
> > mongrels are affected (sometimes 2))
> >
> >
> > And these are the mongrels running:
> >
> >
> > MONGREL   CPU  MEM  VIR RES  DATE  TIME   PID
> > 8010:hq   3.2  3.4  145 138  Oct23 43:13  20409
> > 8011:hq   0.6  3.0  132 125  Oct23 8:15   20412
> > 8012:hq   0.1  1.8   81  74  Oct23 1:28   20415
> > 8015:dhq  0.0  1.0   50  44  02:41 0:08   4775
> > 8016:dhq  0.0  0.7   34  30  02:41 0:01   4778
> > 8017:dhq  0.0  0.7   36  30  02:41 0:01   4781
> > 8020:af   9.0  3.3  143 137  Oct23 114:41 26600
> > 8021:af   5.6  2.0   90  84  Oct23 71:56  26607
> > 8022:af   2.4  1.8   80  74  Oct23 30:37  26578
> > 8025:daf  0.0  1.0   49  42  02:41 0:04   4842
> > 8026:daf  0.0  0.7   34  30  02:41 0:02   4845
> > 8027:daf  0.0  0.7   36  30  02:41 0:02   4848
> > 8030:pr   0.1  1.5   67  61  Oct23 1:50   16528
> > 8031:pr   0.0  0.9   47  40  Oct23 0:17   16532
> > 8032:pr   0.0  0.9   44  38  Oct23 0:13   16536
> > 8035:dpr  0.2  0.7   36  30  12:30 0:02   22335
> > 8036:dpr  0.2  0.7   35  30  12:30 0:02   22338
> > 8037:dpr  0.2  0.7   35  30  12:30 0:02   22341
> >
> >
> > (The ones starting with d are in dev mode; I'll try turning them off
> > tonight. I hadn't considered this a spill-over issue, but it happened just
> > now and turning them off didn't ease it. We had a lot fewer mongrels when the
> > problem was occurring before, but that was also a one-box setup.)
> >
> >
> > It's the 8020-8022 ones that have trouble. LiteSpeed is indeed picking up the
> > page cache, and while it's happening I can go to one of the pages in
> > question, or cat it over SSH, with no problems. I've monitored the Rails log
> > while it was happening and haven't seen any EBADF errors spilling over. Though
> > it's conceivable that a spike of hits from a Google crawl could cause a
> > problem; I could try siege/ab tonight.
> >
> >
> > I'm not familiar with checking file limits, but this is what I get from
> > a command I found by googling:
> >
> >
> > cat /proc/sys/fs/file-nr:
> > 2920    0       372880
> > (that's allocated / unused / system-wide max, as I understand it)
> >
> >
> > ulimit -a:
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 1024
> > max locked memory       (kbytes, -l) 32
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 1024
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > stack size              (kbytes, -s) 10240
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) 77823
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> >
> >
> >
> > I'll post the ls -l /proc/{mongrel.pid}/fd/ output next time it happens and
> > see how close it's getting to 1024. (Though again it seems strange that it
> > could be hitting the max-open-files limit when the other, non-cached pages on
> > that pid, which definitely open files, work fine.)
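> >
> > In the meantime, here's a rough sketch of watching the same numbers from
> > inside the app itself (my own illustration; it assumes Linux's /proc layout
> > and a Ruby new enough to have Process.getrlimit, so not the 1.8.6 these
> > mongrels run):
> >
> > # Count this process's open descriptors (give or take the one the glob
> > # itself uses) and compare against the soft "open files" limit.
> > open_fds   = Dir.glob("/proc/#{Process.pid}/fd/*").size
> > soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
> > puts "#{open_fds} fds open; soft limit #{soft}, hard limit #{hard}"
> > warn "getting close to the fd limit" if open_fds > soft * 0.8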
> >
> >
> > Thanks again,
> >
> >
> > Zach
> >
> >
> >
> > On 10/23/07, Dave Cheney <dave at cheney.net> wrote:
> > >
> > > Are you using a web server in front of your mongrels? It should be
> > > picking up the page-cached file before even considering handing the request
> > > to a mongrel.
> > >
> > > Cheers
> > >
> > >
> > > Dave
> > >
> > > On 24/10/2007, at 11:30 AM, Zachary Powell wrote:
> > >
> > > close(5)                                = 0
> > > close(5)                                = -1 EBADF (Bad file descriptor)
> > > read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
> > > close(5)                                = 0
> > > close(5)                                = -1 EBADF (Bad file descriptor)
> > >
> > >
> > > The file it's trying to get is page-cached and exists/is fine (I can
> > > even go to the URL while this is going on).
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
>
>
>