[Mongrel] random cpu spikes, EBADF errors

Jim Hogue jjhogue at sbcglobal.net
Wed Oct 24 11:21:59 EDT 2007


Whoops, that nice command needs --1 (a negative adjustment, which
raises priority) rather than -1, as in

sudo nice --1 lsof -p 27787
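
The same thing can be spelled with the POSIX -n option, which is harder
to get wrong: "nice -n -1" is the same as "nice --1" (higher priority,
needs root), while "nice -n 1" is the same as "nice -1" (lower
priority). A minimal sketch, assuming GNU coreutils, where nice with no
command prints the current niceness. It only uses a positive adjustment
so that it runs without root:

```shell
# Show how an adjustment changes the niceness a child process gets.
# Assumption: GNU coreutils "nice", which prints the current niceness
# when it is run without a command.
base=$(nice)                 # niceness of the current shell
lowered=$(nice -n 1 nice)    # inner "nice" runs with the niceness raised by 1
echo "shell niceness: $base, under 'nice -n 1': $lowered"
```

With root, swapping in "nice -n -1" should report a value one lower than
the shell's instead.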

    - Jim

----- Original Message ----- 
From: "Jim Hogue" <jjhogue at sbcglobal.net>
To: <zach at plugthegap.co.uk>; <mongrel-users at rubyforge.org>
Sent: Wednesday, October 24, 2007 11:07 AM
Subject: Re: [Mongrel] random cpu spikes, EBADF errors


> Assuming you have access to a root shell on your hosting service, then
> when you are watching a slow system you can start a root shell and
> renice that shell to a value of -1. This gives the shell a higher
> priority, so your commands will run much more quickly. For example:
>
> bash #: su
> Password:
> root #: ps
>  PID TTY          TIME CMD
> 1443 pts/2    00:00:00 bash
> 1447 pts/2    00:00:00 ps
> root #: renice -1 1443
> root #: suspend
> bash #:
>
> Use suspend to leave the root shell stopped in the background, then
> bring it to the foreground with fg when the problem occurs and run
> your commands from it.
>
> If you don't have a root shell, but have sudo, then use
>
> sudo nice -1 lsof -p 27787
> etc.
>
>    - Jim Hogue
>
> ----- Original Message ----- 
> From: "Zachary Powell" <zach at plugthegap.co.uk>
> Cc: <mongrel-users at rubyforge.org>
> Sent: Wednesday, October 24, 2007 9:14 AM
> Subject: Re: [Mongrel] random cpu spikes, EBADF errors
>
>
>> Hi, short follow up. I did
>> lsof -p 27787 | grep mongrel  -c
>> => 64 files open
>>
>> while it was spiking, so it doesn't look like a max file issue. Looking
>> through the list, the only thing of note was
>>
>> mongrel_r 27787 rails    5w  unknown  /proc/27787/fd/5 (readlink: No such file or directory)
>>
>> 5 was likely the id of the file getting EBADF (usually 4-6). I didn't
>> catch which file it was before the problem ended (running lsof was very
>> slow; it took 20 secs while the problem was occurring), but previously
>> when I've checked, it has always been there and accessible from the web.
>> Also, the dir in question is full of files, so it hasn't been swept
>> recently (it couldn't have been cache expiry while reading or anything
>> weird like that).
>>
>> Zach
>>
>>
>>
>> On 10/24/07, Zachary Powell <zach at plugthegap.co.uk> wrote:
>>>
>>> Hi Guys,
>>>
>>> I'm using the latest Litespeed 3.2.4 (just upgraded), Mongrel 1.0.1, and
>>> Ruby 1.8.6, running Red Hat Enterprise Linux ES 4. We have
>>>
>>>
>>> Web/Mysql/Mail server:
>>>
>>> # RAID Configuration: RAID 1 (73GBx2)
>>>
>>> # HP Memory: 2 GB HP RAM
>>>
>>> # HP DL385 G2 Processor: Dual Socket Dual Core Opteron 2214 2.2 GHz
>>>
>>>
>>> (runs between 0.10 and 0.20, mostly MySQL, but spikes when the issue
>>> occurs, with litespeed taking ~30% cpu)
>>>
>>>
>>> and App Server:
>>>
>>>
>>> # RAID Configuration: RAID 1 (146GBx2)
>>>
>>> # HP Memory: 4 GB HP RAM
>>>
>>> # HP DL385 G2 Processor: Dual Socket Dual Core Opteron 2214 2.2 GHz
>>>
>>>
>>> (usually 0.20 - 0.60, with legitimate spikes up to 1.8 for backend
>>> processes. Spikes up to 2-4 when it happens, depending on how many
>>> mongrels get the problem (sometimes 2))
>>>
>>>
>>> And these are the  mongrels running:
>>>
>>>
>>> MONGREL   CPU  MEM  VIR RES  DATE  TIME   PID
>>> 8010:hq   3.2  3.4  145 138  Oct23 43:13  20409
>>> 8011:hq   0.6  3.0  132 125  Oct23 8:15   20412
>>> 8012:hq   0.1  1.8   81  74  Oct23 1:28   20415
>>> 8015:dhq  0.0  1.0   50  44  02:41 0:08   4775
>>> 8016:dhq  0.0  0.7   34  30  02:41 0:01   4778
>>> 8017:dhq  0.0  0.7   36  30  02:41 0:01   4781
>>> 8020:af   9.0  3.3  143 137  Oct23 114:41 26600
>>> 8021:af   5.6  2.0   90  84  Oct23 71:56  26607
>>> 8022:af   2.4  1.8   80  74  Oct23 30:37  26578
>>> 8025:daf  0.0  1.0   49  42  02:41 0:04   4842
>>> 8026:daf  0.0  0.7   34  30  02:41 0:02   4845
>>> 8027:daf  0.0  0.7   36  30  02:41 0:02   4848
>>> 8030:pr   0.1  1.5   67  61  Oct23 1:50   16528
>>> 8031:pr   0.0  0.9   47  40  Oct23 0:17   16532
>>> 8032:pr   0.0  0.9   44  38  Oct23 0:13   16536
>>> 8035:dpr  0.2  0.7   36  30  12:30 0:02   22335
>>> 8036:dpr  0.2  0.7   35  30  12:30 0:02   22338
>>> 8037:dpr  0.2  0.7   35  30  12:30 0:02   22341
>>>
>>>
>>> (the ones starting with "d" are in dev mode. I was going to try turning
>>> them off tonight, as I hadn't considered this a spill-over issue, but it
>>> happened just now and turning them off didn't ease it. We had a lot
>>> fewer when the problem was occurring before, but that was also a 1 box
>>> set-up).
>>>
>>>
>>> It's the 8020-8022 ones that have trouble. It is indeed picking up the
>>> page cache, and while it's happening I can go to one of the pages in
>>> question, or cat it over SSH, with no problems. I've monitored the rails
>>> log while it was happening and haven't seen any EBADF spilling over.
>>> It's conceivable, though, that a spike of hits from a Google crawl
>>> could cause a problem; I could try siege/ab tonight.
>>>
>>>
>>> Not familiar with checking file limits, but this is what I get from
>>> googling a command:
>>>
>>>
>>> cat /proc/sys/fs/file-nr:
>>> 2920    0       372880
>>>
>>>
>>> ulimit -a:
>>> core file size          (blocks, -c) 0
>>> data seg size           (kbytes, -d) unlimited
>>> file size               (blocks, -f) unlimited
>>> pending signals                 (-i) 1024
>>> max locked memory       (kbytes, -l) 32
>>> max memory size         (kbytes, -m) unlimited
>>> open files                      (-n) 1024
>>> pipe size            (512 bytes, -p) 8
>>> POSIX message queues     (bytes, -q) 819200
>>> stack size              (kbytes, -s) 10240
>>> cpu time               (seconds, -t) unlimited
>>> max user processes              (-u) 77823
>>> virtual memory          (kbytes, -v) unlimited
>>> file locks                      (-x) unlimited
>>>
>>>
>>>
>>>
>>> Will post the ls -l /proc/{mongrel.pid}/fd/ next time it happens, and
>>> see how close it's getting to 1024. (Though again it seems strange that
>>> it could be having a max open problem when the other non-cached pages
>>> on that pid that are definitely opening files work fine).
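
A rough sketch of that check, assuming a Linux /proc and that the target
process has the same soft limit as the shell you run it from.
TARGET_PID is a made-up variable name for this example; it defaults to
the current shell so the snippet runs as-is, and reading another user's
/proc/PID/fd needs root:

```shell
# Count the descriptors a process holds and show them next to the limit.
# TARGET_PID is hypothetical; set it to a Mongrel pid such as 27787.
pid=${TARGET_PID:-$$}
open=$(ls /proc/"$pid"/fd | wc -l)   # one entry per open descriptor
limit=$(ulimit -n)                   # soft limit of this shell (assumed to match)
echo "pid $pid: $open open descriptors, soft limit $limit"
```

Run it while the spike is happening; a count far below the limit points
away from descriptor exhaustion.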
>>>
>>>
>>> Thanks again,
>>>
>>>
>>> Zach
>>>
>>>
>>>
>>> On 10/23/07, Dave Cheney <dave at cheney.net> wrote:
>>> >
>>> > Are you using a web server in front of your mongrels? It should be
>>> > picking up the page cached file before even considering handing the
>>> > request to a mongrel.
>>> >
>>> > Cheers
>>> >
>>> >
>>> > Dave
>>> >
>>> > On 24/10/2007, at 11:30 AM, Zachary Powell wrote:
>>> >
>>> > close(5)                                = 0
>>> > close(5)                                = -1 EBADF (Bad file descriptor)
>>> > read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
>>> > close(5)                                = 0
>>> > close(5)                                = -1 EBADF (Bad file descriptor)
>>> >
>>> >
>>> > the file it's trying to get is page cached, and exists/is fine (can
>>> > even go to the url while this is going on).
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>
>
>
> --------------------------------------------------------------------------------
>
>
>> _______________________________________________
>> Mongrel-users mailing list
>> Mongrel-users at rubyforge.org
>> http://rubyforge.org/mailman/listinfo/mongrel-users
>


