[Mongrel] random cpu spikes, EBADF errors

Jim Hogue jjhogue at sbcglobal.net
Wed Oct 24 11:07:52 EDT 2007


Assuming you have root access on your hosting service: when you are
watching a slow system, start a root shell and renice that shell to a
niceness of -1. This gives the shell (and every command you run from it)
a higher scheduling priority, so your diagnostic commands respond much
more quickly. For example:

bash #: su
Password:
root #: ps
  PID TTY          TIME CMD
 1443 pts/2    00:00:00 bash
 1447 pts/2    00:00:00 ps
root #: renice -1 1443
root #: suspend
bash #:

Use suspend to leave the root shell running in the background, then
bring it back to the foreground with fg when the problem recurs and use it.
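To confirm that the renice took effect, a small helper can read the niceness back. This is only a sketch: niceness_of is a hypothetical name, and it assumes a ps that supports -o with the POSIX "nice" field (procps on Linux does). Inside the root shell, $$ also saves you from looking the PID up with ps first.

```shell
# Hypothetical helper: print the niceness of a process
# (default: the current shell, whose PID is $$).
# "nice=" selects the POSIX NI column and suppresses its header.
niceness_of() {
    ps -o nice= -p "${1:-$$}" | tr -d ' '
}
```

For example, after renice -1 $$ in the root shell, niceness_of should print -1.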

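If you don't have a root shell but do have sudo, then use

sudo nice -n -1 lsof -p 27787
etc.

(Note the -n: "nice -1" is read as an adjustment of +1, which lowers the
priority instead of raising it.)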
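The -n matters because of how nice parses its arguments. A quick, unprivileged way to see this is sketched below, assuming GNU coreutils nice, which prints the current niceness when run with no command.

```shell
# With no command, nice prints the current niceness, so running
# "nice" under "nice ARGS" shows what ARGS actually does.
base=$(nice)
echo "current niceness:   $base"
echo "after 'nice -1':    $(nice -1 nice)"     # old syntax: +1, LOWER priority
echo "after 'nice -n 1':  $(nice -n 1 nice)"   # the same adjustment, explicit form
# A negative adjustment (higher priority) needs root, for example:
#   sudo nice -n -1 lsof -p 27787
```

Only negative adjustments need root, which is why the sudo form above spells out -n -1.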

    - Jim Hogue

----- Original Message ----- 
From: "Zachary Powell" <zach at plugthegap.co.uk>
Cc: <mongrel-users at rubyforge.org>
Sent: Wednesday, October 24, 2007 9:14 AM
Subject: Re: [Mongrel] random cpu spikes, EBADF errors


> Hi, short follow-up. I ran
> lsof -p 27787 | grep -c mongrel
> => 64 files open
>
> while it was spiking, so it doesn't look like a max-open-files issue.
> Looking through the list, the only thing of note was
>
> mongrel_r 27787 rails    5w  unknown  /proc/27787/fd/5 (readlink: No such file or directory)
>
> 5 was likely the descriptor number of the file getting EBADF (usually
> 4-6). I didn't catch which file it was before the problem ended (running
> lsof was very slow, taking 20 seconds while the problem was occurring),
> but previously when I've checked, it has always been there and accessible
> from the web. Also, the directory in question is full of files, so it
> hasn't been swept recently (it couldn't have been cache expiry during the
> read or anything weird like that).
>
> Zach
>
>
>
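Since lsof took 20 seconds on the loaded box, reading /proc directly may be a cheaper way to see what a mongrel has open. The sketch below assumes Linux /proc and permission to read the target's fd directory (same user or root); list_fds is a hypothetical name. A descriptor that closes between the directory listing and readlink shows up as "<gone>", which looks like the same race lsof reported as "readlink: No such file or directory".

```shell
# Hypothetical helper: list a process's open descriptors and their
# targets straight from /proc, without lsof's system-wide scan.
list_fds() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # An unmatched glob stays literal: no such PID or no permission.
        if [ "${fd##*/}" = "*" ]; then
            echo "cannot read /proc/$pid/fd" >&2
            return 1
        fi
        # readlink fails if the descriptor was closed in the meantime.
        target=$(readlink "$fd" 2>/dev/null) || target="<gone>"
        printf '%s -> %s\n' "${fd##*/}" "$target"
    done
}
```

Usage, for the mongrel above: list_fds 27787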
> On 10/24/07, Zachary Powell <zach at plugthegap.co.uk> wrote:
>>
>> Hi Guys,
>>
>> I'm using the latest LiteSpeed 3.2.4 (just upgraded), Mongrel 1.0.1 and
>> Ruby 1.8.6, running Red Hat Enterprise Linux ES 4. We have:
>>
>>
>> Web/Mysql/Mail server:
>>
>> # RAID Configuration: RAID 1 (73GBx2)
>>
>> # HP Memory: 2 GB HP RAM
>>
>> # HP DL385 G2 Processor: Dual Socket Dual Core Opteron 2214 2.2 GHz
>>
>>
>> (runs between 0.10 and 0.20, mostly MySQL, but spikes up when the issue
>> occurs, with LiteSpeed taking ~30% CPU)
>>
>>
>> and App Server:
>>
>>
>> # RAID Configuration: RAID 1 (146GBx2)
>>
>> # HP Memory: 4 GB HP RAM
>>
>> # HP DL385 G2 Processor: Dual Socket Dual Core Opteron 2214 2.2 GHz
>>
>>
>> (usually 0.20-0.60, with legitimate spikes up to 1.8 for backend
>> processes; it spikes up to 2-4 when the problem happens, depending on how
>> many mongrels are affected (sometimes 2))
>>
>>
>> And these are the mongrels running:
>>
>>
>> MONGREL   CPU  MEM  VIR RES  DATE  TIME   PID
>> 8010:hq   3.2  3.4  145 138  Oct23 43:13  20409
>> 8011:hq   0.6  3.0  132 125  Oct23 8:15   20412
>> 8012:hq   0.1  1.8   81  74  Oct23 1:28   20415
>> 8015:dhq  0.0  1.0   50  44  02:41 0:08   4775
>> 8016:dhq  0.0  0.7   34  30  02:41 0:01   4778
>> 8017:dhq  0.0  0.7   36  30  02:41 0:01   4781
>> 8020:af   9.0  3.3  143 137  Oct23 114:41 26600
>> 8021:af   5.6  2.0   90  84  Oct23 71:56  26607
>> 8022:af   2.4  1.8   80  74  Oct23 30:37  26578
>> 8025:daf  0.0  1.0   49  42  02:41 0:04   4842
>> 8026:daf  0.0  0.7   34  30  02:41 0:02   4845
>> 8027:daf  0.0  0.7   36  30  02:41 0:02   4848
>> 8030:pr   0.1  1.5   67  61  Oct23 1:50   16528
>> 8031:pr   0.0  0.9   47  40  Oct23 0:17   16532
>> 8032:pr   0.0  0.9   44  38  Oct23 0:13   16536
>> 8035:dpr  0.2  0.7   36  30  12:30 0:02   22335
>> 8036:dpr  0.2  0.7   35  30  12:30 0:02   22338
>> 8037:dpr  0.2  0.7   35  30  12:30 0:02   22341
>>
>>
>> (The ones starting with d are in dev mode; I will try turning them off
>> tonight. I hadn't considered this a spill-over issue, but it happened
>> just now and turning them off didn't ease it. We had a lot fewer when
>> the problem was occurring before, but also a one-box setup.)
>>
>>
>> It's the 8020-8022 ones that have trouble. It is indeed picking up the
>> page cache, and while it's happening I can go to one of the pages in
>> question, or cat it over SSH, with no problems. I've monitored the Rails
>> log while it was happening and haven't seen any EBADF spilling over. It's
>> conceivable that a spike of hits from the Google crawler could cause a
>> problem, though; I could try siege/ab tonight.
>>
>>
>> I'm not familiar with checking file limits, but this is what I get from
>> a command I found by googling:
>>
>>
>> cat /proc/sys/fs/file-nr:
>> 2920    0       372880
>>
>>
>> ulimit -a:
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) unlimited
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 1024
>> max locked memory       (kbytes, -l) 32
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 1024
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> stack size              (kbytes, -s) 10240
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) 77823
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
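The three numbers in /proc/sys/fs/file-nr are, per proc(5), the allocated file handles, the free handles (always reported as 0 on Linux 2.6 and later) and the system-wide maximum. A sketch that labels them, with a hypothetical function name and the path taken as a parameter so it can be tried on a saved copy:

```shell
# Hypothetical helper: label the fields of /proc/sys/fs/file-nr and
# show how much of the system-wide handle table is in use.
file_nr_report() {
    awk '{ printf "allocated=%d free=%d max=%d (%.2f%% of max)\n", $1, $2, $3, 100 * $1 / $3 }' \
        "${1:-/proc/sys/fs/file-nr}"
}
```

On the figures above it should print allocated=2920 free=0 max=372880 (0.78% of max), so the system-wide table is nowhere near full; the per-process limit of 1024 from ulimit -n is the one worth watching.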
>>
>>
>>
>>
>> I will post the output of ls -l /proc/{mongrel.pid}/fd/ next time it
>> happens, and see how close it's getting to 1024. (Though again, it seems
>> strange that it could be hitting the open-files limit when the other,
>> non-cached pages on that PID, which definitely open files, work fine.)
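Counting the entries under /proc/PID/fd and comparing them with the open-files limit can be scripted roughly as below. This is a sketch with a hypothetical function name; it assumes the mongrel runs with the same ulimit -n as the calling shell unless a limit is passed explicitly, and reading another user's fd directory needs root.

```shell
# Hypothetical helper: how many descriptors does PID have open, and how
# many remain before the open-files limit (default: this shell's ulimit -n)?
fd_headroom() {
    pid=$1
    limit=${2:-$(ulimit -n)}
    count=$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)
    count=$((count))    # normalise any padding from wc
    case $limit in
        unlimited) echo "pid $pid: $count descriptors open, no limit"; return ;;
    esac
    echo "pid $pid: $count of $limit descriptors open, $((limit - count)) to spare"
}
```

A count in the region of the 64 that lsof showed, against a limit of 1024, would leave 960 descriptors to spare.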
>>
>>
>> Thanks again,
>>
>>
>> Zach
>>
>>
>>
>> On 10/23/07, Dave Cheney <dave at cheney.net> wrote:
>> >
>> > Are you using a web server in front of your mongrels? It should be
>> > picking up the page-cached file before even considering handing the
>> > request to a mongrel.
>> >
>> > Cheers
>> >
>> >
>> > Dave
>> >
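One way to check Dave's point from the shell is to confirm which page-cache file the front end ought to serve for a given URL. The sketch assumes Rails' default page caching layout (files under public/ with an .html suffix, "/" cached as index.html) and request paths without their own extension; cached_page_for and its arguments are hypothetical.

```shell
# Hypothetical helper: print the page-cache file for a request path,
# or return 1 if no cached copy exists.
cached_page_for() {
    rails_root=$1
    path=${2%/}                     # drop a trailing slash
    [ -z "$path" ] && path=/index   # "/" is cached as index.html
    file="$rails_root/public$path.html"
    if [ -f "$file" ]; then
        echo "$file"
    else
        return 1
    fi
}
```

If this finds a file for a URL that still shows up in the mongrel's strace, the front end's rewrite or static-file rule may be the place to look.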
>> > On 24/10/2007, at 11:30 AM, Zachary Powell wrote:
>> >
>> > close(5)                                = 0
>> > close(5)                                = -1 EBADF (Bad file descriptor)
>> > read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
>> > close(5)                                = 0
>> > close(5)                                = -1 EBADF (Bad file descriptor)
>> >
>> >
>> > The file it's trying to get is page-cached, and it exists and is fine
>> > (I can even go to the URL while this is going on).
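The trace shows descriptor 5 being closed successfully and then closed a second time, which fails with EBADF. A rough awk filter for that pattern in saved or live strace output is sketched below. It assumes plain strace lines without "[pid N]" prefixes (no -f) and treats an open, accept, socket or dup returning a number as making that descriptor live again; find_double_closes is a hypothetical name.

```shell
# Hypothetical filter: read strace output on stdin and report each
# close(N) that fails with EBADF right after an earlier successful close(N).
find_double_closes() {
    awk '
        # A syscall that hands out a descriptor makes that number live again.
        /^(open|accept|socket|dup|dup2)\(/ && $NF ~ /^[0-9]+$/ { closed[$NF] = 0 }
        /^close\([0-9]+\)/ {
            fd = $0
            sub(/^close\(/, "", fd)
            sub(/\).*/, "", fd)
            if ($0 ~ /= 0$/) {
                closed[fd] = 1
            } else if ($0 ~ /EBADF/ && closed[fd]) {
                print "fd " fd " closed twice (line " NR ")"
                closed[fd] = 0
            }
        }
    '
}
```

Attached to a live process this might be run as strace -p 27787 2>&1 | find_double_closes, since strace writes its trace to stderr.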
>> >
>


--------------------------------------------------------------------------------


> _______________________________________________
> Mongrel-users mailing list
> Mongrel-users at rubyforge.org
> http://rubyforge.org/mailman/listinfo/mongrel-users 


