Rainbows! + EventMachine + Sinatra::Synchrony == pegged CPU when idle?

Eric Wong normalperson at yhbt.net
Tue Jun 19 17:54:09 UTC 2012

Ilya Maykov <ivmaykov at gmail.com> wrote:
> Hi all,
> We're using Rainbows + EventMachine + Sinatra::Synchrony to run a
> fleet of RESTful web servers backed by a Cassandra cluster. We are
> using the EventMachineTransport to talk to Cassandra with an
> EM::Synchrony::ConnectionPool in each rainbows worker. We have a Storm
> cluster pushing a large stream of real-time data into the Rainbows
> fleet using HTTP PUT requests. We're running into some very strange
> performance issues and need help figuring out what's going on.

I'm not at all familiar with Storm nor Cassandra.  How big are the HTTP
PUT requests Rainbows! is getting?  Is Storm pipelining HTTP requests by
any chance?  That may not do well with the EM portion of Rainbows!

> Basically, when load is low, everything looks good. When we crank up
> the load, all of a sudden the CPU gets pegged, request latencies go
> waaaay up, and requests start timing out. Once this state is reached,
> the high CPU usage (4 rainbows worker processes at ~50% each on a
> 2-core machine = nearly full load) remains even if we completely shut
> off all incoming traffic.

> Taking a look with strace -p, it looks like
> the rainbows processes are writing ascii NUL characters to file
> descriptor 7 (which is a FIFO) as fast as the kernel will let them. My
> guess is that the worker is trying to communicate with the rainbows
> master process via the FIFO.

No, rainbows doesn't have code to send "\0" to the master.  I
don't think EM does, either, maybe some other library you're

Which version of Ruby is this?  Try adding "-f" to follow threads
for a worker.

> Not sure what is triggering this
> behavior, but would like to know if anyone else has ever seen
> something like this. This thread sounded like it could've been a
> similar issue, but died out without any conclusion:
> http://rubyforge.org/pipermail/rainbows-talk/2012-April/000345.html
> Some details about the setup:
> 6-node cassandra cluster
> 3 nodes running rainbows web servers
> 4 rainbows workers per node
> max of 50 cassandra connections per rainbows worker
> rainbows.conf has:
> Rainbows! do
>   use :EventMachine
>   worker_connections 50
>   keepalive_requests 1000
>   keepalive_timeout  10
> end

Do you have preload_app set to true anywhere?  (Try leaving it as false
(the default))

Can you also try "keepalive_timeout 0" to disable keepalive?  (EM
handles it internally, but I'm not sure how well)

> So, each rainbows node can handle 4 * 50 = 200 simultaneous connections
> 12 Storm worker processes writing to the rainbows web servers
> each Storm worker has max of 10 connections open to each of the 3 rainbows nodes
> So, each rainbows node has 12 * 10 = 120 incoming connections from Storm.
> Have been playing around with the numbers, the bug (assuming it is a
> bug) seems to be easier to trigger when I increase the number of
> incoming connections (from Storm workers), even if they are a lot less
> than the rainbows servers can take (60-70% of the max connections is
> usually enough). The bug is also easier to trigger when we increase
> the volume of data we're pushing through Storm - hundreds or thousands
> of requests per minute, no bug - hundreds of thousands of requests per
> minute, yes bug. Cassandra is not the issue, it can easily take the
> write load we're generating and is basically idle.
> Any help in figuring this out would be greatly appreciated. Thanks,

Try my suggestions above.

I would also search your libs/gems for what's writing "\0" since I don't
think it's Rainbows!...

More information about the rainbows-talk mailing list