[Backgroundrb-devel] Problems sending large results with backgroundrb

Mike Evans mike at metaswitch.com
Tue May 20 15:30:51 EDT 2008


I'm working on an application that does extensive database searching.
These searches can take a long time, so we have been moving them into a
backgroundrb worker task so we can provide a sexy AJAX progress bar and
populate the search results as they become available.  All of this
works fine until the search results get sufficiently large, at which
point we start to hit exceptions in backgroundrb (most likely in the
packet layer).  We are using packet-0.5.1 and backgroundrb from the
latest svn mirror.
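
For context, the worker looks roughly like this (the names and the
perform_search helper are illustrative, and register_status is our
assumption about the status API - a sketch of the pattern, not our real
code):

    class SearchWorker < BackgrounDRb::MetaWorker
      set_worker_name :search_worker

      # Long-running database search; the results array is what ends up
      # marshalled back over the packet connection, which is where the
      # large payloads come from.
      def run_search(args)
        results = perform_search(args)  # stands in for the real queries
        register_status(results)        # assumed status API, see above
      end
    end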

We have found and fixed one problem in the packet sender.  This is
triggered when the non-blocking send in NbioHelper::send_once cannot
send the entire buffer, resulting in an exception in the line 

      write_scheduled[fileno] ||= connections[fileno].instance

in Core::schedule_write because connections[fileno] is nil.  I can't
claim to fully understand the code, but I think there are two problems
here.

The main issue seems to be that when Core::handle_write_event calls
write_and_schedule to schedule the write, it doesn't clear out
internal_scheduled_write[fileno].  It looks like the code is expecting
the cancel_write call at the end of write_and_schedule to clear it out,
but this doesn't happen if there is enough queued data to cause the
non-blocking write to only partially succeed again.  In this case,
Core::schedule_write is called again, and because
internal_scheduled_write[fileno] has not been cleared out, the code drops
through to the second if test, then hits the above exception.  We fixed
this by adding the line

	internal_scheduled_write.delete(fileno)

immediately before the call to write_and_schedule in
Core::handle_write_event.
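
In context, the patched method looks roughly like this (the body is
paraphrased from memory, so treat it as a sketch of the change rather
than the exact packet source):

    def handle_write_event(fileno)
      # New line: clear the stale entry so that a second partial write
      # re-enters the scheduling path cleanly instead of falling through
      # to the nil dereference described above.
      internal_scheduled_write.delete(fileno)
      write_and_schedule(fileno)
    end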

The secondary issue is that the connections[fileno] structure is never
populated for this connection - my guess is that this is because it is
an internal socket rather than a network socket, but I can't be sure.
We changed the second if test in Core::schedule_write to

      elsif write_scheduled[fileno].nil? && !connections[fileno].nil?

to guard against this, but we are not sure whether this is the right fix.
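
For clarity, here is the changed test in context (again paraphrased -
the body of the first branch is elided because we don't fully
understand it):

    def schedule_write(fileno)
      if internal_scheduled_write[fileno].nil?
        # ... first branch, unchanged ...
      elsif write_scheduled[fileno].nil? && !connections[fileno].nil?
        # Guard added: only dereference connections[fileno] when the
        # connection map actually has an entry for this descriptor.
        write_scheduled[fileno] ||= connections[fileno].instance
      end
    end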

We are now hitting problems in the Packet::MetaPimp module receiving
the data, usually an exception from the Marshal.load call in
MetaPimp::receive_data.  We suspect the packet code is corrupting the
data somewhere, probably because we are sending such large arrays of
results (the repro I am working on at the moment marshals over 200k of
data).  We have been trying to add extra diagnostics so we can see what
is happening, but puts statements in the code only seem to produce
output from the end of the connection that hits the exception, and so
far our attempts to make logger objects available throughout the code
have failed.  We therefore thought we would ask for help - either to
find out whether this is a known problem, or whether there is a
recommended way to add diagnostics to the packet code.
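
As an example of the kind of instrumentation we have been trying to
add, here is a rough sketch of what we would like to wrap around the
Marshal.load call (the helper and the dump path are ours, not part of
packet):

    # Hypothetical wrapper, not part of packet: dump a payload that
    # fails to unmarshal so the raw bytes can be inspected offline.
    def load_with_dump(data)
      Marshal.load(data)
    rescue TypeError, ArgumentError => e
      File.open("/tmp/packet_bad_payload.bin", "ab") do |f|
        f.write("---- #{Time.now} #{e.class}: #{e.message} ----\n")
        f.write(data)
      end
      raise
    end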

I'm also open to ideas as to better ways to solve the problem!

Thanks in advance,

Mike


