[Backgroundrb-devel] Using Backgroundrb?

Julien Genestoux julien.genestoux at gmail.com
Mon Apr 28 16:43:17 EDT 2008


Thanks a lot for this very helpful answer.
I implemented a solution very similar to yours and it runs, but I have
two big problems.

The first one is "throughput".

If I have a periodic timer of 1 minute, I can only parse 20 feeds (the
number of threads) per minute, which comes to 1,200 per hour, since I
want to parse each feed at least once every hour. The problem is that I
really need to be able to parse at least 10 times that number of
feeds... and probably closer to 100k! What if I increase the number of
threads? Will I be able to parse more feeds?
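
For reference, here is the knob I was planning to turn... just a
sketch, assuming the pool_size class-level method Paul mentions below
is the right one:

class ParserWorker < BackgrounDRb::MetaWorker
  set_worker_name :parser_worker
  pool_size 100  # reportedly defaults to 20; untested at this size
end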


The second one is actually a lot worse. I had my system running for a
little more than a day... without monitoring it, and well, this morning
everything was "down". I ran "ps aux" and here is what I got:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     21697  0.0  0.8  32524 15620 ?        D    Apr27   0:13 ruby /mnt/app/current/script/backgroundrb start -e production
root     21698  0.0  0.2  32504  4736 ?        D    Apr27   0:08 ruby log_worker
root     21699  1.1 90.5 2170872 1576364 ?     D    Apr27  25:58 ruby parser_worker


As you can see, my parser_worker is consuming a little over 1.5 GB of
RAM: way too much ;) It seems that variables are never released in my
worker? Any idea what's wrong?
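
In case it helps, here is a purely hypothetical sketch of the kind of
leak I suspect (@parsed_entries is made up, not my actual code):
anything that accumulates in long-lived worker state never gets
garbage-collected between timer runs:

def parse_feeds
  @parsed_entries ||= []             # lives as long as the worker process
  Feed.find_feeds_to_process.each do |feed|
    thread_pool.defer do
      @parsed_entries << feed.parse  # grows on every run, never cleared
    end
  end
end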

Thanks a lot once again for your help!

Best,







On 4/25/08, Stevie Clifton <stevie at slowbicycle.com> wrote:
> Hey Julien,
>
>  It sounds like you are planning on using one "long running" feed
>  parsing loop with a do...while. That is exactly the sort of thing you
>  want to avoid in new bdrb, especially if you know you want to do
>  something at discrete time periods... it goes completely against the
>  evented (Twisted-style) paradigm. After thinking about it for a bit, I
>  would recommend setting just one periodic_timer that fires every
>  minute, and then determining in your parse_feeds method which feeds
>  need to be parsed. If I were you, I wouldn't use last_updated to
>  decide when to parse your feeds... it adds unnecessary complexity to
>  your system. You can of course save that value for reference, but it's
>  not necessary for your requirements.
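>
>  Concretely, a sketch of the setup I mean (assuming a stock MetaWorker
>  and nothing else):
>
>  class ParserWorker < BackgrounDRb::MetaWorker
>    set_worker_name :parser_worker
>
>    def create(args = nil)
>      # one timer for the whole worker; parse_feeds decides what is due
>      add_periodic_timer(60) { parse_feeds }
>    end
>  end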
>
>  In your db you could have a field for every feed called "interval"
>  that determines the minute intervals at which to parse the feed. Then
>  every minute when parse_feeds gets called, you could parse every feed
>  with an interval of "1", and then determine, based on the current
>  minute in the hour, whether or not to parse the 15, 30, or 60 minute
>  feeds. And you'll of course want to use thread_pool.defer. So, using
>  Paul's code as a starting point, something like this:
>
>  def parse_feeds
>    feeds = Feed.find_feeds_to_process
>    feeds.each do |feed|
>      # hand each feed to the worker's thread pool so parses run concurrently
>      thread_pool.defer do
>        feed.parse
>      end
>    end
>  end
>
>
>  class Feed
>    def self.find_feeds_to_process
>      feeds = []
>      [1, 15, 30, 60].each do |interval|
>        # an interval's feeds are due when the current minute divides evenly
>        feeds.concat(Feed.find_all_by_interval(interval)) if Time.now.min % interval == 0
>      end
>      feeds
>    end
>
>    def parse
>      # parsing code
>    end
>  end
>
>  On my way home yesterday I thought of another sexy addition. In the
>  above code, you know that you'll be parsing _every_ feed in your db on
>  the hour, which isn't very efficient. If possible, you want an even
>  parsing distribution throughout the hour so you're not getting
>  hammered. You could add a pretty simple heuristic that gives you a
>  relatively even distribution across the hour by using the hash of the
>  feed url. Along with the url and the interval, save an "offset" value
>  like in this example:
>
>  feed = Feed.new
>  feed.url = 'my_feed_url'
>  feed.interval = 15
>  feed.offset = feed.url.hash % 60
>  feed.save
>
>  Then in find_feeds_to_process, you can do this (untested):
>
>  # select any feed whose offset lines up with the current minute,
>  # modulo the feed's own interval
>  def self.find_feeds_to_process
>    Feed.find(:all).select do |feed|
>      feed.offset % feed.interval == Time.now.min % feed.interval
>    end
>  end
>
>  Doing a Feed.find(:all) is probably not the best idea if you have a
>  ton of records, so you might want to do multiple db finds to get the
>  same results.
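>
>  For example, something along these lines (also untested) pushes the
>  offset check down into the db instead of loading every record:
>
>  def self.find_feeds_to_process
>    [1, 15, 30, 60].map do |interval|
>      # backticks because interval and offset are reserved words in
>      # MySQL (assuming MySQL here)
>      Feed.find(:all, :conditions =>
>        ["`interval` = ? AND `offset` % ? = ?",
>         interval, interval, Time.now.min % interval])
>    end.flatten
>  end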
>
>  stevie
>
>
>  On Wed, Apr 23, 2008 at 5:46 PM, Julien Genestoux
>
> <julien.genestoux at gmail.com> wrote:
>  > Thanks guys... that's a ton of info! I am definitely going to use the
>  >  thread_pool... as soon as I can find the documentation ;D
>  >
>  >  1- For each feed, I define a "frequency" (every minute, every 30
>  >  minutes, every hour...) that is updated every time I parse the feed:
>  >  if the parser returns "new" elements, I increase the frequency (from
>  >  once per hour to once per 30 min.); if not, I decrease it...
>  >
>  >  2- I also have a "last_update" field which records when the feed
>  >  was last parsed.
>  >
>  >  3- With 1 & 2, I know how "late" I am in parsing a feed... so when I
>  >  choose my next feed to parse, I always choose the one that is the
>  >  most "late".
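>  >
>  >  In code, the "lateness" I mean is roughly this (just a sketch of
>  >  the scheme above; frequency is in minutes, last_update is a Time):
>  >
>  >  class Feed
>  >    def lateness
>  >      # seconds elapsed since the feed was due to be parsed again
>  >      Time.now - (last_update + frequency * 60)
>  >    end
>  >  end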
>  >
>  >  I am not sure if Stevie's approach of having multiple tasks for the
>  >  worker applies here. Actually, I am not even scheduling my worker: I
>  >  just launch it once, and parse_feeds runs forever (while true
>  >  do... end).
>  >
>  >  Also, if I understand Paul's code correctly, his approach always
>  >  keeps my worker busy, but it doesn't take into account the
>  >  "lateness" of my feeds.
>  >
>  >
>  >  My idea would be to add/remove workers according to "how late" I am
>  >  in parsing feeds.
>  >  If the latest feed is late by more than 10 minutes, I would add one
>  >  worker... and if the latest feed is late by less than 5 minutes, I
>  >  would remove one worker.
>  >
>  >  Does this approach make sense to you?
>  >
>  >  Thanks a lot for your help guys...
>  >
>  >
>  >
>  >
>  >  On 4/23/08, Paul Kmiec <paul.kmiec at appfolio.com> wrote:
>  >  > You can use the built-in thread pool to process more than one feed within
>  >  > the same worker. So within the worker, you'd do,
>  >  >
>  >  > def parse_feeds
>  >  >   loop do
>  >  >     feed = Feed.find_feed_to_process
>  >  >     break if feed.nil?   # stop once nothing is due
>  >  >     thread_pool.defer do
>  >  >        feed.parse
>  >  >     end
>  >  >   end
>  >  > end
>  >  >
>  >  > I think the default pool size is 20. You can control the size of
>  >  > the thread pool using a class-level method; as I recall it is
>  >  >
>  >  > pool_size x
>  >  >
>  >  > Paul
>  >  >
>  >  >
>  >  >  On Wed, Apr 23, 2008 at 7:30 AM, Julien Genestoux
>  >  > <julien.genestoux at gmail.com> wrote:
>  >  > > Thanks Adam,
>  >  > >
>  >  > > That sounded weird to me as well to have one worker for each feed...
>  >  > > However, if I only have one worker, that also means that I am
>  >  > > parsing only one feed at any moment. An option might be to have a
>  >  > > few workers (depending on the number of feeds) that parse feeds
>  >  > > concurrently?
>  >  > >
>  >  > > If I only have one worker, what do you think should be the winning
>  >  > > strategy to choose the "right" feed to parse? Obviously some feeds
>  >  > > need to be parsed once every few minutes, while others might not
>  >  > > need to be parsed more than once an hour...
>  >  > >
>  >  > > Any idea/tip on this?
>  >  > >
>  >  > >
>  >  > > On 4/23/08, Adam Williams <adam at thewilliams.ws> wrote:
>  >  > > > On Apr 23, 2008, at 1:07 AM, Julien Genestoux wrote:
>  >  > > >
>  >  > > >  > I still have a few questions: should I have one worker for each feed
>  >  > > >  > that is called periodically (add_periodic_timer) or rather one single
>  >  > > >  > worker that calls every feed one by one?
>  >  > > >  >
>  >  > > >  > What is the best solution, performance-wise?
>  >  > > >
>  >  > > >
>  >  > > > Good question... I don't suppose I know exactly. I would start by
>  >  > > >  processing all the feeds in one worker invocation - that is what I
>  >  > > >  have done for sending an unknown amount of email. It just seems wrong
>  >  > > >  to me to invoke a worker for one email at a time.
>  >  > > >
>  >  > > > The right answer likely lies in understanding the whole
>  >  > > >  MasterWorker, Packet::Reactor / handler_instance.ask_work bits
>  >  > > >  of the puzzle...
>  >  > > >
>  >  > > >
>  >  > > >     adam
>  >  > > >


-- 
Julien Genestoux
julien.genestoux at gmail.com
http://www.ouvre-boite.com
+1 (415) 254 7340
+33 (0)8 70 44 76 29

