[Backgroundrb-devel] Using Backgroundrb?

Stevie Clifton stevie at slowbicycle.com
Fri Apr 25 10:54:13 EDT 2008


Hey Julien,

It sounds like you are planning on using one "long running" feed
parsing loop with a do...while.  This is exactly the sort of thing you
want to avoid in the new bdrb, especially if you know you want to do
something at discrete time periods--it totally goes against the
twisted paradigm.  After thinking about it for a bit, I would
recommend setting just one periodic_timer that fires every minute, and
then determining in your parse_feeds method which feeds need to be parsed.
If I were you, I wouldn't use last_updated to determine when to parse
your feeds -- it adds unnecessary complexity to your system.  You can
of course save that value for reference, but it's not necessary for
your requirements.

In your db you could have a field on every feed called "interval" that
determines, in minutes, how often the feed should be parsed.  Then every
minute when parse_feeds gets called, you could parse every feed with an
interval of "1", and then determine based on the current minute in the
hour whether or not to also parse the 15, 30, or 60 minute feeds.
And you'll of course want to use thread_pool.defer.  So, using Paul's
code as a starting point, something like this:

def parse_feeds
  feeds = Feed.find_feeds_to_process
  feeds.each do |feed|
    thread_pool.defer do
      feed.parse
    end
  end
end

class Feed
  def self.find_feeds_to_process
    feeds = []
    [1, 15, 30, 60].each do |interval|
      feeds.concat(Feed.find_all_by_interval(interval)) if Time.now.min % interval == 0
    end
    feeds
  end

  def parse
    # parsing code
  end
end
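For context, that single once-a-minute timer would be registered in the
worker's create method.  A real worker inherits from
BackgrounDRb::MetaWorker, which provides add_periodic_timer; the tiny
stand-in class below exists only so this sketch can run outside the
framework, and the worker/method names are just illustrative:

```ruby
# Minimal stand-in for BackgrounDRb::MetaWorker so the sketch runs
# standalone; the real class provides add_periodic_timer itself.
class MetaWorkerStub
  attr_reader :timers

  def add_periodic_timer(seconds, &block)
    (@timers ||= []) << [seconds, block]
  end
end

class FeedWorker < MetaWorkerStub
  def create(args = nil)
    # one timer for the whole worker: fire parse_feeds every 60 seconds
    add_periodic_timer(60) { parse_feeds }
  end

  def parse_feeds
    # would call Feed.find_feeds_to_process and hand each feed
    # to thread_pool.defer, as in the code above
  end
end

worker = FeedWorker.new
worker.create
```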

On my way home yesterday I thought of another sexy addition you could
add to this.  In the above code, you know that you'll be parsing
_every_ feed in your db on the hour, which isn't a very efficient
setup.  If possible, you want to set it up so that you have even
parsing distribution throughout the hour so you're not getting
hammered.  You could add a pretty simple heuristic that would give you
a relatively even distribution across the hour by using the hash of
the feed url.  Along with the url and the interval, save an "offset"
value like this example:

feed = Feed.new
feed.url = 'my_feed_url'
feed.interval = 15
feed.offset = feed.url.hash % 60
feed.save

Then in find_feeds_to_process, you can do this (untested):

# the select returns any feed whose offset lines up with the current
# minute, modulo that feed's own interval
def self.find_feeds_to_process
  minute = Time.now.min
  Feed.find(:all).select do |feed|
    feed.offset % feed.interval == minute % feed.interval
  end
end

Doing a Feed.find(:all) is probably not the best idea if you have a
ton of records, so you might want to do multiple db finds to get the
same results.
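To make the offset heuristic concrete, here is a self-contained sketch
in plain Ruby (no ActiveRecord; the Struct and URLs are made up for
illustration).  It also shows why the heuristic spreads load: over any
15-minute window, each interval-15 feed comes due exactly once:

```ruby
FeedRecord = Struct.new(:url, :interval, :offset)

# Eight hypothetical 15-minute feeds with hash-derived offsets,
# as in the feed.offset = feed.url.hash % 60 example above.
feeds = (1..8).map do |i|
  url = "http://example.com/feed#{i}"
  FeedRecord.new(url, 15, url.hash % 60)
end

# A feed is due when its offset lines up with the current minute,
# modulo the feed's own interval.
def due_feeds(feeds, minute)
  feeds.select { |f| f.offset % f.interval == minute % f.interval }
end

# Count how many feeds come due at each minute of a 15-minute window;
# the total is exactly the number of feeds, spread across the window
# instead of clustering on the hour.
counts = (0...15).map { |m| due_feeds(feeds, m).length }
```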

stevie


On Wed, Apr 23, 2008 at 5:46 PM, Julien Genestoux
<julien.genestoux at gmail.com> wrote:
> Thanks guys... that's a ton of info! I am definitely gonna use the
>  thread_pool... as soon as I can find the documentation ;D
>
>  1- For each feed, I define a "frequency" (every minute, every hour,
>  every 30 minutes...) that is updated every time I parse the
>  feed: if the parser returns a "new" element, I increase the
>  frequency (from once per hour to once per 30 min.); if not, I
>  decrease the frequency...
>
>  2- I also have a "last_update" field which remembers the time when the
>  feed was parsed for the last time.
>
>  3- With 1 & 2, I know how "late" I am to parse a feed... so when I
>  choose my next feed to parse, I am always choosing the one that is the
>  most "late"
>
>  I am not sure if Stevie's approach of having multiple tasks for the
>  worker applies here. Actually, I am not even scheduling my worker; I
>  am just launching it once, and parse_feeds runs forever (while
>  true do... end)
>
>  Also, if I understand well Paul's code, his approach allows my worker
>  to be more efficient always, but doesn't take into account the
>  "lateness" of my feeds.
>
>
>  My idea would be to add/remove workers according to "how late" I am in
>  parsing feeds.
>  If my latest feed is late by more than 10 min, I would add one
>  worker... and if my latest feed is late by less than 5 minutes, I
>  would remove one worker
>
>  Does this approach make sense to you?
>
>  Thanks a lot for your help guys...
>
>
>
>
>  On 4/23/08, Paul Kmiec <paul.kmiec at appfolio.com> wrote:
>  > You can use the built-in thread pool to process more than one feed within
>  > the same worker. So within the worker, you'd do,
>  >
>  > def parse_feeds
>  >   loop do
>  >     feed = Feed.find_feed_to_process
>  >     thread_pool.defer do
>  >        feed.parse
>  >     end
>  >   end
>  > end
>  >
>  > I think the default pool size is 20. You can control the size of the thread
>  > pool using a class level method, as I recall it is
>  >
>  > pool_size x
>  >
>  > Paul
>  >
>  >
>  >  On Wed, Apr 23, 2008 at 7:30 AM, Julien Genestoux
>  > <julien.genestoux at gmail.com> wrote:
>  > > Thanks Adam,
>  > >
>  > > That sounded weird to me as well, to have one worker for each feed...
>  > > However, if I only have one worker, that also means that I am parsing
>  > > only one feed at any moment. An option is maybe to have a few workers
>  > > (depending on the number of feeds) that parse feeds concurrently?
>  > >
>  > > If I only have one worker, what do you think should be the
>  > > winning strategy to choose the "right" feed to parse? Obviously some
>  > > feeds need to be parsed once every few minutes, while some others might
>  > > not need to be parsed more than every hour...
>  > >
>  > > Any idea/tip on this?
>  > >
>  > >
>  > >
>  > >
>  > >
>  > >
>  > >
>  > > On 4/23/08, Adam Williams <adam at thewilliams.ws> wrote:
>  > > > On Apr 23, 2008, at 1:07 AM, Julien Genestoux wrote:
>  > > >
>  > > >  > I still have a few questions: should I have one worker for each feed
>  > > >  > that is called periodically (add_periodic_timer) or rather one single
>  > > >  > worker that calls every feed one by one?
>  > > >  >
>  > > >  > What is the best solution, performance-wise?
>  > > >
>  > > >
>  > > > Good question... I don't suppose I know exactly. I would start by
>  > > >  processing all the feeds in one worker invocation - that is what I
>  > > >  have done for sending an unknown amount of email. It just seems wrong
>  > > >  to me to invoke a worker for one email at a time.
>  > > >
>  > > >  The right answer likely lies in understanding the whole MasterWorker,
>  > > >  Packet::Reactor/handler_instance.ask_work bits of the puzzle...
>  > > >
>  > > >
>  > > >     adam
>  > > >
>  > > > _______________________________________________
>  > > >  Backgroundrb-devel mailing list
>  > > >  Backgroundrb-devel at rubyforge.org
>  > > >  http://rubyforge.org/mailman/listinfo/backgroundrb-devel
>  > >
>  > >
>  > >
>  > > --
>  > > Julien Genestoux
>  > > julien.genestoux at gmail.com
>  > > http://www.ouvre-boite.com
>  > > +1 (415) 254 7340
>  > > +33 (0)8 70 44 76 29
>  >
>  >
>
>

