[Rubygems-developers] Network traffic conservation strategies

Gavin Sinclair gsinclair at soyabean.com.au
Sun Mar 28 18:04:29 EST 2004


>> BTW the thing I had in mind was local caching of remote specification
>> lists (which are called Gem::Cache, which I don't understand).  This
>> would speed up remote listing and searching a great deal, and would
>> become especially important as the number of remote gems and the
>> number of users grow.
>>
>> It would require some thought about synchronising the remote and local
>> versions.
> 
> Well, we can keep the date of the retrieved yaml file from the source
> (gems.rubyforge.org) and then perform a HEAD instead of a GET on that
> file, using what we have if it's up to date.  That would save the
> download.

That's true.  Pro: simplicity (and therefore hopefully reliability).
Con: you download a 100K file when there's only a 1K change.  (And that
file only gets larger over time, not smaller :)
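
For concreteness, here is a minimal Ruby sketch of that HEAD-before-GET
check.  The host, the '/yaml' path and the local file name are only
assumptions about how the source publishes its spec list:

    require 'net/http'
    require 'time'

    # Sketch only: fetch the spec list, skipping the GET when the
    # local copy is at least as new as the server's Last-Modified.
    def fetch_spec_list(host = 'gems.rubyforge.org', path = '/yaml',
                        local = 'source_cache.yaml')
      Net::HTTP.start(host) do |http|
        lm          = http.head(path)['last-modified']
        remote_time = lm ? Time.httpdate(lm) : Time.now
        local_time  = File.exist?(local) ? File.mtime(local) : Time.at(0)

        # Local copy is current -- the HEAD just saved us the download.
        return File.read(local) if remote_time <= local_time

        body = http.get(path).body
        File.open(local, 'w') { |f| f.write(body) }
        body
      end
    end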

I don't know whether it's *worth* making a greater effort than that to
conserve network traffic, but I've made some notes on a possible
implementation that's intended to improve client *and* server
performance.

My idea would be to run a DRb service from 'gem_server' (in addition
to its normal WEBrick activities).  Here is a view, from the client's
POV, of updating the local cache (a rough Ruby sketch follows the list):

  - client calculates MD5 hash of local cache
  - client requests hash of server's specification list
  - if they match, then we're done
  - if not, then
    - client sends server array of gem names that it has
    - server returns a specification list containing the missing
      pieces
    - client updates its cache
    - now, as a sanity check, the client
      - calculates its hash again
      - compares with the hash value the server returned earlier
      - it should match, but if it doesn't, then the client's cache
        is stuffed up for some reason, so it asks the server for a
        complete specification list and replaces its cache
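
A rough Ruby sketch of that client-side flow follows.  The DRb URI and
the server method names (spec_list_hash, missing_specs, all_specs) are
assumptions, as is hashing the cache in a canonical (sorted) order so
that the client's and server's MD5 values are actually comparable:

    require 'drb'
    require 'digest/md5'
    require 'yaml'

    CACHE_FILE = File.expand_path('~/.gem/source_cache.yaml')

    # Hash the cache in a canonical (sorted) form; the server has to
    # serialise its list the same way for the comparison to mean anything.
    def cache_hash(cache)
      Digest::MD5.hexdigest(cache.sort.to_yaml)
    end

    def update_local_cache(uri = 'druby://gems.rubyforge.org:8079')
      DRb.start_service
      server = DRbObject.new_with_uri(uri)

      cache = File.exist?(CACHE_FILE) ? YAML.load(File.read(CACHE_FILE)) : {}
      remote_hash = server.spec_list_hash

      return cache if cache_hash(cache) == remote_hash   # nothing to do

      # Only fetch the specs we don't already have.
      server.missing_specs(cache.keys).each { |name, spec| cache[name] = spec }

      # Sanity check: if we still don't match, the cache is stuffed up,
      # so fall back to a complete refresh.
      cache = server.all_specs unless cache_hash(cache) == remote_hash

      File.open(CACHE_FILE, 'w') { |f| f.write(cache.to_yaml) }
      cache
    end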

What do you think of that?  Does it seem like it would perform
reasonably well?  Would it be easy enough to implement it?  Are there
potential stuffups?  Would it be worth the effort?

From the server's point of view, it doesn't want to be calculating MD5
hashes every damn time a client asks for something.  Using DRb would
make caching quite easy.

Here's how the server would work (again, a rough sketch follows the list):

  - run a DRb service that can accept multiple simultaneous connections
  - server implements methods to:
    - return hash of its specification list
    - return missing specification objects (given array of gem names)
    - return all specification objects
  - for performance, the server caches everything (hash and spec list)
    - a separate thread runs every N minutes to rebuild the spec list
      and recalculate the cached hash
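
Sketched in Ruby, the server side might look roughly like the
following.  The class, its method names, the ten-minute refresh
interval and the build_spec_list helper are all placeholders for
whatever gem_server actually uses to assemble its Gem::Cache:

    require 'drb'
    require 'digest/md5'
    require 'yaml'

    class SpecListService
      REFRESH_INTERVAL = 600   # seconds between cache rebuilds

      def initialize
        rebuild
        # A background thread keeps the cached list and hash fresh, so
        # no individual request pays for recalculating them.
        Thread.new { loop { sleep REFRESH_INTERVAL; rebuild } }
      end

      # MD5 hash of the full specification list (precomputed, so cheap).
      def spec_list_hash
        @hash
      end

      # Specs the client is missing, given the gem names it already has.
      def missing_specs(client_names)
        @specs.reject { |name, _| client_names.include?(name) }
      end

      # The complete list, for first-time clients or corrupted caches.
      def all_specs
        @specs
      end

      private

      def rebuild
        specs  = build_spec_list                  # name => specification
        @hash  = Digest::MD5.hexdigest(specs.sort.to_yaml)
        @specs = specs
      end

      def build_spec_list
        # Placeholder: however gem_server builds its list of installed
        # gem specifications (the Gem::Cache mentioned earlier).
        {}
      end
    end

    DRb.start_service('druby://0.0.0.0:8079', SpecListService.new)
    DRb.thread.join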

I think this is easy enough to implement and should be pretty
reliable.  I'm certainly willing to give it a try.  The beauty of
running a DRb service is that caching is so simple, which means the
server can be hammered harder.

The motivation for all this is as follows: if the client has a cache
of all specifications, then:
 - all remote operations look like this:
   - update cache
   - operate on local list (except for actual gem download)
 - the user can do a pretty good "remote" search even when not
   connected to the internet (see the small sketch below)
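
As a small illustration, a "remote" search against that local cache
could be little more than this (reusing CACHE_FILE from the client
sketch above; matching on spec names is just an example):

    # Works offline, since it only reads the cached specification list.
    def remote_search(pattern)
      cache = File.exist?(CACHE_FILE) ? YAML.load(File.read(CACHE_FILE)) : {}
      cache.values.select { |spec| spec.name =~ /#{pattern}/i }
    end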

Looking forward to comments.

Cheers,
Gavin


# In a nice twist, implementing this sort of thing would have some
# bearing on what I'm doing at work, which involves remote
# synchronisation.




