[Rubygems-developers] Roadmap for version 0.8.12

Jim Weirich jim at weirichhouse.org
Mon Jul 18 01:20:28 EDT 2005

This is an outline of my plan for version 0.8.12.  The primary goal of
this release is to speed up the source index update for the gem
command and to decrease the load on the RubyForge gem server.

My idea covers two areas: (1) incremental updates, and (2) archiving
old gems.

(1) Incremental Updates

Currently the static gem server used by RubyForge has the following
directory structure:

* topdir
  * yaml (uncompressed source index)
  * yaml.Z (compressed source index)
  * gems
    * aaa.gem
    * bbb.gem
    * ...

The proposal would change this to:

* topdir
  * yaml (uncompressed source index)
  * yaml.Z (compressed source index)
  * quick
    * index
    * index.Z
    * aaa.gemspec
    * aaa.gemspec.Z
    * bbb.gemspec
    * bbb.gemspec.Z
    * ...
  * gems
    * aaa.gem
    * bbb.gem
    * ...

The main addition is the "quick" directory.  The quick/index file
contains a list of gems and an MD5 hash of the corresponding gemspec.
Something like this:

  builder-0.1.1 68673a832739659790fb02e4227d226f
  builder-1.0.0 9a85bbd956dafd5100de7b6f6cafe4ea
  builder-1.2.1 43cd2a1edceedebbb46fed45b29b596a
  builder-1.2.3 952bb54040e301b009d7fb646fff1f15
  flexmock-0.0.3 cc235f560a20c103807c299599317d8b
  rake-0.4.14 367df8d0cca063f3fb132dd00dac1a82
  rake- fcdf87575c74ca96ffe53061529f9388
  rake-0.4.15 1207708d1a0e1fb4979e8d6f8b9b45d1
  rake- 9709b57cc070895261e2d7f48252d3b3
  rake-0.5.3 19ecc210059f0e67421c41343006b449
  rake- 370ab5df156158417a0a05b0287dc337
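
A minimal sketch of how index lines in this format could be produced
and read back, assuming each gemspec is available as a string.  The
helper names quick_index_lines and parse_quick_index are hypothetical,
not part of generate_yaml_index.rb:

```ruby
require 'digest/md5'

# Build "full_name md5" lines from a Hash of { full_name => gemspec text }.
# Hypothetical helper; the real index generator may work differently.
def quick_index_lines(gemspecs)
  gemspecs.sort.map do |full_name, spec_text|
    "#{full_name} #{Digest::MD5.hexdigest(spec_text)}"
  end
end

# Parse index text back into a Hash of { full_name => md5 }.
# Lines that do not contain both fields are skipped.
def parse_quick_index(text)
  text.each_line.with_object({}) do |line, index|
    name, md5 = line.split
    index[name] = md5 if name && md5
  end
end
```

The client only needs the parsed Hash to compare against the hashes
it stored the last time it synced.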

The first thing the gem client will do is download the quick index.
If the quick index is not available, it will fall back to the current
source update algorithm.  This allows the new gem command to remain
compatible with servers running the old software.  In particular, if
we decide to /not/ update the gem_server command, it will still
continue to work.

Once the quick/index is downloaded, the gem command will audit the
list of known gems for the given source and determine two lists: (a)
the gems that are no longer available on the server, and (b) the gems
that are new or updated (different MD5 hashes).  The gemspecs in the
(a) list will just be dropped from the cached source list.  The
gemspecs in (b) will be downloaded from the quick directory and used
to incrementally update the cached source index.
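
The audit step can be sketched as follows, assuming both the cached
source list and the downloaded quick index are represented as Hashes
of { full_name => md5 }.  The name audit_source is hypothetical:

```ruby
# Compare the cached index against the freshly downloaded one.
# Returns [removed, changed]:
#   removed - gems in the cache that the server no longer lists  (list a)
#   changed - gems that are new or whose MD5 hash differs        (list b)
def audit_source(cached, remote)
  removed = cached.keys - remote.keys
  changed = remote.select { |name, md5| cached[name] != md5 }.keys
  [removed, changed]
end

cached = { "builder-1.0.0" => "9a85", "rake-0.4.14" => "367d" }
remote = { "rake-0.4.14" => "ffff", "rake-0.5.3" => "19ec" }
removed, changed = audit_source(cached, remote)
# removed holds builder-1.0.0; changed holds rake-0.4.14 and rake-0.5.3
```

Only the gemspecs named in the second list would need to be fetched
from the quick directory.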

The gemspecs available in the quick directory will be in "abbreviated"
format.  Abbreviated gemspecs don't contain the entire gemspec found
in the .gem file.  In particular, the 'files' field is empty.  This
makes the spec *much* smaller.  Since this copy of the gemspec is only
used for searches in the cache, omitting this data is not a problem.
In fact, the current software abbreviates the gemspec on the client
side to keep the size of the source cache smaller (we did this when we
were having problems loading large yaml files).  Abbreviating the
gemspec on the server side means we get the benefit of smaller and
quicker downloads as well.
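
To illustrate the size effect, here is a rough sketch that uses a
plain Hash as a stand-in for a gemspec.  That stand-in is an
assumption made for the example; the real abbreviation code works on
actual gemspec objects:

```ruby
require 'yaml'

# Return a copy of the spec with the 'files' list emptied.
# spec is a plain Hash standing in for a real gemspec (an assumption).
def abbreviate(spec)
  spec.merge('files' => [])
end

full = {
  'name'    => 'rake',
  'version' => '0.5.3',
  'files'   => (1..200).map { |i| "lib/file#{i}.rb" }
}
short = abbreviate(full)
# short.to_yaml is much smaller than full.to_yaml; full is left untouched.
```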

The net effect of this change is that instead of downloading a *huge*
yaml file every time someone updates a gem (which happens every day),
we only download a reasonably sized index file, plus the (abbreviated)
gemspecs that have changed since the last time we synced the source
index cache.

I haven't run the numbers yet on anticipated savings, but I'm thinking
that it could be quite significant.

Some open questions:

(A) I show storing both the index and index.Z, both the gemspec and
    gemspec.Z.  Is there any reason to store the non-compressed
    versions?

(B) If there are a *lot* of gems that are out of date, there may be a
    point where downloading a single, large yaml file may be more
    efficient than downloading a bunch of smaller files.  At that
    point we could switch to the current algorithm.
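
One possible way to express the switch in (B), purely as a sketch:
fall back when the fraction of out-of-date gems passes some
threshold.  The function name and the 0.5 default are assumptions
for illustration, not measured values:

```ruby
# Decide whether to fetch the full yaml file instead of individual
# gemspecs.  threshold is a guess and would need tuning against real
# download sizes.
def use_full_download?(changed_count, total_count, threshold = 0.5)
  return false if total_count.zero?
  changed_count.to_f / total_count > threshold
end
```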

I have a generate_yaml_index.rb file updated to generate the quick
index (not checked in yet, but will be soon).  I haven't updated the
client algorithm yet.

(2) Archiving Old Gems

Part of the problem is that once a gem is available on RubyForge, it is
always available on RubyForge.  The source index keeps expanding and
expanding.  If we don't have an archival policy, this will continue to
grow until we need to fix it again.

While it is cool that we have all Rails versions back to version 0.5 (or
whatever, I didn't actually look it up), I don't think anyone is
really interested in downloading it anymore.

Here's a proposal.  We update the generate_yaml_index.rb command to
move gems to an archive directory as long as both of the following
conditions are true:

* All gems uploaded in the last M months are directly available.
* At least min(N,V) versions of a gem are available (where V is total
  number of versions uploaded)

Of course, we can pick M and N to be whatever makes sense.  For
example M=12 and N=3 would keep the last three versions around, plus
any versions uploaded in the last year.  I imagine that M and N will
be command line arguments to generate_yaml_index.rb so that
others running their own servers can set their own policies.
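
The policy can be sketched for the versions of a single gem as
follows.  The GemVersion struct, the function name, and the choice
to rank "last N versions" by version number are all assumptions made
for the example; the upload date could come from either source
discussed in the open question below:

```ruby
require 'date'
require 'rubygems'

# Hypothetical record for one uploaded version of a gem.
GemVersion = Struct.new(:version, :uploaded_on)

# Split one gem's versions into [kept, archived].
# A version stays directly available if it is among the newest `keep`
# versions or was uploaded within the last `months` months.
def split_for_archive(versions, months:, keep:, today: Date.today)
  cutoff = today << months   # Date#<< steps back by whole months
  newest_first = versions.sort_by { |v| Gem::Version.new(v.version) }.reverse
  kept, archived = newest_first.each_with_index.partition do |v, i|
    i < keep || v.uploaded_on >= cutoff
  end
  [kept.map(&:first), archived.map(&:first)]
end
```

With fewer than N versions uploaded, every version falls inside the
first N positions, which gives the min(N,V) behavior.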

Gems that are archived will be moved from the gem directory to an
archive/gem directory.  The archive directory will have its own yaml
file and quick index (in the format described in part (1) above).
This means that gems moved to the archive are still available by using
the --source option on the command line.  For example, if rails-0.5 is
in the archive, I can still get it with the command:

  gem install --version=0.5 rails --source=http://gems.rubyforge.org/archive

By incorporating the archive logic in the generate_yaml_index.rb
command, it becomes very easy to manage a gem server.

Open questions:

(A) There is a date in the gemspec.  Should the algorithm use the
    gemspec date, or the date of the file in the file system?

(3) Unrelated Question:

I am considering renaming the generate_yaml_index.rb command to
gem_index (or maybe gem_server_index ... suggestions welcome).  This
brings the command in line with the convention that all gem-related
commands begin with the letters "gem".  And since it does more than just
generate yaml, a new name will reflect this better.  Any comments?

I see this as the last change for the 0.8.xxx series for RubyGems.  I
would like version 0.9.xxx to start addressing the differences in
behavior in the local and remote installers and to start handling
platform issues more intelligently.  I've got some thoughts on this
that I will write up separately.

Feedback on the above?

-- Jim Weirich     jim at weirichhouse.org    http://onestepback.org
"Beware of bugs in the above code; I have only proved it correct,
not tried it." -- Donald Knuth (in a memo to Peter van Emde Boas)
