[Rubygems-developers] Updating source index is slow

Eivind Eklund eivind at FreeBSD.org
Wed Nov 10 12:50:42 EST 2004

On Wed, Nov 10, 2004 at 05:11:42PM +0000, Hugh Sasse Staff Elec Eng wrote:
> On Wed, 10 Nov 2004, Eivind Eklund wrote:
> >On Wed, Nov 10, 2004 at 08:42:10AM -0500, Jim Weirich wrote:
> >>Do we use HEAD to get the time stamp of the file?  Then we would just 
> >>need to
> >>download it whenever it changes.
> >
> >You already do something quite a bit like this; a GET is started, and
> >from this GET you retrieve the Content-Length.  If this matches the cached
> >data, you use that; if not, you start a new GET and retrieve the file.
> I really don't like this algorithm.

The main problem is IMO the use of size as a discriminator.  The rest
isn't quite as bad as it sounds.
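
For concreteness, the size check amounts to roughly the following, in
the same open-uri style as the snippet quoted further down (an
illustrative sketch with made-up names, not the actual client code):

    require 'open-uri'

    # Compare the server's Content-Length against the size of the cached
    # index; the raise is only there to bail out of open-uri early,
    # mirroring the snippet quoted further down.
    def cached_index_current?(uri_str, cache_path)
      remote_size = nil
      begin
        open(uri_str, :content_length_proc =>
             lambda { |len| remote_size = len; raise "break" }) { |io| }
      rescue RuntimeError
        # expected: we aborted the transfer on purpose
      end
      File.exist?(cache_path) && File.size(cache_path) == remote_size
    end

The weakness is the comparison on the last line: nothing there
distinguishes two different indexes that happen to have the same size.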

> The program uses open-uri for
> most of its work, which is great for getting things going,
> proof-of-concept and so on, but for a scalable application I think
> we should aim to move to Net::HTTP and follow the protocol more
> closely. [1] Why?
> A GET is started.  Well, when the server responds with the content
> length it also responds with everything else as well, and in the
> case of a 200 response, that includes the whole of the resource. To
> the best of my knowledge (and because of my attempts to get this
> working with Rubric I've read it fairly recently) there is no way in
> the protocol to stop a GET in the middle.

Well, the protocol is layered.  There is no way to tell the server at
the HTTP layer; however, there is at the TCP layer.

> Indeed (indent munged):
>     open(uri_str, :proxy => @http_proxy, :content_length_proc =>
>     lambda {|t| size = t; raise "break"}) {|i| }
> doesn't tell the server to stop sending the contents.  If the server
> detects something has stopped, then whether it does so "in time" is
> rather like a race condition.  Running over a 56k modem this is
> rather likely to be too late.

No.  This is going to be blocked by Nagle's algorithm in the TCP stack.
The net result is that you get two three-way handshakes plus two or
three extra 1500-byte packets (assuming Ethernet MTU) plus three extra
round-trip delays during Nagle acceleration.

The problem would be with FAST networks, where Nagle would outrace the
process time slicing in the system, so the reset above would only
arrive after a bunch of data was already in the pipeline.
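
To make the TCP-layer stop concrete: with Net::HTTP you can read
Content-Length from the response headers and then simply drop the
connection instead of reading the body.  A rough sketch (hypothetical
helper, not the current client code):

    require 'net/http'

    # Peek at the index size from the response headers; if the cached
    # copy already matches, return before reading the body.  The return
    # unwinds out of Net::HTTP.start, which closes the connection, so
    # the stop happens at the TCP layer rather than in HTTP itself.
    def fetch_index_if_changed(host, path, cached_size)
      Net::HTTP.start(host) do |http|
        http.request_get(path) do |response|
          return nil if response['Content-Length'].to_i == cached_size
          return response.read_body
        end
      end
    end

Whether the server notices the dropped connection before it has pushed
a few more segments is exactly the Nagle/timeslicing question above.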

> Suppose the contents change, but the length doesn't. At present we
> would be unable to detect this.

Correct.  However, I believe the file presently grows monotonically
(because old versions are not removed), so this may not be an issue.

> "if not, you start a new GET and retrieve the file"  Then you get it
> again. Ouch.
> I think we should be using the head method, and the Etag,
> Last-Modified and any other applicable headers, which really
> necessitates using Net::HTTP.  Much more tedious to program, but
> much more courteous to the server('s owners).

Either that, or run an rsync implementation.  I think the latter would
be best, but it is more work.
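
For reference, the HEAD-based check is not much code with Net::HTTP
either; something along these lines (a sketch with illustrative names,
not an actual implementation):

    require 'net/http'

    # Issue a HEAD request and compare the validators against the values
    # remembered from the last successful download; only GET when they
    # differ.  No body is transferred for the check itself.
    def index_changed?(host, path, cached_etag, cached_mtime)
      Net::HTTP.start(host) do |http|
        res = http.head(path)
        res['ETag'] != cached_etag || res['Last-Modified'] != cached_mtime
      end
    end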

> >This is present from 0.8.0 and up; older clients will always download
> >the complete file.  I don't know your client distribution, but this may
> >be the primary source of the load described elsewhere in the thread.
> Getting it twice really doesn't help this.

Sure.  But it may be a minor issue; it's hard to tell.

> [1] Please note: I am really in favour of this project,

I'm in favour of RubyGems as long as RubyGems gets done right :-)

I especially see the commitment to becoming repackager friendly as
important, as repackaging is crucial for making software effectively
usable for many users on many of the relevant platforms.

> and
> think criticism that is intended to be constructive is a valid part of
> "first make it work, then make it work right, then make it fast".
> http://c2.com/cgi/wiki?MakeItWorkMakeItRightMakeItFast

One more thing around this:

Published interfaces and non-controlled data have an important influence
on how it's appropriate to think about things.  For instance, the
present RubyGems client versions will continue to load the server until
everybody's upgraded.  And the "File called .Z but is really .gz" API
bug (and that file is an API) will need to be supported for a long time.

My thinking changed quite a bit when I started thinking specifically
about "published" vs "non-published" interfaces.  It helped organize a
lot of stuff.  See Martin Fowler's quick comments on the same.

