[Rubygems-developers] Updating source index is slow

Hugh Sasse Staff Elec Eng hgs at dmu.ac.uk
Wed Nov 10 12:11:42 EST 2004

On Wed, 10 Nov 2004, Eivind Eklund wrote:

> On Wed, Nov 10, 2004 at 08:42:10AM -0500, Jim Weirich wrote:
>> Do we use HEAD to get the time stamp of the file?  Then we would just need to
>> download it whenever it changes.
> You already do something quite a bit like this; a GET is started, and
> from this GET you retrieve the Content-Length.  If this matches the cached
> data, you use that; if not, you start a new GET and retrieve the file.

I really don't like this algorithm.  The program uses open-uri for
most of its work, which is great for getting things going,
proof-of-concept and so on, but for a scalable application I think
we should aim to move to Net::HTTP and follow the protocol more
closely. [1] Why?

A GET is started.  Well, when the server responds with the content
length it also responds with everything else as well, and in the
case of a 200 response, that includes the whole of the resource. To
the best of my knowledge (and because of my attempts to get this
working with Rubric I've read the spec fairly recently) there is no
way in the protocol to stop a GET in the middle. Indeed (indent munged):

     open(uri_str, :proxy => @http_proxy, :content_length_proc =>
     lambda {|t| size = t; raise "break"}) {|i| }

doesn't tell the server to stop sending the contents.  If the server
detects something has stopped, then whether it does so "in time" is
rather like a race condition.  Running over a 56k modem this is
rather likely to be too late.

Suppose the contents change, but the length doesn't. At present we
would be unable to detect this.
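
To illustrate the point, here is a minimal sketch in which two versions
of an index have the same length but different contents.  The strings
are made up for the example; they are not real index data.  A length
comparison calls them identical, a digest comparison does not.

```ruby
require 'digest'

# Hypothetical cached and fresh index contents: same length, different data.
cached = "rake 0.4.8\n"
fresh  = "rake 0.4.9\n"

# The check described above: compare lengths only.
same_length = cached.bytesize == fresh.bytesize

# A content-based check: compare digests of the data itself.
same_digest = Digest::SHA256.hexdigest(cached) == Digest::SHA256.hexdigest(fresh)

puts "length check says unchanged: #{same_length}"
puts "digest check says unchanged: #{same_digest}"
```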

"if not, you start a new GET and retrieve the file"  Then you get it
again. Ouch.

I think we should be using the HEAD method, and the ETag,
Last-Modified and any other applicable headers, which really
necessitates using Net::HTTP.  Much more tedious to program, but
much more courteous to the server('s owners).
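
Roughly what I have in mind, as a sketch only and not a patch against
the current code: send a HEAD request carrying the validators saved
with the cached copy, and treat 304 Not Modified as "cache is good".
The method names (validation_request, changed?, index_changed?) are
invented for illustration, and treating every non-304 response as
"changed" is a simplification.

```ruby
require 'net/http'
require 'uri'

# Build a HEAD request that carries the validators saved with the cached
# copy. etag and last_modified are whatever the server sent last time
# (either may be nil if the server did not supply it).
def validation_request(uri, etag: nil, last_modified: nil)
  req = Net::HTTP::Head.new(uri.request_uri)
  req['If-None-Match'] = etag if etag
  req['If-Modified-Since'] = last_modified if last_modified
  req
end

# A 304 Not Modified means the cached copy is still good. Simplification:
# any other response (including errors) is treated as "fetch it again".
def changed?(response)
  !response.is_a?(Net::HTTPNotModified)
end

# Ask the server whether the index changed, without transferring the body.
def index_changed?(uri, etag: nil, last_modified: nil)
  req = validation_request(uri, etag: etag, last_modified: last_modified)
  res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
  changed?(res)
end
```

A caller would save the ETag and Last-Modified headers alongside the
cached file, pass them to index_changed? on the next run, and issue a
single GET only when it returns true.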

> This is present from 0.8.0 and up; older clients will always download
> the complete file.  I don't know your client distribution, but this may
> be the primary source of the load described elsewhere in the thread.

Getting it twice really doesn't help this.

> Eivind.


[1] Please note: I am really in favour of this project, and
think criticism that is intended to be constructive is a valid part of
"first make it work, then make it work right, then make it fast".


This approach (open-uri) is perfectly valid for early stages of the
project, but I think we need to move beyond it if downloads are
becoming a problem.
