[Rubygems-developers] Updating source index is slow

chad at chadfowler.com
Wed Nov 10 12:49:53 EST 2004

> On Wed, 10 Nov 2004, Eivind Eklund wrote:
>> On Wed, Nov 10, 2004 at 08:42:10AM -0500, Jim Weirich wrote:
>>> Do we use HEAD to get the time stamp of the file?  Then we would just
>>> need to download it whenever it changes.
>> You already do something quite a bit like this; a GET is started, and
>> from this GET you retrieve the Content-Length.  If this matches the cached
>> data, you use that; if not, you start a new GET and retrieve the file.
> I really don't like this algorithm.  The program uses open-uri for
> most of its work, which is great for getting things going,
> proof-of-concept and so on, but for a scalable application I think
> we should aim to move to Net::HTTP and follow the protocol more
> closely. [1] Why?
> A Get is started.  Well, when the server responds with the content
> length it also responds with everything else as well, and in the
> case of a 200 response, that includes the whole of the resource. To
> the best of my knowledge (and because of my attempts to get this
> working with Rubric I've read it fairly recently) there is no way in
> the protocol to stop a get in the middle. Indeed (indent munged):
>      open(uri_str, :proxy => @http_proxy, :content_length_proc =>
>      lambda {|t| size = t; raise "break"}) {|i| }
> doesn't tell the server to stop sending the contents.  If the server
> detects something has stopped, then whether it does so "in time" is
> rather like a race condition.  Running over a 56k modem this is
> rather likely to be too late.
> Suppose the contents change, but the length doesn't. At present we
> would be unable to detect this.
> "if not, you start a new GET and retrieve the file"  Then you get it
> again. Ouch.

Yea, your logic makes sense.  It could be that the current code is going
to be slow whether we download the new index or not.  It's definitely not

> I think we should be using the head method, and the Etag,
> Last-Modified and any other applicable headers, which really
> necessitates using Net::HTTP.  Much more tedious to program, but
> much more courteous to the server('s owners).
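
A minimal sketch of the HEAD-based check described above, assuming the server actually sends an ETag or Last-Modified header. The helper names (validators, stale?, head_validators) are hypothetical and not part of RubyGems. Unlike the Content-Length comparison, this should also notice a change that leaves the length the same.

```ruby
require "net/http"
require "uri"

# Pick out the validator headers worth caching next to the index.
# Works with anything that answers [] by header name.
def validators(response)
  { etag: response["ETag"], last_modified: response["Last-Modified"] }
end

# Compare cached validators with fresh ones. Prefer ETag, then fall back
# to Last-Modified.
def stale?(cached, current)
  [:etag, :last_modified].each do |key|
    return cached[key] != current[key] if cached[key] && current[key]
  end
  true # nothing comparable, so assume the cache is out of date
end

# HEAD returns headers only, so no index body crosses the wire.
def head_validators(uri)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    validators(http.head(uri.request_uri))
  end
end
```

A caller would store the result of head_validators alongside the cached index and download again only when stale? returns true.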

I actually tried to do if-modified-since originally, and I ran into
problems with RubyForge not responding correctly (very weird stuff that
Tom Copeland and I couldn't figure out).  My ruby code was working on
every other server I tried, but I was taking too long to get it to work,
so Rich stepped in and whipped up the current incarnation.  I think using
If-Modified-Since is the right way to go.  We wouldn't actually need to
use the HEAD method in this case, since the "don't send data" behavior is
built into the HTTP spec when using If-Modified-Since.
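
As a rough illustration of the If-Modified-Since idea with Net::HTTP: the server answers 304 with no body when the cached copy is still current, so a single request covers both cases. The method names and the cached_mtime parameter are assumptions for this sketch, not the RubyGems code.

```ruby
require "net/http"
require "time"
require "uri"

# Build a GET that is conditional on the cached copy's timestamp.
# cached_mtime is a Time, or nil when nothing has been cached yet.
def conditional_get_request(uri, cached_mtime)
  request = Net::HTTP::Get.new(uri)
  request["If-Modified-Since"] = cached_mtime.httpdate if cached_mtime
  request
end

# Returns the fresh body, or nil when the server reports 304 Not Modified.
def fetch_if_modified(uri, cached_mtime)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    response = http.request(conditional_get_request(uri, cached_mtime))
    case response
    when Net::HTTPNotModified then nil            # 304: no body was sent
    when Net::HTTPSuccess     then response.body  # 200: full, fresh copy
    else response.error!                          # raise on anything else
    end
  end
end
```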

>> This is present from 0.8.0 and up; older clients will always download
>> the complete file.  I don't know your client distribution, but this may
>> be the primary source of the load described elsewhere in the thread.
> Getting it twice really doesn't help this.
>> Eivind.
>          Hugh
> [1] Please note: I am really in favour of this project, and
> think criticism that is intended to be constructive is a valid part of
> "first make it work, then make it work right, then make it fast".
> http://c2.com/cgi/wiki?MakeItWorkMakeItRightMakeItFast

I totally agree.  We're in stages 2 and 3 right now.

> This approach (open-uri) is perfectly valid for early stages of the
> project, but I think we need to move beyond it if downloads are
> becoming a problem.

We can actually use open-uri with the If-Modified-Since approach.  I think
that would be ideal.
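
A sketch of how that might look with open-uri, which accepts request headers as string keys in its options hash. open-uri raises OpenURI::HTTPError for a 304, so the "not modified" case shows up as a rescue. The method names are hypothetical, and URI.open is the modern spelling of the plain open call used with open-uri at the time.

```ruby
require "open-uri"
require "time"

# Headers that make the GET conditional; empty when nothing is cached yet.
def conditional_headers(cached_mtime)
  cached_mtime ? { "If-Modified-Since" => cached_mtime.httpdate } : {}
end

# Returns the fresh index text, or nil if the cached copy is still current.
def fetch_index(url, cached_mtime)
  URI.open(url, conditional_headers(cached_mtime)) { |io| io.read }
rescue OpenURI::HTTPError => e
  # open-uri reports 304 as an error; anything else is a real failure.
  raise unless e.io.status.first == "304"
  nil
end
```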

Thanks for your comments and ideas, Hugh.