[Blacklight-development] blacklight MARC importer

Jason Ronallo jronallo at gmail.com
Wed Apr 23 08:42:10 EDT 2008


On Tue, Apr 22, 2008 at 4:36 PM, Robert Haschart <rh9ec at virginia.edu> wrote:
> Jason,
>
>  The speed you are seeing is really puzzling to me. On our latest large
> indexing run (about 4 million records) I was seeing indexing rates starting
> out at around 200 per second, and at the end dropping off to about 150 per
> second.

I should have known better than to put out a number without giving
some indication that the machine the indexing ran on is underpowered.
I'm guessing that explains it.

>  Even more puzzling is the fact that you are seeing different numbers of
> records using the java importer as compared to your ruby version.  I'm
> really interested in tracking down the cause of this discrepancy.

This is the one that really made me stop and think. Just to clarify,
the Ruby script I wrote simply counted the records (using
ForgivingReader) and did no indexing, but the count from the Ruby
script did match the count from MARCEdit. I do remember that at one
point, when working with these records earlier, I couldn't use the
regular MARC::Reader because it was throwing errors. In this case at
least you can independently test with the same batch of records. The records I
was indexing are the Oregon State University and Washington State
University records at archive.org.[1] (open_library++ open_data++) One
note on IA says the records were originally processed by MARCEdit.

For the OSU records MARCImporter reported 1_379_299 records were
imported, while MARCEdit and a Ruby counting script counted 1_311_996
records. During this batch I did notice some errors, something to do
with getting the length of the record. I'm sorry I didn't copy the
error at the time. Because of these errors I fully expected
MARCImporter to report _fewer_ records than I had counted before, not
more.
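
One way two tools can disagree on a record count is that they find
record boundaries differently: by the end-of-record byte (0x1D), or by
the five-digit record length at the start of each leader. The sketch
below counts the same file both ways using only the Ruby standard
library. It is an illustration under those assumptions, not the code
used by MARCImporter, MARCEdit, or ruby-marc, and the method names are
made up for this example.

```ruby
require "stringio" # only needed for trying the counters on in-memory data

# MARC 21 transmission format ends every record with byte 0x1D.
RECORD_TERMINATOR = "\x1D".b

# Count records by splitting the stream on the terminator byte.
# The record lengths stated in the leaders are ignored entirely.
def count_by_terminator(io)
  count = 0
  io.each_line(RECORD_TERMINATOR) do |chunk|
    # A trailing fragment with no terminator is not counted.
    count += 1 if chunk.end_with?(RECORD_TERMINATOR)
  end
  count
end

# Count records by trusting the five-digit length at the start of each
# leader. Stops at the first prefix that is not five digits, or when
# the stream ends before the stated length is reached.
def count_by_leader_length(io)
  count = 0
  while (prefix = io.read(5))
    break unless prefix.match?(/\A\d{5}\z/)
    length = prefix.to_i
    break if length < 5
    body = io.read(length - 5)
    break if body.nil? || body.bytesize < length - 5
    count += 1
  end
  count
end

# Usage: ruby count_marc.rb file.mrc [more.mrc ...]
ARGV.each do |path|
  File.open(path, "rb") do |f|
    by_term = count_by_terminator(f)
    f.rewind
    by_len = count_by_leader_length(f)
    puts "#{path}: #{by_term} by terminator, #{by_len} by leader length"
  end
end
```

If the two numbers differ for a file, at least one leader length does
not match the actual position of its terminator, which would be a
place to start looking.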

For the WSU batch MARCImporter reported indexing 1_576_308 records,
which matched exactly the number reported by the Ruby counting script.
The only change I made with this batch was to give Java less memory.
But when I did a search on the Solr data it reported 1_589_574 records
matched for a field in which I had stored the name of the batch of
records. This might have more to do with my misunderstanding of how
Solr works than anything else.
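
To make the Solr side of that comparison repeatable, one option is to
ask Solr for the hit count alone (rows=0) and read numFound from the
response. The sketch below only builds the request URL and parses a
JSON response body. The base URL, the field name "batch", and the
value "wsu" are placeholders rather than the real schema, and it
assumes the JSON response writer (wt=json) is available.

```ruby
require "json"
require "uri"

# Build a select URL that asks Solr only for the hit count (rows=0).
# base_url, field, and value are supplied by the caller.
def count_url(base_url, field, value)
  query = URI.encode_www_form(q: "#{field}:\"#{value}\"", rows: 0, wt: "json")
  "#{base_url}/select?#{query}"
end

# Pull numFound out of a Solr JSON response body.
def num_found(json_body)
  JSON.parse(json_body).fetch("response").fetch("numFound")
end

# Example with placeholder names:
#   url = count_url("http://localhost:8983/solr", "batch", "wsu")
#   fetch url with any HTTP client, then call num_found(body)
```

For the WSU figures above, the gap to explain is
1_589_574 - 1_576_308 = 13_266 documents.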

I hope this information helps you. Please let me know how else I can
help investigate this problem.

Jason

[1] http://www.archive.org/details/marc_oregon_summit_records

