[Blacklight-development] blacklight MARC importer
Robert Haschart
rh9ec at virginia.edu
Tue Apr 22 16:36:39 EDT 2008
Jason,
The speed you are seeing is really puzzling to me. On our latest large
indexing run (about 4 million records) I was seeing indexing rates
starting out at around 200 per second, and at the end dropping off to
about 150 per second.
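For a rough sense of scale, the rates mentioned in this thread can be turned into wall-clock times. This is only a back-of-the-envelope sketch: the helper name hours_to_index is made up for illustration, and it assumes a constant rate, which a real indexing run will not hold.

```python
def hours_to_index(records: int, records_per_second: float) -> float:
    """Wall-clock hours to index `records` at a constant rate (an idealization)."""
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    return records / records_per_second / 3600.0


if __name__ == "__main__":
    # Record counts and rates are the figures quoted in this thread.
    cases = [
        ("4 million records at 200/s", 4_000_000, 200),
        ("4 million records at 150/s", 4_000_000, 150),
        ("1,379,299 records at 12/s", 1_379_299, 12),
    ]
    for label, records, rate in cases:
        print(f"{label}: about {hours_to_index(records, rate):.1f} hours")
```

Under that assumption, 4 million records take roughly 5.6 to 7.4 hours at 200 to 150 records per second, while 1,379,299 records at 12 per second take about 32 hours.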
Even more puzzling is the fact that you are seeing different numbers of
records using the Java importer as compared to your Ruby version. I'm
really interested in tracking down the cause of this discrepancy.
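One possible way to start narrowing down a count discrepancy like this is to count the records in the raw file two independent ways and see whether the counts agree. In the MARC 21 transmission format each record begins with a 24-byte leader whose first five bytes give the record length in ASCII digits, and each record ends with the record terminator byte 0x1D. The Python sketch below is a diagnostic aid only: it is not how the Java importer, the Ruby script, or MarcEdit count records, the function names are made up for illustration, and it reads the whole input into memory.

```python
RECORD_TERMINATOR = b"\x1d"  # MARC 21 / ISO 2709 record terminator
LEADER_LENGTH = 24           # fixed size of the MARC leader


def count_by_terminator(data: bytes) -> int:
    """Count records by counting record-terminator bytes."""
    return data.count(RECORD_TERMINATOR)


def count_by_leader_length(data: bytes) -> int:
    """Count records by hopping from leader to leader using the length field."""
    count = 0
    offset = 0
    while offset + 5 <= len(data):
        length_field = data[offset:offset + 5]
        if not length_field.isdigit():
            break  # length field is not numeric; stop rather than guess
        length = int(length_field)
        if length < LEADER_LENGTH or offset + length > len(data):
            break  # implausible length or truncated final record
        count += 1
        offset += length
    return count


def fake_record(body: bytes) -> bytes:
    """Build a synthetic record (hypothetical helper, for demonstration only)."""
    length = LEADER_LENGTH + len(body) + 1  # leader + body + terminator
    leader = f"{length:05d}".encode("ascii") + b" " * (LEADER_LENGTH - 5)
    return leader + body + RECORD_TERMINATOR


if __name__ == "__main__":
    clean = fake_record(b"abc") + fake_record(b"defgh")
    # A stray terminator byte inside a record inflates the terminator count.
    dirty = clean + fake_record(b"ab\x1dc")
    for name, data in [("clean", clean), ("dirty", dirty)]:
        print(name, count_by_terminator(data), count_by_leader_length(data))
```

If the two counts differ on a real file, stray terminator bytes or wrong leader lengths are worth investigating, though that is only one of several possible explanations for a difference between tools.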
-Bob Haschart
Bess Sadler wrote:
> Hi, Jason.
>
> I'm copying Bob Haschart on this message since he's our lead on the
> indexer part of Blacklight and I think he can give you the latest
> stats on indexing times and on your discrepancy. I'm also copying the
> blacklight-development mailing list, since I think this is a
> conversation that might be of interest to other folks using Blacklight.
>
> We started moving toward the Java indexer for a couple of reasons,
> mainly speed, but also because we wanted to produce something that
> would be more generally useful to the community than a
> Blacklight-specific indexer would be, and Java seemed to be the way to
> go for that since VuFind was already working on one. The other
> advantage of working on a Java-based indexer is that it becomes easier
> to fold the code back into Solr itself. Ultimately I think this would
> be the greatest service to the library community wanting to use Solr,
> if Solr could handle MARC records natively and there was a
> well-documented way to configure the mappings. I've talked to Erik
> Hatcher about this a bit and he seems to think it would not be all
> that hard to do, if someone who understands the code could make some
> time to work on it. I'm hopeful we can make some time to work on this,
> especially since we have Erik here in town... there's even been talk
> of an after-work beer and coding session. :)
>
> We also considered going the CSV route, and I did some work on that
> for a week or so, but I found that the content of our MARC records was
> so unpredictable that I couldn't find a column delimiter that was
> guaranteed never to be inside a data string, and eventually dealing
> with all the control characters, punctuation, weird spacing, and who
> knows what else drove me nuts and I dropped the project. I still think
> it's a great idea in theory, but I wonder if real-world MARC data can
> ever be sanitized enough to make this work.
>
> Bob, can you give us some stats about indexing times for the current
> Java indexer? I know it was about three times faster than our Ruby
> indexer (for indexing ~4 million records, we went from three days to
> less than one day) but there were a lot of factors involved there so
> I'm not sure it's a clean comparison.
>
> Bess
>
> On Apr 21, 2008, at 10:24 AM, Jason Ronallo wrote:
>
>> Hi, Bess,
>> Just wanted to let you know that I was able to finally put all the
>> pieces together to get the MARC importer working. Strange thing is
>> that the MARC importer reports that it indexed 1_379_299 records while
>> both a Ruby script (with Forgiving Reader turned on) and MarcEdit
>> report that there are 1_311_996 records in the file I was indexing. I
>> haven't had a chance to investigate why there might be such a
>> discrepancy, but thought it was a big enough one to let you know.
>>
>> Have you gone to the Java indexer to speed things up? Have you done
>> any benchmarks on your Ruby vs. Java indexers? I hope to get the
>> chance to look at your older Ruby indexer to see what it was doing.
>> Did you send records in batches? I'm just wondering if you happen to
>> know records per second on a large batch. For that large batch through
>> the Java indexer I got approximately 12 records per second.
>>
>> I'm considering writing a Ruby script that spits out a CSV file and
>> then sends that to Solr. I've read that indexing this way ought to be
>> fast. At least I'd have more control of some of the logic since I'm
>> not ready to learn Java just to write the bits that fall outside of
>> what the MARCImporter can already do.
>>
>> In any case thanks to you and the other blacklight developers who
>> definitely got me started playing with Solr.
>>
>> take care,
>> Jason
>
>
> Elizabeth (Bess) Sadler
> Research and Development Librarian
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904
>
> bess at virginia.edu
> (434) 243-2305
>
>