[Blacklight-development] blacklight MARC importer

Robert Haschart rh9ec at virginia.edu
Mon May 5 14:43:28 EDT 2008


Jason,

I've taken a look at the Oregon State records, and they are something
of a mess. I wrote earlier today on the blacklight dev list that marc4j
(as distributed) has a problem: when it encounters a malformed record,
it stops reading at the point of the error. If you then try to
continue, it resumes reading from that same point, expecting a new
record to begin there.

That must be why MarcImporter was reporting more records, although by
that point it is so confused that anything it reports is just nonsense.
In the next day or so I plan to check in a modified marc4j that handles
malformed data more robustly. I still need to decide how much
robustness to try for. Also, Jonathan Rochkind made a good suggestion:
it should somehow report the malformed records rather than quietly
skipping over them.
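
For illustration, here is a rough Python sketch of one way a reader
could recover after a bad record and still report it. This is not the
marc4j patch; the names read_records and record_problem are made up for
the example. It assumes the ISO 2709 layout (a 24-byte leader whose
first five bytes give the record length, and a 0x1D record terminator)
and assumes that a stray 0x1D does not occur inside the damaged data.

```python
RECORD_TERMINATOR = b"\x1d"  # ISO 2709 record terminator
LEADER_LENGTH = 24


def record_problem(record: bytes):
    """Return a short description of what is wrong, or None if the
    leader's length field is consistent with the bytes we have."""
    if not record.endswith(RECORD_TERMINATOR):
        return "missing record terminator"
    if len(record) < LEADER_LENGTH:
        return "record shorter than a 24-byte leader"
    length_field = record[0:5]
    if not length_field.isdigit():
        return "record length in leader is not numeric"
    if int(length_field) != len(record):
        return "leader length does not match actual record length"
    return None


def read_records(data: bytes):
    """Yield (offset, raw_record, problem) for each chunk of the input.

    Instead of trusting the leader to say where the next record starts,
    resynchronize on the next record terminator, so one bad record does
    not throw off everything after it."""
    offset = 0
    while offset < len(data):
        end = data.find(RECORD_TERMINATOR, offset)
        stop = len(data) if end == -1 else end + 1
        record = data[offset:stop]
        yield offset, record, record_problem(record)
        offset = stop
```

A caller can then log every chunk whose problem is not None, with its
byte offset, instead of dropping it silently.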

Simply getting a reliable count of the records in the Oregon State data
has proven to be a challenge. Since marc4j was having trouble reading
them, my first thought was to run yaz-marcdump and grep for the '001'
fields, which produced a count of 1272508, quite a bit fewer than the
1311996 records you were reporting. Trying the same for the '245' field
produced a count of 1312463, somewhat more than you reported.
Eventually I patched the marc4j library and ran it on the data, and
there are indeed 1311996 records. Of these, 76 have bad MARC record
leaders, rendering them unreadable, and 39534 have no 001 field, making
them invalid records: even if they ran through the indexer
successfully, each one would replace the preceding one.
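
As a rough cross-check on counts like these, the sketch below tallies
total records, records with an unusable leader, and records with no 001
entry in the directory. It is only an illustration under the standard
MARC layout (base address of data in leader bytes 12-16, 12-byte
directory entries); the function names are hypothetical, and it is not
how marc4j or yaz-marcdump do their counting.

```python
RECORD_TERMINATOR = b"\x1d"  # ISO 2709 record terminator
LEADER_LENGTH = 24
ENTRY_LENGTH = 12  # 3-byte tag + 4-byte field length + 5-byte start


def directory_tags(record: bytes):
    """Return the field tags listed in the record's directory, or None
    if the leader or directory cannot be interpreted."""
    leader = record[:LEADER_LENGTH]
    if len(leader) < LEADER_LENGTH:
        return None
    if not leader[0:5].isdigit() or not leader[12:17].isdigit():
        return None
    base = int(leader[12:17])  # base address of data
    if base < LEADER_LENGTH + 1 or base > len(record):
        return None
    # The directory runs from the end of the leader up to the field
    # terminator that sits just before the base address.
    directory = record[LEADER_LENGTH:base - 1]
    if len(directory) % ENTRY_LENGTH != 0:
        return None
    return [
        directory[i:i + 3].decode("ascii", "replace")
        for i in range(0, len(directory), ENTRY_LENGTH)
    ]


def tally(data: bytes):
    """Count records, unusable leaders, and records lacking an 001."""
    counts = {"total": 0, "bad_leader": 0, "no_001": 0}
    for chunk in data.split(RECORD_TERMINATOR):
        if not chunk:
            continue  # nothing between terminators, or end of input
        counts["total"] += 1
        tags = directory_tags(chunk)
        if tags is None:
            counts["bad_leader"] += 1
        elif "001" not in tags:
            counts["no_001"] += 1
    return counts
```

Reading the tags from the directory avoids depending on how a dump tool
formats its output, which is where a plain grep can miscount.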

-Robert Haschart



