[Blacklight-development] blacklight MARC importer
Robert Haschart
rh9ec at virginia.edu
Mon May 5 14:43:28 EDT 2008
Jason,
I've taken a look at the Oregon State records, and the records are
something of a mess. I wrote earlier today on the blacklight dev list
that marc4j (as distributed) has a problem: if it encounters a malformed
record, it stops reading at the point of the error, and if you then try
to continue, it resumes reading at that same point, expecting a new
record to begin there.
That must be why MarcImporter was reporting more records, although by
that point it is so confused that anything it reports is nonsense. In
the next day or so I plan to check in a modified marc4j that handles
malformed data more robustly. I need to decide how much robustness to
try for. Also, Jonathon Rochkind made a good suggestion: it should
somehow report the malformed records rather than quietly skipping over
them.
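To illustrate the kind of robustness and reporting discussed above, here is a minimal sketch in Python. It is not the marc4j code or the planned patch. It assumes that records can be resynchronized on the ISO 2709 record terminator byte (0x1D) and that this byte never occurs inside field data, which may not hold for badly damaged files. The function names (split_records, leader_problem, scan) are made up for this example.

```python
RECORD_TERMINATOR = b"\x1d"   # ISO 2709 end-of-record byte
LEADER_LENGTH = 24            # a MARC leader is always 24 bytes


def split_records(data: bytes):
    """Yield (offset, chunk) for each terminator-delimited chunk.

    Assumption: splitting on 0x1D is a safe way to resynchronize
    after a bad record, instead of trusting the length in the leader.
    """
    start = 0
    while start < len(data):
        end = data.find(RECORD_TERMINATOR, start)
        if end == -1:
            end = len(data) - 1   # trailing chunk with no terminator
        yield start, data[start:end + 1]
        start = end + 1


def leader_problem(chunk: bytes):
    """Return a description of a leader problem, or None if it looks sane."""
    if len(chunk) < LEADER_LENGTH:
        return "shorter than a 24-byte leader"
    length_field = chunk[:5]      # leader positions 0-4: record length
    if not length_field.isdigit():
        return f"non-numeric record length {length_field!r}"
    if int(length_field) != len(chunk):
        return f"leader says {int(length_field)} bytes, chunk has {len(chunk)}"
    return None


def scan(data: bytes):
    """Count plausible records and collect (offset, problem) for the rest."""
    good, bad = 0, []
    for offset, chunk in split_records(data):
        problem = leader_problem(chunk)
        if problem is None:
            good += 1
        else:
            bad.append((offset, problem))   # report it, don't skip quietly
    return good, bad
```

Calling scan() on the raw bytes of a file returns the number of chunks whose leader length matches, plus a list of byte offsets and reasons for the ones that do not, so a bad record is reported and the reader carries on with the next one.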
Simply getting a reliable count of the records in the Oregon State data
has proven to be a challenge. Since marc4j was having trouble reading
them, my first thought was to run yaz-marcdump and grep for the '001'
fields, which produced a count of 1272508, quite a bit less than the
1311996 records you were reporting. Trying the same for the '245' field
produced a count of 1312463, quite a few more than you reported.
Eventually I patched the marc4j library and ran it on the data, and there
are indeed 1311996 records. Of these, 76 have bad MARC record leaders,
rendering them unreadable, and 39534 have no 001 field, making them
invalid records: even if they ran through the indexer successfully, each
of them would replace the preceding one.
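The overwriting effect can be shown with a small Python sketch. This is not the actual indexer code. It assumes the index uses the 001 value as the unique document key and that every record with no 001 ends up under the same empty key. The record shape (a dict from tag to value) and the name index_records are hypothetical.

```python
def index_records(records):
    """Index records by their 001 value and count overwrites.

    Each record is a dict mapping a MARC tag to a value (a made-up
    shape for illustration only).
    """
    index = {}
    overwritten = 0
    for record in records:
        # Assumption: a missing 001 becomes an empty key.
        key = record.get("001", "")
        if key in index:
            overwritten += 1      # an earlier document is replaced
        index[key] = record
    return index, overwritten


records = [
    {"001": "a", "245": "First title"},
    {"245": "No id, second"},
    {"245": "No id, third"},
    {"001": "d", "245": "Fourth title"},
]
index, overwritten = index_records(records)
```

Under these assumptions the four records yield only three indexed documents: the two records with no 001 share a key, so the later one replaces the earlier one.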
-Robert Haschart