[Blacklight-development] storing marc21 in solr

Bess Sadler eos8d at virginia.edu
Mon Mar 23 12:12:22 EDT 2009


Hi, folks. Here are some of my thoughts on this topic. In summary:  
When I started this conversation I thought we needed to standardize on  
marc21, but now I think we can just let people use either one.

Some goals of the project:
	- we want to access marc fields that aren't explicitly indexed (see
	  the sketch just after this list)
	- we want to be able to display the full record in a human-readable way
	- we want to provide RESTful access to our records for easy
	  mash-up-ability (e.g., in xml, json, or mrc format)
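
As a rough illustration of that first goal: once the stored record is
back in ruby-marc, any field can be pulled out on demand, indexed or
not. This is only a minimal sketch -- the 260$b publisher is an
arbitrary example, and stored_marc21 stands in for whatever string
comes back out of solr:

  require 'marc'

  # build a full MARC::Record from the stored marc21 string
  record = MARC::Record.new_from_marc(stored_marc21)

  # pull out a field that never got its own solr field
  publisher = record['260'] ? record['260']['b'] : nil
  puts publisher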

If we accept that those are among our goals, then we have to store the
full marc record somewhere. And if we accept that those goals are going
to be met by the plugin (i.e., that this is core functionality rather
than an individual institution's job to implement), then we have to
pick a way to write that core functionality. I want to be able to
access marc as marc (I don't want to parse marc-xml as if it were any
other flavor of xml, and I don't want to parse a marc string), and I
especially don't want to write any methods that act on marc more than
once (e.g., "Act this way if the stored marc record is xml, and act
this other way if it's marc21.").

I could, however, see core behavior where the CatalogController grabs
whatever is in the stored_marc_display field, tests whether it's xml or
marc21, and creates a MARC::Record object from it, and then all marc
behaviors just act on that MARC::Record object. I really want to be
writing code against MARC::Record objects as opposed to writing xml- or
string-parsing code. My main motivation for thinking we had to store
marc21 was that I thought ruby-marc could only create MARC::Record
objects from marc21, but thanks to Ross's email I have now discovered
MARC::XMLReader.
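
A minimal sketch of that detection step, assuming ruby-marc and
assuming MARC::XMLReader will take an IO object; the helper name and
the peek-at-the-first-character test are just illustrative, not
settled:

  require 'marc'
  require 'stringio'

  # raw is whatever string came back from the stored_marc_display field
  def marc_record_from_stored(raw)
    if raw =~ /\A\s*</
      # looks like marc-xml; hand it to MARC::XMLReader
      MARC::XMLReader.new(StringIO.new(raw)).first
    else
      # otherwise treat it as a marc21 transmission string
      MARC::Record.new_from_marc(raw)
    end
  end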

So I think I should now be able to go off and write the behaviors I  
described above, as well as tests for the validity of stored marc21  
and marc-xml. Thank you, everyone.

Bess


On Mar 23, 2009, at 10:39 AM, Ross Singer wrote:

> I realize I'm coming into this a little late -- not sure if Bess's
> other thread about Base64 encoding this "solved it".
>
> I agree with Naomi that we don't need to standardize on anything,
> but, obviously, having more people do it the same way helps provide
> some traction.
>
> I would argue that storing as a marc21 string has the bigger upside
> for now.  ruby-marc can only use REXML as its parser presently
> (although Ed and I had a conversation just last week about refactoring
> it to support Nokogiri as well), so the binary marc reader (and, if
> needed, writer) is way faster, even accounting for Base64
> encoding/decoding.  It would also keep the index a whole lot smaller,
> if anybody cares about that.
>
> BTW, the reason you're seeing the line breaks in Bess's output is
> that .to_s pretty-prints the record.  If she called .to_marc it would
> look like Naomi's mushy blob of text.
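
(A quick sketch of the difference Ross is describing, using the same
test file Bess's script reads -- path shortened -- and nothing beyond
plain ruby-marc:)

  require 'marc'

  record = MARC::Reader.new('test_data.utf8.mrc').first

  puts record.to_s     # pretty-printed, one field per line
  puts record.to_marc  # the raw marc21 transmission string - the "mushy blob"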
>
> -Ross.
>
> 2009/3/23 Naomi Dushay <ndushay at stanford.edu>:
>> Maybe it's too soon to worry about this.  Is there any pressing need
>> for all of us to store our marc data in our indexes in the same
>> format?  What's the real gain?  Why should the blacklight code be
>> rigid about this?  Remember, a lot of our solr documents won't have
>> marc.
>>
>> What if there was a blacklight config setting to say what format your
>> marc was in, and the blacklight code accommodated the different
>> formats?
>>
>> Jessie is already working on manipulating the marc21 in the Stanford
>> index to create the UI features we want at Stanford.  UVa has been
>> using marcxml.  Jessie should jump in and tell us if the work he's
>> doing can be easily adapted to other marc formats -- hopefully the
>> answer is "yes, it would be trivial."  But let's get those features
>> coded first, and refactor them later.
>>
>> I think we're doing great things here.  UVa has put out a great body
>> of code.  The active Stanford adoption of the code is causing an
>> excellent refactoring of the code, making it more configurable;
>> additional features will be added as well.  Further active adopters
>> or folks that get down and dirty with the demo will provide more
>> feedback, causing further improvements - refactoring or new features
>> or whatever.
>>
>> The demo for release 2.0 is going to have some warts; I don't think
>> there's a way around it.  Usage driven methodology would say we have
>> some folks take the demo out for a spin, and get their feedback.
>> Have folks tried the current demo?  Have they tried to go the next
>> step, of getting their own data behind the blacklight front end?  If
>> not ... maybe we've gotten ahead of ourselves?
>>
>> Trying to guess a lot of this stuff before there's expressed demand
>> from the users ... I think that can make extra work, and the results
>> might not be useful, in the end.
>>
>> Okay, I've got my asbestos suit on - let the flames be thrown!
>>
>> - Naomi
>>
>> On Mar 22, 2009, at 6:03 PM, Bill Dueber wrote:
>>
>> solrmarc writes to the underlying lucene index using a binary field
>> type that is unavailable in solr.  This makes for easy storage of
>> MARC, but makes round-tripping the data a little harder if that's
>> your thing.  Given that's the case, I wonder if a standard solr
>> String field (probably compressed) of marcxml would be a better bet
>> all around.
>>
>> Or push dchud into finishing his spec implementation of marc-json
>> and store it as json :-)
>>
>> On Sun, Mar 22, 2009 at 4:14 PM, Jonathan Rochkind  
>> <rochkind at jhu.edu> wrote:
>>>
>>> It makes sense to me to standardize 'what kind of MARC' we all
>>> store in SOLR.  I guess you guys decided it made more sense for
>>> that to be MARC21 than MARC-XML?  Curious what the motivation for
>>> that was?
>>>
>>> Since MARC21 is a binary format, you've got to make sure that what
>>> is going into the interface is binary-identical to what it should
>>> be; it should not be re-encoded or translated at all.  MARC21 data
>>> can be in one of at least two character encodings, and which one is
>>> used is theoretically noted in the MARC file (although it's often
>>> wrong).  The MARC21 record length (in bytes) is also noted in the
>>> MARC header/leader.  The whole thing needs to be byte-for-byte
>>> untouched when you stick it in your SOLR index; I don't know enough
>>> about SOLR or the indexing process to know about places that might
>>> introduce corruption there.
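
(A cheap sanity check along these lines, purely a sketch: the first
five bytes of the MARC leader carry the record length, so a record
pulled back out of solr can be compared against what its own leader
claims.  Here "stored" stands in for the string from the solr field:)

  declared = stored[0, 5].to_i  # leader positions 00-04: record length in bytes
  actual   = stored.length      # bytes, on ruby 1.8

  warn "marc21 mangled in transit: leader says #{declared}, got #{actual}" if declared != actual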
>>>
>>> MARC-XML is definitely a lot easier to work with, as it's just an
>>> ordinary XML file in UTF-8.
>>>
>>> I don't know a whole lot about MARC21, but it would not surprise me
>>> at all if, like the character encoding issue, much of our MARC21
>>> "in the wild" was actually illegal in various undocumented ways,
>>> but in ways that our traditional ILS's are tolerant of.
>>>
>>> It's also possible there is a bug in the ruby MARC library.
>>>
>>> Jonathan
>>>
>>> ________________________________________
>>> From: blacklight-development-bounces at rubyforge.org
>>> [blacklight-development-bounces at rubyforge.org] On Behalf Of Naomi  
>>> Dushay
>>> [ndushay at stanford.edu]
>>> Sent: Sunday, March 22, 2009 1:47 PM
>>> To: blacklight-development at rubyforge.org
>>> Subject: Re: [Blacklight-development] storing marc21 in solr
>>>
>>> Bess,
>>>
>>> I create our index using solrmarc, which writes directly to the
>>> index (do not use solr, do not pass go, do not collect $200).  So I
>>> can't be much help, sorry.
>>>
>>> However, I'll tell you that our marc21 doesn't look all nicely
>>> formatted like yours - it just appears as a single line and looks
>>> like this:
>>>
>>> 00651nam a22001575?
>>>
>>> 4500001000800000008004100008020001500049035002200064100001900086245006900105260003300174300001100207440005400218596000600272999021500278
>>> a575946 910118s1990||||gw |||||||||||||||||ger|d    a3412215880
>>> a(CSt)notisACX2206 10 aSeifert, Arno. 14 aDer Ruckzug der biblischen
>>> Prophetie von der neueren Geschichte. 0  aKoln : bBohlau
>>> Verlag, c1990    a207 p.  0 aBeihefte zum Archiv für
>>> Kulturgeschichte ; vv.31    a1    aCB3 .A6 SUPPL. V.
>>> 31
>>>
>>> wLC
>>>
>>> c1
>>> i36105035087092 d6/18/2008 e6/18/2008 kCHECKEDOUT lSTACKS mGREEN  
>>> n4 p
>>> $40.00 rM sY tSTKS-MONO u1/18/1991 o.TECHSTAFF. c:al o.TECHSTAFF.  
>>> PUB
>>> 1/22/91/pb o.TECHSTAFF. MIDSPINE.v.31 suppl.
>>>
>>> Maybe you don't have "marc21" but something else?
>>>
>>> - Naomi
>>>
>>> On Mar 22, 2009, at 7:40 AM, Bess Sadler wrote:
>>>
>>>> Hey, folks.
>>>>
>>>> Hopefully I'm just having a moment of stupidity, but I've been
>>>> looking at this for a while and I can't figure it out, so I'm
>>>> asking the community. We want to store marc21 in our solr index,
>>>> right? UVA has always stored marc-xml, but we had a talk with
>>>> Stanford folks a while ago about this; there seem to be lots of
>>>> good reasons to store marc21 instead, and that's fine with me, so
>>>> now I'm just trying to make it happen.
>>>>
>>>> BUT... when I store what I think is marc21 into solr, it doesn't
>>>> come back out again quite right. I've isolated a really simple
>>>> example of this:
>>>>
>>>> require 'rubygems'
>>>> require 'marc'
>>>> require 'rsolr'
>>>>
>>>> puts " ************* from file ****************** "
>>>>
>>>> reader = MARC::Reader.new('/usr/local/projects/bl-demo/data/test_data.utf8.mrc')
>>>> record2 = reader.first
>>>> puts record2.to_s
>>>>
>>>> puts " ************* from solr ****************** "
>>>>
>>>> rsolr = RSolr.connect
>>>> response = eval(rsolr.select(:q=>'*:*',:wt=>'ruby'))
>>>> marc_display = response["response"]["docs"][0]["marc_display"]
>>>> record = MARC::Record.new_from_marc(marc_display)
>>>> puts record.to_s
>>>>
>>>> My output for that script looks like this:
>>>>
>>>> ************* from file ******************
>>>> LEADER 00799cam a2200241 a 4500
>>>> 001    00282214
>>>> 003 DLC
>>>> 005 20090120022042.0
>>>> 008 000417s1998    pk            000 0 urdo
>>>> 010    $a    00282214
>>>> 025    $a P-U-00282214; 05; 06
>>>> 040    $a DLC $c DLC $d DLC
>>>> 041 1  $a urd $h snd
>>>> 042    $a lcode
>>>> 050 00 $a PK2788.9.A9 $b F55 1998
>>>> 100 1  $a Ayaz, Shaikh, $d 1923-1997.
>>>> 245 10 $a Fikr-i Ayāz / $c murattibīn, Āṣif Farruk̲h̲ī,
>>>> Shāh Muḥammad Pīrzādah.
>>>> 260    $a Karācī : $b Dāniyāl, $c [1998]
>>>> 300    $a 375 p. ; $c 23 cm.
>>>> 546    $a In Urdu.
>>>> 520    $a Selected poems and articles from the works of renowned
>>>> Sindhi poet; chiefly translated from Sindhi.
>>>> 700 1  $a Farruk̲h̲ī, Āṣif, $d 1959-
>>>> 700 1  $a Pīrzādah, Shāh Muḥammad.
>>>> ************* from solr ******************
>>>> LEADER 00799cam a2200241 a 4500
>>>> 001    00282214 *
>>>> 003 DLC*
>>>> 005 20090120022042.0*
>>>> 008 000417s1998    pk            000 0 urdo *
>>>>
>>>>
>>>> So, when I take a file (test_data.utf8.mrc) and read in the marc
>>>> records from disk, it behaves as expected. But if I store it in
>>>> solr first, then what I get back out of solr appears to be badly
>>>> formed marc somehow. MARC::Record appears to read it in, but the
>>>> record created is truncated.
>>>>
>>>> Naomi, how do you folks store marc21? Do you do anything special
>>>> to it before putting it in the index? It occurred to me we could
>>>> base64 encode it, but maybe there's something simpler that I'm
>>>> missing?
>>>>
>>>> I'm storing the marc record at around line 96 of marc_mapper.rb:
>>>>
>>>>  # _display is stored, but not indexed
>>>>  # don't store a string, store marc21 so we can read it back out
>>>>  # into a MARC::Record object
>>>>  map :marc_display do |rec,index|
>>>>    rec.to_marc
>>>>  end
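
(If base64 encoding does turn out to be the answer - it's floated just
above, and Ross later says it's viable even with the encode/decode
cost - the change would be roughly symmetrical on both sides.  A
sketch only, reusing the map block above and the rsolr calls from the
script:)

  require 'base64'

  # indexing side: store the binary marc21 base64-encoded so nothing in
  # the solr/xml round trip can mangle the bytes
  map :marc_display do |rec,index|
    Base64.encode64(rec.to_marc)
  end

  # retrieval side: decode before handing the string to ruby-marc
  marc_display = response["response"]["docs"][0]["marc_display"]
  record = MARC::Record.new_from_marc(Base64.decode64(marc_display))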
>>>>
>>>> Thanks in advance for any advice,
>>>>
>>>> Bess
>>>
>>
>>
>> --
>> Bill Dueber
>> Library Systems Programmer
>> University of Michigan Library
>>
> _______________________________________________
> Blacklight-development mailing list
> Blacklight-development at rubyforge.org
> http://rubyforge.org/mailman/listinfo/blacklight-development
> Blacklightopac Blog http://blacklightopac.org/


