From goodieboy at gmail.com Wed Apr 2 21:03:04 2008
From: goodieboy at gmail.com (Matt Mitchell)
Date: Wed, 2 Apr 2008 21:03:04 -0400
Subject: [Blacklight-development] UVA Blacklight and Flare update
Message-ID:

I just deployed another version of UVA Blacklight as well as a hefty
update to Flare.

Flare: http://blacklight.rubyforge.org/svn/rails-plugins/flare/trunk/
A lot of the library has been simplified. Some classes and modules have
been merged into a single file.
In the last release, "text filter" meant "query" (as in the Solr world),
and "facet filter" was actually a "filter query" using *_facet fields.
These have changed to "query" and "filter", respectively.
The URL param name "ff" (facet filter) has changed to just "f" (filter).
The URL param name "tf" (text filter) has changed to "q" (query).
The URL param for filter fields now requires the full field name. For
example, in the last release a facet filter for "composition_era_facet"
would be "composition_era". It now needs to be the full
"composition_era_facet". This allows Flare to build filter queries on
any field, not just *_facet fields.
Pagination of facet values is now working!
There are a few new tests.

UVA Blacklight:
http://blacklight.rubyforge.org/svn/branches/uva_lib/trunk/rails/
The search query handlers are all using Dismax. Each UI tab (catalog,
music and semester at sea) has its own Solr request handler with
boosting on the title_text field. We'll tweak this according to
usability/user requirements.
The Z3950 availability requests are now "on click" driven.
The glob of code that was in the Z3950 controller has been extracted
into a neat little class.
Fielded searches are working! These are actually filter queries, not "q"
queries.
The RSS result sets are ordered by a Solr auto-generated timestamp field.
We've switched from Mongrel Cluster to Thin:
http://code.macournoyer.com/thin/ - just experimenting.
Now using a very recent "nightly build" of Solr.

I've had thoughts of turning Flare into a framework-agnostic gem
library. Wrappers for the URL generation helpers would be the hardest
part. Wouldn't it be cool to have Flare running in Merb:
http://merbivore.com, or Camping:
http://code.whytheluckystiff.net/camping - any thoughts on this?

Matt

From rochkind at jhu.edu Thu Apr 3 17:01:38 2008
From: rochkind at jhu.edu (Jonathan Rochkind)
Date: Thu, 03 Apr 2008 17:01:38 -0400
Subject: [Blacklight-development] UVA Blacklight and Flare update
In-Reply-To:
References:
Message-ID: <47F545B2.30308@jhu.edu>

Awesome, thanks for the update. I'm still a bit confused about the
relationship between Blacklight and Flare. If I svn up my Blacklight
stuff, will I get the new Flare stuff too (perhaps through svn
externals), or do I need to do something else to install the new Flare
stuff? Or is Blacklight still using its own separate code and not the
actual trunk Flare?

Jonathan

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu

From goodieboy at gmail.com Fri Apr 4 10:12:46 2008
From: goodieboy at gmail.com (Matt Mitchell)
Date: Fri, 4 Apr 2008 10:12:46 -0400
Subject: [Blacklight-development] UVA Blacklight and Flare update
In-Reply-To: <47F54246.1000806@jhu.edu>
References: <47F54246.1000806@jhu.edu>
Message-ID:

Hi Jonathan,

The plugins are all linked in using SVN externals, which means the code
isn't there until you do a checkout. As soon as you do a checkout, the
SVN client will pull in the data from the external repo. When you do an
svn update, SVN will also update your externals.

The reason I'm doing it this way is that we're really heavy in
development right now: if I'm working on Blacklight but need to make a
change to the externalized Flare code, I can make the change in place
and commit it back into Flare. This will probably change once we start
to solidify the code/API. Right now, for instance, the Flare and
Blacklight demos are most certainly broken, because I've made so many
changes to the Flare core this week. I'll see if I can get them both
running again today or Monday.

Does that help?

Matt

On Thu, Apr 3, 2008 at 4:47 PM, Jonathan Rochkind wrote:

> Awesome, thanks for the update. I'm still a bit confused about the
> relationship between Blacklight and Flare. If I svn up my Blacklight
> stuff, will I get the new Flare stuff too (perhaps through svn
> externals), or do I need to do something else to install the new Flare
> stuff? Or is Blacklight still using its own separate code and not the
> actual trunk Flare?
>
> Jonathan

From rochkind at jhu.edu Thu Apr 10 11:54:48 2008
From: rochkind at jhu.edu (Jonathan Rochkind)
Date: Thu, 10 Apr 2008 11:54:48 -0400
Subject: [Blacklight-development] new blacklight
Message-ID: <47FE3848.1090906@jhu.edu>

So I checked out the brand new Blacklight from svn. Thanks, I do think
it makes sense to have your own effective 'trunk' in the actual svn
trunk.

I like how one command starts both Rails (with Mongrel?) and Solr (is
this Jetty? Are you guys really using Jetty for production?). And thanks
for still supplying a basic README telling me how to get the thing
started and such.

But the new README doesn't include instructions on how to import
anything into Solr/Lucene.
Should I try to adapt the old instructions using the old Ruby import
routines (are those still in svn?), or can you guys provide me with a
couple of sentences to get me started using your new Java import
routines? (Do you still have your small demo set included in the svn?
If not, mind if I add it back?)

One thing I'm going to want to figure out is how to change the MARC
mapping of the new Java import routines. In particular, my local
item-level holdings are going to be in a different field and subfields
than yours (not 999 like yours). Once I get to that point, is that going
to require me to have a Java dev environment/compiler? I guess I'll find
out once I get there.

Thanks for any help you can provide. Sorry I approach this stuff in fits
and starts; I work on it one day a week at most, which means I'm always
having to remember everything I forgot over the past week.

Jonathan

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu

From rh9ec at virginia.edu Thu Apr 10 17:51:16 2008
From: rh9ec at virginia.edu (Robert Haschart)
Date: Thu, 10 Apr 2008 17:51:16 -0400
Subject: [Blacklight-development] new blacklight
Message-ID: <47FE8BD4.3050209@virginia.edu>

Jonathan,

Regarding the new Java-based indexer: I have updated the code stored in
SVN at RubyForge and added a Getting Started document to walk you
through the setup. I also included a set of sample records that you can
use to see how the process works.

If you do an svn update you should see the new source code, the sample
records, and the documentation about the program.
-Bob

From goodieboy at gmail.com Fri Apr 11 11:46:17 2008
From: goodieboy at gmail.com (Matt Mitchell)
Date: Fri, 11 Apr 2008 11:46:17 -0400
Subject: [Blacklight-development] new blacklight
In-Reply-To: <47FE3848.1090906@jhu.edu>
References: <47FE3848.1090906@jhu.edu>
Message-ID:

On Thu, Apr 10, 2008 at 11:54 AM, Jonathan Rochkind wrote:

> I like how one command starts both Rails (with Mongrel?) and Solr (is
> this Jetty? Are you guys really using Jetty for production?).

Jetty is currently being used, yes. That's not the plan for production,
though. I've been getting my hands dirty with JRuby over the past few
months, and we finally have a working JRuby version of Blacklight, which
will more than likely be deployed along with Solr in Tomcat. There are
some small code ports to work out, but for the most part it's all
looking very good.

Matt

From rh9ec at virginia.edu Tue Apr 22 16:36:39 2008
From: rh9ec at virginia.edu (Robert Haschart)
Date: Tue, 22 Apr 2008 16:36:39 -0400
Subject: [Blacklight-development] blacklight MARC importer
In-Reply-To: <698BE1DD-385D-4EEE-89D2-5DF484ACE213@virginia.edu>
References: <763570460804210724n1978169fo270786d41906cc5@mail.gmail.com> <698BE1DD-385D-4EEE-89D2-5DF484ACE213@virginia.edu>
Message-ID: <480E4C57.7040501@virginia.edu>

Jason,

The speed you are seeing is really puzzling to me. On our latest large
indexing run (about 4 million records) I was seeing indexing rates
starting out at around 200 records per second and dropping off to about
150 per second by the end.

Even more puzzling is the fact that you are seeing different numbers of
records using the Java importer as compared to your Ruby version. I'm
really interested in tracking down the cause of this discrepancy.

-Bob Haschart

Bess Sadler wrote:

> Hi, Jason.
>
> I'm copying Bob Haschart on this message since he's our lead on the
> indexer part of Blacklight and I think he can give you the latest
> stats on indexing times and on your discrepancy. I'm also copying the
> blacklight-development mailing list, since I think this is a
> conversation that might be of interest to other folks using Blacklight.
>
> We started moving toward the Java indexer for a couple of reasons,
> mainly speed, but also because we wanted to produce something that
> would be more generally useful to the community than a
> Blacklight-specific indexer would be, and Java seemed to be the way to
> go for that since VuFind was already working on one. The other
> advantage of working on a Java-based indexer is that it becomes easier
> to fold the code back into Solr itself. Ultimately I think the greatest
> service to the library community wanting to use Solr would be for Solr
> to handle MARC records natively, with a well-documented way to
> configure the mappings. I've talked to Erik Hatcher about this a bit
> and he seems to think it would not be all that hard to do, if someone
> who understands the code could make some time to work on it. I'm
> hopeful we can make some time to work on this, especially since we
> have Erik here in town... there's even been talk of an after-work beer
> and coding session. :)
>
> We also considered going the CSV route, and I did some work on that
> for a week or so, but I found that the content of our MARC records was
> so unpredictable that I couldn't find a column separator that was
> guaranteed never to appear inside a data string. Eventually, dealing
> with all the control characters, punctuation, weird spacing, and who
> knows what else drove me nuts and I dropped the project. I still think
> it's a great idea in theory, but I wonder if real-world MARC data can
> ever be sanitized enough to make this work.
>
> Bob, can you give us some stats about indexing times for the current
> Java indexer?
> I know it was about three times faster than our Ruby
> indexer (for indexing ~4 million records, we went from three days to
> less than one day), but there were a lot of factors involved there, so
> I'm not sure it's a clean comparison.
>
> Bess
>
> On Apr 21, 2008, at 10:24 AM, Jason Ronallo wrote:
>
>> Hi, Bess,
>> Just wanted to let you know that I was finally able to put all the
>> pieces together to get the MARC importer working. Strange thing is
>> that the MARC importer reports that it indexed 1_379_299 records while
>> both a Ruby script (with ForgivingReader turned on) and MARCEdit
>> report that there are 1_311_996 records in the file I was indexing. I
>> haven't had a chance to investigate why there might be such a
>> discrepancy, but thought it was a big enough one to let you know.
>>
>> Have you gone to the Java indexer to speed things up? Have you done
>> any benchmarks on your Ruby vs. Java indexers? I hope to get the
>> chance to look at your older Ruby indexer to see what it was doing.
>> Did you send records in batches? I'm just wondering if you happen to
>> know records per second on a large batch. For that large batch through
>> the Java indexer I got approximately 12 records per second.
>>
>> I'm considering writing a Ruby script that spits out a CSV file and
>> then sends that to Solr. I've read that indexing this way ought to be
>> fast. At least I'd have more control of some of the logic, since I'm
>> not ready to learn Java just to write the bits that fall outside of
>> what the MARCImporter can already do.
>>
>> In any case, thanks to you and the other Blacklight developers, who
>> definitely got me started playing with Solr.
>>
>> take care,
>> Jason
>
>
> Elizabeth (Bess) Sadler
> Research and Development Librarian
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904
>
> bess at virginia.edu
> (434) 243-2305

From jronallo at gmail.com Wed Apr 23 08:42:10 2008
From: jronallo at gmail.com (Jason Ronallo)
Date: Wed, 23 Apr 2008 08:42:10 -0400
Subject: [Blacklight-development] blacklight MARC importer
In-Reply-To: <480E4C57.7040501@virginia.edu>
References: <763570460804210724n1978169fo270786d41906cc5@mail.gmail.com> <698BE1DD-385D-4EEE-89D2-5DF484ACE213@virginia.edu> <480E4C57.7040501@virginia.edu>
Message-ID: <763570460804230542m5364eb84m64ac37826610bb38@mail.gmail.com>

On Tue, Apr 22, 2008 at 4:36 PM, Robert Haschart wrote:

> Jason,
>
> The speed you are seeing is really puzzling to me. On our latest large
> indexing run (about 4 million records) I was seeing indexing rates starting
> out at around 200 per second, and at the end dropping off to about 150 per
> second.

I should have known better than to put out a number without giving some
indication that the machine the indexing ran on is underpowered. I'm
guessing that explains it.

> Even more puzzling is the fact that you are seeing different numbers of
> records using the java importer as compared to your ruby version. I'm
> really interested in tracking down the cause of this discrepancy.

This is the one that really made me stop and think. Just to clarify, the
Ruby script I wrote simply counted the records (using ForgivingReader)
and did no indexing, but the count from the Ruby script did match the
count from MARCEdit. I do remember that at one point, when dealing with
these records before, I couldn't use the regular MARC::Reader because it
was throwing errors. In this case at least you can independently test
with the same batch of records.
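
An independent count of this kind does not strictly need a MARC library.
What follows is a minimal sketch, not the actual counting script (which
used ruby-marc's ForgivingReader). It uses only the Ruby standard
library and counts the records in a binary MARC file two ways: by
counting record terminator bytes (0x1D), and by walking the file using
the five-digit record length that opens each 24-byte leader. The method
names and the command-line usage are made up for illustration.

```ruby
# Sketch only: count MARC (ISO 2709) records in a binary file two ways.
# Method names here are hypothetical, not part of any Blacklight code.

# Count records by counting record-terminator bytes (0x1D).
def count_by_terminator(data)
  data.b.count("\x1D")
end

# Count records by trusting the five-digit record length that opens each
# leader. Returns [count, bad_offset]; bad_offset is nil if every length
# field was plausible, otherwise the byte offset where the walk stopped.
def count_by_leader_length(data)
  data = data.b # force binary so stray high bytes cannot raise
  count = 0
  offset = 0
  while offset < data.bytesize
    length_field = data.byteslice(offset, 5)
    return [count, offset] unless length_field.match?(/\A\d{5}\z/)
    length = length_field.to_i
    # A record cannot be shorter than its 24-byte leader or run past EOF.
    return [count, offset] if length < 24 || offset + length > data.bytesize
    count += 1
    offset += length
  end
  [count, nil]
end

if __FILE__ == $PROGRAM_NAME && ARGV[0]
  data = File.binread(ARGV[0]) # reads the whole file; fine for a sketch
  by_length, bad_offset = count_by_leader_length(data)
  puts "by terminator:    #{count_by_terminator(data)}"
  puts "by leader length: #{by_length}"
  puts "walk stopped at byte #{bad_offset}" if bad_offset
end
```

If the two counts differ, or the length-based walk stops early, that
would suggest some leaders carry a length that does not match where the
record actually ends, which may be one reason different readers disagree
about the total.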

The records I was indexing are the Oregon State University and
Washington State University records at archive.org [1] (open_library++
open_data++). One note on IA says the records were originally processed
by MARCEdit.

For the OSU records, MARCImporter reported that 1_379_299 records were
imported, while MARCEdit and a Ruby counting script counted 1_311_996
records. During this batch I did notice some errors about getting the
length of the record, or some such. I'm sorry I didn't copy the error at
the time. Because of these errors I fully expected MARCImporter to
report _fewer_ records than I had counted before, not more.

For the WSU batch, MARCImporter reported indexing 1_576_308 records,
which matched exactly the number reported by the Ruby counting script.
The only change I made with this batch was to give Java less memory. But
when I did a search on the Solr data, it reported that 1_589_574 records
matched on a field in which I had stored the name of the batch of
records. This might be more me misunderstanding how Solr works than
anything else.

I hope this information helps you. Please let me know how else I can
help investigate this problem.

Jason

[1] http://www.archive.org/details/marc_oregon_summit_records
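
One way to cross-check the Solr side of those numbers is to ask Solr for
the match count alone: send a query with rows=0 and read the numFound
value in the response. The sketch below only builds such a URL with the
Ruby standard library. The host, port, and path, and the field name
batch_name, are placeholders chosen for illustration; none of them come
from the setup described above.

```ruby
require "uri"

# Sketch only: build a Solr select URL that asks for the hit count of one
# batch. base_url and the field name are hypothetical placeholders.
def solr_count_url(base_url, field, value)
  # Escape backslashes and double quotes so the value stays one phrase.
  escaped = value.gsub(/["\\]/) { |c| "\\#{c}" }
  params = {
    "q"    => "*:*",                    # match every document
    "fq"   => %(#{field}:"#{escaped}"), # then filter down to one batch
    "rows" => 0                         # return no documents, just the count
  }
  "#{base_url}/select?#{URI.encode_www_form(params)}"
end

puts solr_count_url("http://localhost:8983/solr", "batch_name", "wsu")
```

Comparing numFound for each batch name against the importer's reported
total would show whether the extra documents belong to that batch or
were matched for some other reason.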