From dbalmain.ml at gmail.com Thu Jun 1 00:05:13 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Thu, 1 Jun 2006 13:05:13 +0900 Subject: [Ferret-talk] about WildQuery ! In-Reply-To: <0906c2b97b2b8ec30a59f99d91fbb94f@ruby-forum.com> References: <0906c2b97b2b8ec30a59f99d91fbb94f@ruby-forum.com> Message-ID: On 6/1/06, ferret user wrote: > when i use WildQuery ,i was so slowly!! > the query string like this : 'name|title:*test*' > i search field 'name' and 'title' what include string 'test' > it worked ,but too slow This is not surprising. To run this query Ferret needs to check every single term in the "title" field. > but when i use query string like this : 'name|title:test*' This should be a lot faster. With this query Ferret can scan straight to the point in the index where terms start with the string test and scan through them one by one until they no longer start with test and it's finished. Actually, this query gets optimized to a PrefixQuery. > or 'name|title:*test' it worked fast I'm quite surprised this is much faster. I'd expect it to be a little faster but not by much. Maybe WildCard query can be optimized a little. I'll look into it but don't expect much. The index isn't really built for this type of query. Do you need to run this query often? What exactly are you trying to do? I might be able to come up with a better way to do it. Cheers, Dave From sunshine82 at yeah.net Thu Jun 1 00:17:22 2006 From: sunshine82 at yeah.net (ferret user) Date: Thu, 1 Jun 2006 06:17:22 +0200 Subject: [Ferret-talk] about WildQuery ! In-Reply-To: References: <0906c2b97b2b8ec30a59f99d91fbb94f@ruby-forum.com> Message-ID: i just want search a substring like : search 'te' in 'ddte fssfsf' but i cannot find a better way~ thanks -- Posted via http://www.ruby-forum.com/. From dbalmain.ml at gmail.com Thu Jun 1 00:42:04 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Thu, 1 Jun 2006 13:42:04 +0900 Subject: [Ferret-talk] about WildQuery ! 
In-Reply-To:
References: <0906c2b97b2b8ec30a59f99d91fbb94f@ruby-forum.com>
Message-ID:

On 6/1/06, ferret user wrote:
> i just want search a substring
> like : search 'te' in 'ddte fssfsf'
> but i cannot find a better way~
> thanks

If you just want to be able to search for arbitrary substrings like this, then WildQuery is your best option for now. Are you on Windows? If you are, it'll be a lot faster once I finally get the extension compiled for Windows. If you are already using the C extension then the speed you see will probably be the best you can get.

If you are running this query a lot you have a couple of options. For shorter strings such as titles and names you could write a custom Analyzer which will break up a term like this:

    fastest search -> (fastest astest stest test est) (search earch arch rch)

where the brackets represent one term position. Now you can do test* instead of *test* or ear* instead of *ear*. In the example I stopped at strings of length 3 but you can keep going if you want to search for *t*. This will make the index considerably larger, so if you are searching really large strings then you should probably read up about Suffix Arrays.

Cheers,
Dave

PS: look at lib/ferret/analysis/analyzers.rb to see how to implement a custom analyzer.

From sunshine82 at yeah.net Thu Jun 1 00:57:36 2006
From: sunshine82 at yeah.net (ferret user)
Date: Thu, 1 Jun 2006 06:57:36 +0200
Subject: [Ferret-talk] about WildQuery !
In-Reply-To:
References: <0906c2b97b2b8ec30a59f99d91fbb94f@ruby-forum.com>
Message-ID:

Thanks David Balmain
i try using 'test*' instead of '*test*'
thanks a lot

--
Posted via http://www.ruby-forum.com/.
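The suffix-splitting idea from Dave's reply can be sketched in plain Ruby. This is only an illustration of the term expansion itself, not a Ferret Analyzer: the helper names `suffix_terms` and `suffix_positions` and the `min_len` parameter are made up for this example, and wiring the output into a real custom analyzer (see lib/ferret/analysis/analyzers.rb) is left out.

```ruby
# Hypothetical helper, not part of Ferret: returns every suffix of a word
# that is at least min_len characters long, longest first.
def suffix_terms(word, min_len = 3)
  # Words no longer than min_len are kept whole.
  return [word] if word.length <= min_len
  (0..word.length - min_len).map { |i| word[i..-1] }
end

# Hypothetical helper: one group of suffixes per whitespace-separated word,
# mirroring the bracketed term positions in the example above.
def suffix_positions(text, min_len = 3)
  text.split.map { |word| suffix_terms(word, min_len) }
end

suffix_positions("fastest search").each { |terms| puts terms.join(" ") }
# prints:
#   fastest astest stest test est
#   search earch arch rch
```

With terms indexed this way, a substring search such as *test* can be run as the prefix query test*, at the cost of roughly one extra term per character of each word.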
From tom.oristian at gmail.com Thu Jun 1 03:35:18 2006 From: tom.oristian at gmail.com (Tom On) Date: Thu, 1 Jun 2006 09:35:18 +0200 Subject: [Ferret-talk] how to index the result of any instance method In-Reply-To: <20060523074030.GA16446@cordoba.webit.de> References: <7b42c730d49cce49b9dd38ba16909019@ruby-forum.com> <562a35c10605220037g4c63caaeq11d01e2a1da5f89c@mail.gmail.com> <20060522092141.GF26544@cordoba.webit.de> <19f81dea6b7169a7c199b676950569d9@ruby-forum.com> <20060523074030.GA16446@cordoba.webit.de> Message-ID: Jens Kraemer wrote: > On Tue, May 23, 2006 at 08:42:10AM +0200, Tom wrote: >> > class MyModel < ActiveRecord::Base >> >> Now I've got the approach figured out, but I seem to be having problems >> yet. It seems that my full_text method is not actually being indexed. >> In fact, I've placed a breakpoint inside the method and it seems that >> it's never even being called. Meanwhile, Ferret still manages to update >> index with every new instance of MyModel, but without the full_text >> value. I also placed a breakpoint in >> vendor/plugins/acts_as_ferret/rebuild_index.rb and it appears that IT is >> never called when a new model instance is created. Any thoughts? > > What version of acts_as_ferret do you use ? Could you try to upgrade > from svn ? rebuild_index.rb has been removed some time ago as it is > obsolete. > > Jens > > -- > webit! Gesellschaft f?r neue Medien mbH www.webit.de > Dipl.-Wirtschaftsingenieur Jens Kr?mer kraemer at webit.de > Schnorrstra?e 76 Tel +49 351 46766 0 > D-01069 Dresden Fax +49 351 46766 66 Okay, thanks. I've got it working now for simple text files. Can anybody share any experience/opinion they have on using Ruby to process and index/search Microsoft documents and PDFs ??? Thanks for any help. -- Posted via http://www.ruby-forum.com/. 
From alex at blackkettle.org Thu Jun 1 03:46:38 2006 From: alex at blackkettle.org (Alex Young) Date: Thu, 01 Jun 2006 08:46:38 +0100 Subject: [Ferret-talk] Windows progress Message-ID: <447E9B5E.9080009@blackkettle.org> Hi there, What's the current status of the Windows port? I may be in a position to lend a hand over the next couple of weeks - where should I start looking? And what's the best way to get SVN HEAD? This happens: $ svn checkout svn://www.davebalmain.com/ferret/trunk ferret svn: Can't connect to host 'www.davebalmain.com': Connection refused -- Alex From kraemer at webit.de Thu Jun 1 04:24:12 2006 From: kraemer at webit.de (Jens Kraemer) Date: Thu, 1 Jun 2006 10:24:12 +0200 Subject: [Ferret-talk] how to index the result of any instance method In-Reply-To: References: <7b42c730d49cce49b9dd38ba16909019@ruby-forum.com> <562a35c10605220037g4c63caaeq11d01e2a1da5f89c@mail.gmail.com> <20060522092141.GF26544@cordoba.webit.de> <19f81dea6b7169a7c199b676950569d9@ruby-forum.com> <20060523074030.GA16446@cordoba.webit.de> Message-ID: <20060601082412.GD27209@cordoba.webit.de> On Thu, Jun 01, 2006 at 09:35:18AM +0200, Tom On wrote: > > Okay, thanks. I've got it working now for simple text files. Can > anybody share any experience/opinion they have on using Ruby to process > and index/search Microsoft documents and PDFs ??? Thanks for any help. In RDig I use the wvText and pdftotext to extract textual content from word and pdf documents. Imho there is no Ruby lib yet to do this. Jens -- webit! 
Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

From dbalmain.ml at gmail.com Thu Jun 1 04:38:23 2006
From: dbalmain.ml at gmail.com (David Balmain)
Date: Thu, 1 Jun 2006 17:38:23 +0900
Subject: [Ferret-talk] how to index the result of any instance method
In-Reply-To: <20060601082412.GD27209@cordoba.webit.de>
References: <7b42c730d49cce49b9dd38ba16909019@ruby-forum.com> <562a35c10605220037g4c63caaeq11d01e2a1da5f89c@mail.gmail.com> <20060522092141.GF26544@cordoba.webit.de> <19f81dea6b7169a7c199b676950569d9@ruby-forum.com> <20060523074030.GA16446@cordoba.webit.de> <20060601082412.GD27209@cordoba.webit.de>
Message-ID:

On 6/1/06, Jens Kraemer wrote:
> On Thu, Jun 01, 2006 at 09:35:18AM +0200, Tom On wrote:
> >
> > Okay, thanks. I've got it working now for simple text files. Can
> > anybody share any experience/opinion they have on using Ruby to process
> > and index/search Microsoft documents and PDFs ??? Thanks for any help.
>
> In RDig I use the wvText and pdftotext to extract textual content from word
> and pdf documents. Imho there is no Ruby lib yet to do this.

I second that. I've tried the Ruby pdf reader alternatives on RAA without much luck. If anyone knows a good pdf reading opensource C library I'd be happy to write some bindings. But I think wvText and pdftotext are your best options right now.

From dbalmain.ml at gmail.com Thu Jun 1 05:18:47 2006
From: dbalmain.ml at gmail.com (David Balmain)
Date: Thu, 1 Jun 2006 18:18:47 +0900
Subject: [Ferret-talk] Windows progress
In-Reply-To: <447E9B5E.9080009@blackkettle.org>
References: <447E9B5E.9080009@blackkettle.org>
Message-ID:

On 6/1/06, Alex Young wrote:
> Hi there,
>
> What's the current status of the Windows port? I may be in a position
> to lend a hand over the next couple of weeks - where should I start
> looking?

Hi Alex,

Thanks for your interest.
I got Ferret to compile with Visual Studio Express 2005. Unfortunately you currently need to use Visual C 6 to create Ruby bindings. This proved a lot more difficult so I decided to take a different route. Marvin Humphrey (author of KinoSearch, a Perl port of Lucene) and I are about to start a new project at Apache called Lucy (http://wiki.apache.org/jakarta-lucene/LucyProposal) which will aim to create a C port of Lucene that can be used as a backend in all dynamic languages. This time around, portability will be a much higher priority.

Lucy may or may not one day become the back end to Ferret. At the same time I'm experimenting with some different options using the Ferret codebase. Now that Lucy is happening I'm not going to worry about Lucene index compatibility (which was still a long way off in Ferret due to Java's modified UTF-8 encoding). This experimental code is in:

svn://www.davebalmain.com/exp

This code is much more portable and will compile with VC6. So if you want a Windows port quickly you can try merging this code back into Ferret proper. Or if you are really interested in the library's internals you could join me working on this experimental code or join Marvin and me on the Lucy project (still waiting on Apache approval). Whichever route you choose, your help will be most appreciated. Let me know your thoughts.

Cheers,
Dave

> And what's the best way to get SVN HEAD? This happens:
> $ svn checkout svn://www.davebalmain.com/ferret/trunk ferret
> svn: Can't connect to host 'www.davebalmain.com': Connection refused

Sorry about that. Subversion is up and running again.
From alex at blackkettle.org Thu Jun 1 07:27:49 2006 From: alex at blackkettle.org (Alex Young) Date: Thu, 01 Jun 2006 12:27:49 +0100 Subject: [Ferret-talk] Windows progress In-Reply-To: References: <447E9B5E.9080009@blackkettle.org> Message-ID: <447ECF35.8070500@blackkettle.org> David Balmain wrote: > On 6/1/06, Alex Young wrote: >> Hi there, >> >> What's the current status of the Windows port? I may be in a position >> to lend a hand over the next couple of weeks - where should I start >> looking? > > Hi Alex, > > Thanks for your interest. I got Ferret to compile with Visual Studio > Express 2005. Unfortunately you currently need to use Visual C 6 to > create Ruby bindings. A few groups have been bitten by this. I believe this is something Curt Hibbs is going to be addressing with the next One-Click Installer. I don't know if you've been following ruby-lang, but there are noises to move over to a mingw32 build instead of a VC6, which would sort a *lot* of things out. If that ends up happening, extension building on Windows will get much simpler. As far as I know, the OCI only uses VC6 because it was believed at the time that it would be compatible with mingw32 extensions. For my purposes, I don't especially mind building my own Ruby to make Ferret compatible with it, but I can see that approach may not have too many adherents :-) Do you see any reason why that wouldn't work with the current Ferret source? Would that not be the shortest path to getting it working? > This proved a lot more difficult so I decided to > take a different route. Marvin Humphrey (author of KinoSearch, a perl > port of lucene) and I are about to start a new project at Apache > called Lucy (http://wiki.apache.org/jakarta-lucene/LucyProposal) which > will aim to create a C port of Lucene that can be used as a backend in > all dynamic languages. This time around, portability will be a much > higher priority. 
I'm sure you've considered this, but what does that add compared to a GCJ+SWIG approach, as with PyLucene? Without having looked at it, is there anything which prevents that method from being applied to Ruby? > Lucy may or may not one day become the back end to Ferret. At the same > time I'm experimenting with some different options using the Ferret > codebase. Now that Lucy is happening I'm not going to worry about > Lucene index compatibility (which was currently still a long way off > in Ferret due to Java's modified UTF-8 encoding). This experimental > code is in; > > svn://www.davebalmain.com/exp > > This code is much more portable and will compile with VC6. So if you > want a Windows port quickly you can try merging this code back into > Ferret propper. Or if you are really interested in the libraries > internals you could join me working on this experimental code or join > Marvin and I on the Lucy project (still waiting on Apache approval). > Whichever route you chose your help will be most appreciated. Let me > know your thoughts. From my personal point of view, I'm most interested in having the same codebase work fast on both Linux and Windows, and, like I say, I don't mind rebuilding Ruby to do it. Right now, I'd be most interested in patching the current cFerret to work under mingw32, unless you know of any reasons that's just not going to work. I'll certainly take a look at the new code and see if there's anything I can usefully add there, too. Thanks, -- Alex From dbalmain.ml at gmail.com Thu Jun 1 09:15:51 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Thu, 1 Jun 2006 22:15:51 +0900 Subject: [Ferret-talk] Windows progress In-Reply-To: <447ECF35.8070500@blackkettle.org> References: <447E9B5E.9080009@blackkettle.org> <447ECF35.8070500@blackkettle.org> Message-ID: On 6/1/06, Alex Young wrote: > David Balmain wrote: > > On 6/1/06, Alex Young wrote: > >> Hi there, > >> > >> What's the current status of the Windows port? 
I may be in a position
> >> to lend a hand over the next couple of weeks - where should I start
> >> looking?
> >
> > Hi Alex,
> >
> > Thanks for your interest. I got Ferret to compile with Visual Studio
> > Express 2005. Unfortunately you currently need to use Visual C 6 to
> > create Ruby bindings.
> A few groups have been bitten by this. I believe this is something Curt
> Hibbs is going to be addressing with the next One-Click Installer. I
> don't know if you've been following ruby-lang, but there are noises to
> move over to a mingw32 build instead of a VC6, which would sort a *lot*
> of things out. If that ends up happening, extension building on Windows
> will get much simpler. As far as I know, the OCI only uses VC6 because
> it was believed at the time that it would be compatible with mingw32
> extensions.

Actually the main reason I haven't finished porting to Windows yet is that it seemed like too much work if the one-click installer is going to change to mingw32 anyway. I hope it happens soon.

> For my purposes, I don't especially mind building my own Ruby to make
> Ferret compatible with it, but I can see that approach may not have too
> many adherents :-) Do you see any reason why that wouldn't work with
> the current Ferret source? Would that not be the shortest path to
> getting it working?

Yes, this would probably be the shortest path to get it working. Plus you'll have much better locale support (i.e. UTF-8 support).

> > This proved a lot more difficult so I decided to
> > take a different route. Marvin Humphrey (author of KinoSearch, a perl
> > port of lucene) and I are about to start a new project at Apache
> > called Lucy (http://wiki.apache.org/jakarta-lucene/LucyProposal) which
> > will aim to create a C port of Lucene that can be used as a backend in
> > all dynamic languages. This time around, portability will be a much
> > higher priority.
> I'm sure you've considered this, but what does that add compared to a > GCJ+SWIG approach, as with PyLucene? Without having looked at it, is > there anything which prevents that method from being applied to Ruby? It can be done but it's still a lot of work and I just didn't feel up to the task. Plus we get better performance this way with a much smaller download. > > Lucy may or may not one day become the back end to Ferret. At the same > > time I'm experimenting with some different options using the Ferret > > codebase. Now that Lucy is happening I'm not going to worry about > > Lucene index compatibility (which was currently still a long way off > > in Ferret due to Java's modified UTF-8 encoding). This experimental > > code is in; > > > > svn://www.davebalmain.com/exp > > > > This code is much more portable and will compile with VC6. So if you > > want a Windows port quickly you can try merging this code back into > > Ferret propper. Or if you are really interested in the libraries > > internals you could join me working on this experimental code or join > > Marvin and I on the Lucy project (still waiting on Apache approval). > > Whichever route you chose your help will be most appreciated. Let me > > know your thoughts. > From my personal point of view, I'm most interested in having the same > codebase work fast on both Linux and Windows, and, like I say, I don't > mind rebuilding Ruby to do it. Right now, I'd be most interested in > patching the current cFerret to work under mingw32, unless you know of > any reasons that's just not going to work. I'll certainly take a look > at the new code and see if there's anything I can usefully add there, too. Have fun. I don't think it'll be too much work getting it to compile under mingw32. I guess we'll see. 
Cheers, Dave From Pedro.CorteReal at iantt.pt Thu Jun 1 11:09:00 2006 From: Pedro.CorteReal at iantt.pt (Pedro =?ISO-8859-1?Q?C=F4rte-Real?=) Date: Thu, 01 Jun 2006 16:09:00 +0100 Subject: [Ferret-talk] Using acts_as_ferret outside rails Message-ID: <1149174540.9218.7.camel@localhost.localdomain> How do I use a model that has "acts_as_ferret" defined outside rails? I've tried adding this to the top of my model: require "#{RAILS_ROOT}/vendor/plugins/acts_as_ferret/lib/acts_as_ferret.rb" and then defining RAILS_ROOT and RAILS_ENV in my app that uses the model outside rails. That got me this exception: ../vendor/plugins/acts_as_ferret/lib/acts_as_ferret.rb:465:in `ferret_create': undefined method `debug' for nil:NilClass (NoMethodError) Anyone know what's the correct way to do this? Thanks, Pedro. From Pedro.CorteReal at iantt.pt Thu Jun 1 11:59:30 2006 From: Pedro.CorteReal at iantt.pt (Pedro =?ISO-8859-1?Q?C=F4rte-Real?=) Date: Thu, 01 Jun 2006 16:59:30 +0100 Subject: [Ferret-talk] Using acts_as_ferret outside rails In-Reply-To: <1149174540.9218.7.camel@localhost.localdomain> References: <1149174540.9218.7.camel@localhost.localdomain> Message-ID: <1149177570.9218.10.camel@localhost.localdomain> On Thu, 2006-06-01 at 16:09 +0100, Pedro C?rte-Real wrote: > How do I use a model that has "acts_as_ferret" defined outside rails? > I've tried adding this to the top of my model: > > require > "#{RAILS_ROOT}/vendor/plugins/acts_as_ferret/lib/acts_as_ferret.rb" > > and then defining RAILS_ROOT and RAILS_ENV in my app that uses the model > outside rails. That got me this exception: > > ../vendor/plugins/acts_as_ferret/lib/acts_as_ferret.rb:465:in > `ferret_create': undefined method `debug' for nil:NilClass > (NoMethodError) > > Anyone know what's the correct way to do this? I found a way to do it. 
At the start of my script: require File.dirname(__FILE__) + '/../config/boot' ENV["RAILS_ENV"] = 'development' require RAILS_ROOT + '/config/environment' After that I can use every model as if I was inside rails. I copied most of this from the code for script/runner. Cheers, Pedro. From marvin at rectangular.com Thu Jun 1 14:00:35 2006 From: marvin at rectangular.com (Marvin Humphrey) Date: Thu, 1 Jun 2006 11:00:35 -0700 Subject: [Ferret-talk] Windows progress In-Reply-To: References: <447E9B5E.9080009@blackkettle.org> <447ECF35.8070500@blackkettle.org> Message-ID: <763AC6A3-E8B7-4342-9498-22E146065CAF@rectangular.com> On Jun 1, 2006, at 6:15 AM, David Balmain wrote: > >>> This proved a lot more difficult so I decided to >>> take a different route. Marvin Humphrey (author of KinoSearch, a >>> perl >>> port of lucene) and I are about to start a new project at Apache >>> called Lucy (http://wiki.apache.org/jakarta-lucene/LucyProposal) >>> which >>> will aim to create a C port of Lucene that can be used as a >>> backend in >>> all dynamic languages. This time around, portability will be a much >>> higher priority. >> I'm sure you've considered this, but what does that add compared to a >> GCJ+SWIG approach, as with PyLucene? Without having looked at it, is >> there anything which prevents that method from being applied to Ruby? > > It can be done but it's still a lot of work and I just didn't feel up > to the task. Plus we get better performance this way with a much > smaller download. Java Lucene is built on the assumption, quite reasonable for Java as a compiled language[1], that method calls are cheap and object creation and destruction are cheap. The fact that they are much more expensive in an interpreted language is the main reason the pure-Perl port of Lucene, Plucene, runs so slowly (). 
Lack of access to primitive data types such as int is another reason, but it's actually not that great a factor compared to the OO overhead (I did extensive hacking on Plucene before deciding I had no choice but to start from scratch, and rewriting the IO classes in C didn't help as much as anyone expected). Presumably similar factors are at work slowing down the pure-Ruby Ferret. The OO overhead problems are mitigated by going the GCJ route, but not eliminated. Say you want to subclass Analyzer -- which most significant deployments of Lucene will want to do eventually. The way a TokenStream works in Lucene, several method calls are required for each and every token -- one for each Analyzer the token passes through. That gets extremely expensive in an interpreted language. Furthermore, none of Perl's native string manipulation tools work with UTF-16 strings. So if you wanted to, say, insert a custom Perl TokenFilter into a Lucene Analysis chain, you'd have to translate between UTF-8 and UTF-16 each time you cross the Perl/Java boundary, making the TokenStream concept a double disaster. An alternate way of processing Tokens is to have each link in the Analyzer chain accept a "TokenBatch" instead of a TokenStream: an array of Tokens, rather than a stream of Tokens. That way, each Analyzer can iterate over all the Tokens in a tight loop, either natively or in C. The downside of this technique is that it's not possible to feed it directly from a filehandle/Reader, but that's small potatoes. It would be possible to graft the TokenBatch concept onto a GCJ'd Lucene: create a native full analysis chain which spits out a TokenBatch, then have the TokenBatch pretend it's a TokenStream, feeding Tokens to Lucene using a C version of next(). That would perform OK -- but you couldn't ever mix and match Java Lucene Analyzers with native Analyzers, only prepend the native onto the front. 
Therefore, you'd have to rewrite the entire org.apache.lucene.analysis package anyway -- it's the only way you're going to get both full flexibility and performance. And once you've started down the path of rewriting large portions of Lucene, it's hard to see why you'd put up with the headache of the GCJ approach.

There are many other areas where Lucene's architecture is poorly suited for use with an interpreted language. Dave has solved those problems mainly by rewriting the whole thing in C. KinoSearch has taken that approach in some cases, but more often than Ferret, it uses modified algorithms instead. TokenBatch is one example; the best one, which is harder to explain here, is how KinoSearch merges together inverted documents during indexing. (In summary, it's faster, simpler, and requires far, far fewer objects.)

It would be possible to port some of these algorithm changes to Lucene, but they would be pretty disruptive. Lucene's a mature, heavily-used library and changing anything at all requires a lot of consideration. Some of the changes I would like to see, I don't think I could lobby for in good conscience. The bytecounts-as-string-headers patch is a good example. For Ferret and KinoSearch its adoption would yield a very significant benefit, as it would open the door to using Luke to browse indexes. For Java Lucene, though, it can only be justified by further changes which build upon it.

The downside of the full-port approach that Dave and I have taken is that it's a lot of work to build and maintain. However, we've already done the vast majority of the up-front work once. Re-doing it for Lucy will be a cakewalk in comparison. The maintenance problem that KinoSearch and Ferret currently face, we're addressing by sharing the C core. We would not be surprised if others join us -- I know of at least one other person who rewrote Lucene in C: Robert Kirchgessner, who did a partial PHP/C port.
Heck, it will presumably be easier to maintain a Python port against Lucy than against GCJ'd Lucene, provided that we achieve what we've set out to achieve.

The only question remaining, I think, is whether the project will actually be hosted at Apache. When Dave and I approached Doug Cutting about it, he specifically requested that development take place there -- before Dave or I had had a chance to indicate that that was our preference as well. However, we've been waiting for approval by the Lucene PMC for a couple of weeks now, and I'm not sure it's coming. I'm guessing that Erik "One Lucene To Rule Them All" Hatcher hasn't cast his +1. ;) IMO, it would be best for everybody if we did this within the Lucene family, but we'll just have to see.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] What constitutes a compiled vs. a dynamic language is debatable -- see . It might be more accurate to describe Java as a "more compiled" language.

From marvin at rectangular.com Thu Jun 1 14:32:01 2006
From: marvin at rectangular.com (Marvin Humphrey)
Date: Thu, 1 Jun 2006 11:32:01 -0700
Subject: [Ferret-talk] Windows progress
In-Reply-To: <763AC6A3-E8B7-4342-9498-22E146065CAF@rectangular.com>
References: <447E9B5E.9080009@blackkettle.org> <447ECF35.8070500@blackkettle.org> <763AC6A3-E8B7-4342-9498-22E146065CAF@rectangular.com>
Message-ID: <8FF53956-3E1A-497F-91CD-AA61A0885EE3@rectangular.com>

On Jun 1, 2006, at 11:00 AM, Marvin Humphrey wrote:
> IMO, it would be best for everybody
> if we did this within the Lucene family,

... and that's what's going to happen. I just got an email from Doug. We're good to go.

Thank you, Lucene PMC.
:) Marvin Humphrey Rectangular Research http://www.rectangular.com/ From erik at ehatchersolutions.com Thu Jun 1 19:34:24 2006 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Thu, 1 Jun 2006 19:34:24 -0400 Subject: [Ferret-talk] Windows progress In-Reply-To: <8FF53956-3E1A-497F-91CD-AA61A0885EE3@rectangular.com> References: <447E9B5E.9080009@blackkettle.org> <447ECF35.8070500@blackkettle.org> <763AC6A3-E8B7-4342-9498-22E146065CAF@rectangular.com> <8FF53956-3E1A-497F-91CD-AA61A0885EE3@rectangular.com> Message-ID: You're welcome! I'm looking forward to Ruby Lucene goodness!!! Erik On Jun 1, 2006, at 2:32 PM, Marvin Humphrey wrote: > > On Jun 1, 2006, at 11:00 AM, Marvin Humphrey wrote: > >> IMO, it would be best for everybody >> if we did this within the Lucene family, > > ... and that what's going to happen. I just got an email from Doug. > We're good to go. > > Thank you, Lucene PMC. > > :) > > Marvin Humphrey > Rectangular Research > http://www.rectangular.com/ > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk From wemitchell at gmail.com Fri Jun 2 15:16:30 2006 From: wemitchell at gmail.com (William Mitchell) Date: Fri, 2 Jun 2006 21:16:30 +0200 Subject: [Ferret-talk] Indexing fails -- _ntc6.tmp exceeds 2 gigabyte maximum Message-ID: <4f70e224fda08440ae91bba7746fa2b8@ruby-forum.com> Ferret 0.9.3 Ruby 1.8.2 NOT storing file contents in the index. Only indexing first 25k of each file. Very large data set (1 million files, 350 Gb) Code based on snippet from David Balmain's forum posts. After 6 hours, Ferret bails out with Ruby "exceeds max file size". 
Cache:

-rw-r--r-- 1 bill bill 2147483647 2006-06-01 22:45 _ntc6.tmp
-rw-r--r-- 1 bill bill 1690862924 2006-06-01 22:42 _ntc6.prx
-rw-r--r-- 1 bill bill  646302802 2006-06-01 22:42 _ntc6.frq
-rw-r--r-- 1 bill bill  165561698 2006-06-01 22:42 _ntc6.tis
-rw-r--r-- 1 bill bill   50541430 2006-06-01 22:14 _ntc6.fdt
-rw-r--r-- 1 bill bill    8000000 2006-06-01 22:14 _ntc6.fdx
-rw-r--r-- 1 bill bill    2097842 2006-06-01 22:42 _ntc6.tii
-rw-r--r-- 1 bill bill    1000000 2006-06-01 22:42 _ntc6.f0
-rw-r--r-- 1 bill bill    1000000 2006-06-01 22:42 _ntc6.f1
-rw-r--r-- 1 bill bill         30 2006-06-01 22:42 segments
-rw-r--r-- 1 bill bill         16 2006-06-01 22:14 _ntc6.fnm

Code:

#------------
index = Index::Index.new(:path => "/var/cache/ferrets")
max_file_length = 25000
Dir.glob(allfiles).each do |file|
  doc = Document::Document.new()
  doc << Document::Field.new(:file, file,
                             Document::Field::Store::YES,
                             Document::Field::Index::UNTOKENIZED)
  doc << Document::Field.new(:content, IO.read(file, max_file_length),
                             Document::Field::Store::NO,
                             Document::Field::Index::TOKENIZED)
  index << doc
end
#------------

Is there a workaround, or is this exceeding Ferret's limits?

Thanks!

By the way, retrieval is usably fast for my purposes, even on a big index like this. Very impressive.

--
Posted via http://www.ruby-forum.com/.
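One detail in the listing above is worth spelling out: the size of _ntc6.tmp, 2147483647 bytes, is exactly 2**31 - 1, the largest value a signed 32-bit integer can hold. That suggests, though nothing in the message confirms it, that a signed 32-bit file offset is involved somewhere. The sketch below only does that arithmetic and looks for files that have reached the limit; `MAX_SIGNED_32` and `files_at_limit` are names invented for this example and are not part of Ferret.

```ruby
# Largest value a signed 32-bit integer can hold. A file offset stored in
# such a type cannot address anything beyond this byte count.
MAX_SIGNED_32 = 2**31 - 1

# Hypothetical helper, not part of Ferret: returns the regular files directly
# inside dir whose size has reached the given limit.
def files_at_limit(dir, limit = MAX_SIGNED_32)
  Dir.glob(File.join(dir, "*")).select do |path|
    File.file?(path) && File.size(path) >= limit
  end
end

puts MAX_SIGNED_32   # 2147483647
# Index path taken from the post; an empty list is printed if it is absent.
puts files_at_limit("/var/cache/ferrets").inspect
```

Run against the index directory from the post, this would be expected to report _ntc6.tmp and nothing else, since every other file in the listing is well under the limit.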
From carmen at whats-your.name Fri Jun 2 17:39:28 2006
From: carmen at whats-your.name (carmen)
Date: Fri, 2 Jun 2006 23:39:28 +0200
Subject: [Ferret-talk] acts_as_ferret 0.2.1 segfault
In-Reply-To: <8bc6d8730605240533l1d3e869cxc7063234e08d6d01@mail.gmail.com>
References: <8bc6d8730605231105u11c22d94v94a6ced772915b3@mail.gmail.com> <562a35c10605231236x7316e44ayf37c3bb47b0e49a@mail.gmail.com> <562a35c10605231247q692be2ees6aeb84b520e5b5b9@mail.gmail.com> <8bc6d8730605231322h2e6b2e43t9a91cfd400aaa9ea@mail.gmail.com> <562a35c10605231514y2da58adfs717bdf552270cd70@mail.gmail.com> <1b45d5a7fa8ce7cab615c8d30ca10b14@ruby-forum.com> <562a35c10605232324v4e9e5320n55c132067203534@mail.gmail.com> <8bc6d8730605240533l1d3e869cxc7063234e08d6d01@mail.gmail.com>
Message-ID:

John Andrews wrote:
> I followed Jordan's steps and the tests all pass now. Thanks Jordan!
> Thanks Jan!

i get this error as well, with the trunk of acts_as_ferret, and whatever ferret was installed via gems (0.9.3 i think). watching that install, it built the C extensions, and the makefile had the systemwide defaults -O2 -pipe. checked out the trunk of ferret, and it refuses to build the C extensions due to missing header errors (headers from its own files no less) but at least that solves the problem, since the segfault is clearly coming from the C code somewhere. obviously that doesn't help narrow it down much. thanks for the tips anyways ..

--
Posted via http://www.ruby-forum.com/.
From carmen at whats-your.name Fri Jun 2 17:46:26 2006 From: carmen at whats-your.name (carmen) Date: Fri, 2 Jun 2006 23:46:26 +0200 Subject: [Ferret-talk] acts_as_ferret 0.2.1 segfault In-Reply-To: References: <8bc6d8730605231105u11c22d94v94a6ced772915b3@mail.gmail.com> <562a35c10605231236x7316e44ayf37c3bb47b0e49a@mail.gmail.com> <562a35c10605231247q692be2ees6aeb84b520e5b5b9@mail.gmail.com> <8bc6d8730605231322h2e6b2e43t9a91cfd400aaa9ea@mail.gmail.com> <562a35c10605231514y2da58adfs717bdf552270cd70@mail.gmail.com> <1b45d5a7fa8ce7cab615c8d30ca10b14@ruby-forum.com> <562a35c10605232324v4e9e5320n55c132067203534@mail.gmail.com> <8bc6d8730605240533l1d3e869cxc7063234e08d6d01@mail.gmail.com> Message-ID: <6e155f00031080796c2cf466e05f12b6@ruby-forum.com> the C extension is littered with these warnings: warning: cast to pointer from integer of different size which can be harmless, or not, but ive definitely seen these warnings gradually disappear from other stuff after running gentoo-amd64 for a while.. gdb just does this: Program received signal SIGSEGV, Segmentation fault. Cannot remove breakpoints because program is no longer writable. It might be running in another process. Further execution is probably impossible. 0x00002b670aad0337 in ?? () )too lazy to compile ruby with -g to see if its more informative..( cheers -- Posted via http://www.ruby-forum.com/. 
From dbalmain.ml at gmail.com Fri Jun 2 19:20:31 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 3 Jun 2006 08:20:31 +0900 Subject: [Ferret-talk] acts_as_ferret 0.2.1 segfault In-Reply-To: References: <8bc6d8730605231105u11c22d94v94a6ced772915b3@mail.gmail.com> <562a35c10605231236x7316e44ayf37c3bb47b0e49a@mail.gmail.com> <562a35c10605231247q692be2ees6aeb84b520e5b5b9@mail.gmail.com> <8bc6d8730605231322h2e6b2e43t9a91cfd400aaa9ea@mail.gmail.com> <562a35c10605231514y2da58adfs717bdf552270cd70@mail.gmail.com> <1b45d5a7fa8ce7cab615c8d30ca10b14@ruby-forum.com> <562a35c10605232324v4e9e5320n55c132067203534@mail.gmail.com> <8bc6d8730605240533l1d3e869cxc7063234e08d6d01@mail.gmail.com> Message-ID: On 6/3/06, carmen wrote: > John Andrews wrote: > > I followed Jordan's steps and the tests all pass now. Thanks Jordan! > > Thanks Jan! > > i get this error as well, with the trunk of acts_as_ferret, and whatever > ferret was installed via gems (0.9.3 i think). watching that install, it > built the C extensions, and the makefile had the systemwide defaults -O2 > -pipe. checked out the trunk of ferret, and it refuses to build the C > extensions due to missing header errors (headers from its own files no > less) but at least thats solves the problem, since the segfault is > clearly coming from the C code somewhere. You need to run `rake ext` to copy the missing headers to the right place. This won't help you with your segfault though. 
From dbalmain.ml at gmail.com Fri Jun 2 19:22:04 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 3 Jun 2006 08:22:04 +0900 Subject: [Ferret-talk] acts_as_ferret 0.2.1 segfault In-Reply-To: <6e155f00031080796c2cf466e05f12b6@ruby-forum.com> References: <8bc6d8730605231105u11c22d94v94a6ced772915b3@mail.gmail.com> <562a35c10605231247q692be2ees6aeb84b520e5b5b9@mail.gmail.com> <8bc6d8730605231322h2e6b2e43t9a91cfd400aaa9ea@mail.gmail.com> <562a35c10605231514y2da58adfs717bdf552270cd70@mail.gmail.com> <1b45d5a7fa8ce7cab615c8d30ca10b14@ruby-forum.com> <562a35c10605232324v4e9e5320n55c132067203534@mail.gmail.com> <8bc6d8730605240533l1d3e869cxc7063234e08d6d01@mail.gmail.com> <6e155f00031080796c2cf466e05f12b6@ruby-forum.com> Message-ID: On 6/3/06, carmen wrote: > the C extension is littered with these warnings: > > warning: cast to pointer from integer of different size These will be fixed in a future version. Hopefully the next version. From dbalmain.ml at gmail.com Fri Jun 2 19:47:48 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 3 Jun 2006 08:47:48 +0900 Subject: [Ferret-talk] Indexing fails -- _ntc6.tmp exceeds 2 gigabyte maximum In-Reply-To: <4f70e224fda08440ae91bba7746fa2b8@ruby-forum.com> References: <4f70e224fda08440ae91bba7746fa2b8@ruby-forum.com> Message-ID: On 6/3/06, William Mitchell wrote: > Ferret 0.9.3 > Ruby 1.8.2 > NOT storing file contents in the index. > Only indexing first 25k of each file. > Very large data set (1 million files, 350 Gb) > Code based on snippet from David Balmain's forum posts. > > After 6 hours, Ferret bails out with Ruby "exceeds max file size". 
> > Cache: > > -rw-r--r-- 1 bill bill 2147483647 2006-06-01 22:45 _ntc6.tmp > -rw-r--r-- 1 bill bill 1690862924 2006-06-01 22:42 _ntc6.prx > -rw-r--r-- 1 bill bill 646302802 2006-06-01 22:42 _ntc6.frq > -rw-r--r-- 1 bill bill 165561698 2006-06-01 22:42 _ntc6.tis > -rw-r--r-- 1 bill bill 50541430 2006-06-01 22:14 _ntc6.fdt > -rw-r--r-- 1 bill bill 8000000 2006-06-01 22:14 _ntc6.fdx > -rw-r--r-- 1 bill bill 2097842 2006-06-01 22:42 _ntc6.tii > -rw-r--r-- 1 bill bill 1000000 2006-06-01 22:42 _ntc6.f0 > -rw-r--r-- 1 bill bill 1000000 2006-06-01 22:42 _ntc6.f1 > -rw-r--r-- 1 bill bill 30 2006-06-01 22:42 segments > -rw-r--r-- 1 bill bill 16 2006-06-01 22:14 _ntc6.fnm > > Code: > > #------------ > > index = Index::Index.new(:path => "/var/cache/ferrets") > > max_file_length = 25000 > > Dir.glob(allfiles).each do > |file| > doc = Document::Document.new() > doc << Document::Field.new(:file, file, > Document::Field::Store::YES, > Document::Field::Index::UNTOKENIZED) > doc << Document::Field.new(:content, IO.read(file, max_file_length), > Document::Field::Store::NO, > Document::Field::Index::TOKENIZED) > index << doc > end > > #------------ > > Is there a workaround, or is this exceeding Ferret's limits? You need to set :max_merge_docs when you create the index. This will stop the index merging segments when it gets to a certain size. This will also mean that you will always have multiple segments in your index which will slow things down a little but it shouldn't be a problem. Judging by the filenames you've almost merged 1,000,000 documents by the time it fails ("ntc6".to_i(36) = 1,111,110 = 1,000,000 documents and 111,110 merges). Looks like you are pretty close to finishing. So if you create your index like this it should work; index = Index::Index.new(:path => "/var/cache/ferrets", :max_merge_docs => 100_000) This will leave you with at least 10 segments at the end. You could also set max_merge_docs to 500_000 and run index.optimize at the end. 
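The segment-name arithmetic above can be checked with a short sketch. It assumes, and this is not stated in the thread, that every added document and every merge each consume one base-36 segment name, and that segments are merged ten at a time. The constant MERGE_FACTOR and both helper names are made up for illustration; this is not Ferret's segment-naming code.

```ruby
# Sketch of the arithmetic only; not Ferret's actual implementation.
MERGE_FACTOR = 10 # assumed merge factor, not stated in the thread

# Merges needed to fold doc_count single-document segments into one
# segment when merge_factor segments are merged at a time.
def merges_to_single_segment(doc_count, merge_factor)
  merges = 0
  segments = doc_count
  while segments >= merge_factor
    segments /= merge_factor
    merges += segments
  end
  merges
end

# Fewest segments possible when no segment may exceed max_merge_docs.
def min_segments(doc_count, max_merge_docs)
  (doc_count + max_merge_docs - 1) / max_merge_docs
end

counter = "ntc6".to_i(36) # segment names are base-36 numbers
docs = 1_000_000
all_merges = merges_to_single_segment(docs, MERGE_FACTOR)
finished_merges = all_merges - 1 # the final merge is the one that failed

puts counter                        # 1111110
puts docs + finished_merges         # 1111110
puts min_segments(docs, 100_000)    # 10
puts min_segments(docs, 500_000)    # 2
```

Under these assumptions the counter matches 1,000,000 documents plus 111,110 completed merges, with the failing merge being the last one into a single 1,000,000-document segment. Capping segments at 500,000 documents leaves at least two segments for the same document count.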
This should keep you under the max file size and with 2-3 segments, searching should be easily fast enough.

As an aside, you can also set :max_field_length (default 10,000) to limit the number of terms that get indexed from any one document instead of truncating the file to 25,000 bytes. This will prevent you getting a half term at the end of the document, as 25,000 bytes might break in the middle of a word. Truncating shouldn't affect search results too much, however, so you can keep doing it this way. In a future version you'll be able to pass a File handle instead of a string, in which case it will definitely be better to set :max_field_length.

> Thanks! By the way, retrieval is usably fast for my purposes, even on a
> big index like this. Very impressive.

Thanks. Please let me know how it goes. This is possibly the largest document set to be indexed with Ferret so far.

Cheers,
Dave

From dbalmain.ml at gmail.com Sat Jun 3 23:42:14 2006
From: dbalmain.ml at gmail.com (David Balmain)
Date: Sun, 4 Jun 2006 12:42:14 +0900
Subject: [Ferret-talk] Proposal of some radical changes to API
Message-ID:

Hey guys,

Now that the Lucy[1] project has Apache approval and is about to begin, the onus is no longer on Ferret to strive for Lucene compatibility. (We'll be doing that in Lucy). So I'm starting to think about ways to improve Ferret's API. The first part that needs to be improved is the Document API. It's annoying having to type all the attributes to initialize a field just to change the boost. So;

    field = Field.new(:name, "data...", Field::Store::YES, Field::Index::TOKENIZED, Field::TermVector::NO, false, 5.0)

would become;

    field = Field.new(:name, "data...", :index => Field::Index::TOKENIZED, :boost => 5.0)

It'd also be nice to replace the Parameter objects with symbols;

    field = Field.new(:name, "data...", :index => :tokenized, :boost => 5.0)

Of course, this raises the question, why do we need to specify that field :name is tokenized every time we create a :name field?
Isn't it always going to be the same? What if we use a different value the next time we add a :name field? Well, the answer to this last question is a specific set of rules;

    1. Once you choose to index a field, that field is always indexed from that point forward.
    2. Once you store term vectors, always store term vectors
    3. Once you store positions, always store positions
    4. Once you store offsets, always store offsets
    5. Once you store norms, always store norms

So currently if you add a field like this (I'll use the newer notation as it's easier to type);

    doc << Field.new(:field, "data...", :index => :yes, :term_vector => :with_positions_offsets)

And later add a field like this;

    doc << Field.new(:field, "diff...", :index => :no, :term_vector => :no)

This field will be indexed and its term vectors will be stored regardless. This is good because if you are using TermVectors in a particular field then you probably expect them to be there for all instances of that field. The problem is that earlier documents will have been added without storing term vectors. Now I don't know the exact thinking behind these rules but it seems to me that it would be better to just keep whatever rule you used when you first added the document. If you want to add term vectors later, then re-index.

So here's my radical API change proposal. You set a field's properties when you create the index and Document becomes (almost) a simple Hash object. Actually, you may not have realized this, but you can almost do this currently in Ferret. Once you add the first instance of a field, that field's properties are set. From then on you can just add documents as Hash objects and each field will have the same properties as in that first document that was added. (This isn't true of the Store or boost properties. These are set on a per document basis.)

So here is a possible example of the way I'd implement this;

    # the following might even look better in a YAML file.
    field_props = {
      :default => {:store => :no, :index => :tokenized, :term_vector => :no},
      :fields => {
        :id => {:store => :yes, :index => :no},
        :title => {:store => :yes, :term_vector => :with_positions_offsets},
        [:created_on, :updated_on] => {:store => :yes, :index => :untokenized}
      }
    }
    index = Index.new(:field_properties => field_props)

    # ...
    # And if later, you want to add a new field
    index.add_field(:image, {:store => :compressed, :index => :no})

Now you would just create Hashes instead of Documents. The only exception would be if you needed to set the boost for a particular field or document. So you would have this;

    index << {:title => "title", :data => "data..."}
    # boost a field
    index << {:title => Field.new("important title", 50.0), :data => "normal data"}
    # boost a document
    index << Document.new({:title => "important doc", :data => "data"}, 100.0)

So what do you all think? These are just ideas at the moment and it'd be a while before I could actually implement them. And don't worry, I'll do my best to keep backwards compatibility. Please give me your feedback.

Cheers,
Dave

[1] - http://wiki.apache.org/jakarta-lucene/LucyProposal

From JanPrill at blauton.de Sun Jun 4 06:18:02 2006
From: JanPrill at blauton.de (Jan Prill)
Date: Sun, 4 Jun 2006 12:18:02 +0200
Subject: [Ferret-talk] Proposal of some radical changes to API
In-Reply-To: <562a35c10606040315u695c678bt72454af66a7fec83@mail.gmail.com>
References: <562a35c10606040315u695c678bt72454af66a7fec83@mail.gmail.com>
Message-ID: <562a35c10606040318s63079476hc956c226e69f159d@mail.gmail.com>

Posted it first from my gmail address (with which I'm not subscribed to the mailing list) and therefore got an approval message. So here it comes again:

Hi, Dave,

first of all: congrats to you and Marvin for being approved by ASF. I think this is great because it ensures an even more prosperous future for lucene ports to dynamic languages.
The apache license is a great one and ensures true open source and all freedom that any user of the c-core would want for his projects, regardless if they are free, private, commercial or whatever. Apache is providing great software. Wouldn't know what I would have done in many projects without their webserver, xml-libraries, lucene, nutch et al. This must be a great place to be for a software developer, especially working with a "search legend" like Doug Cutting sounds like a great opportunity. Congrats again! The notation you are explaining looks great. The field props should indeed stay in a yaml file, the notation above looks a little too stuffed imho. But I'm really looking forward to the index creation with hashes. I think the rails crowed would love this, because it will look and feel railish to do things this way... Regards Jan On 6/4/06, Jan Prill wrote: > > Hi, Dave, > > first of all: congrats to you and Marvin for being approved by ASF. I > think this is great because it ensures an even more prosper future for > lucene ports to dynamic languages. The apache license is a great one and > ensures true open source and all freedom that any user of the c-core would > want for his projects, regardless if they are free, private, commercial or > whatever. > > Apache is providing great software. Wouldn't know what I would have done > in many projects without their webserver, xml-libraries, lucene, nutch et > al. This must be a great place to be for a software developer, especially > working with a "search legend" like Doug Cutting sounds like a great > opportunity. Congrats again! > > The notation you are explaining looks great. The field props should indeed > stay in a yaml file, the notation above looks a little too stuffed imho. But > I'm really looking forward to the index creation with hashes. I think the > rails crowed would love this, because it will look and feel railish to do > things this way... 
> > Regards > Jan > > > On 6/4/06, David Balmain wrote: > > > > Hey guys, > > > > Now that the Lucy[1] project has Apache approval and is about to > > begin, the onus is no longer on Ferret to strive for Lucene > > compatability. (We'll be doing that in Lucy). So I'm starting to think > > about ways to improve Ferret's API. The first part that needs to be > > improved is the Document API. It's annoying having to type all the > > attributes to initialize a field just to change the boost. So; > > > > field = Field.new(:name, "data...", Field::Store::YES, > > Field::Index::TOKENIZED, Field::TermVector::NO, false, 5.0) > > > > would become; > > > > field = Field.new(:name, "data...", :index => > > Field::Index::TOKENIZED, :boost => 5.0) > > > > It'd also be nice to replace the Parameter objects with symbols; > > > > field = Field.new(:name, "data...", :index => :tokenized, :boost => > > 5.0) > > > > Of course, this raises the question, why do we need to specify that > > field :name is tokenized every time we create a :name field? Isn't it > > always going to be the same? What if we use a different value the next > > time we and a :name field? Well the answer to this last question is a > > specific set of rules; > > > > 1. Once you choose to index a field, that field is always indexed > > from that point forward. > > 2. Once you store term vectors, always store term vectors > > 3. Once you store positions, always store positions > > 4. Once you store offsets, always store offsets > > 5. Once you store norms, always store norms > > > > So currently if you add a field like this (I'll use the newer notation > > as it's easier to type); > > > > doc << Field.new(:field, "data...", :index => :yes, :term_vector > > => :with_positions_offsets) > > > > And later add a field like this; > > > > doc << Field.new(:field, "diff...", :index => :no, :term_vector => > > :no) > > > > This field will be indexed and it's term vectors will be stored > > regardless. 
This is good because if you are using TermVectors in a > > particular field then you probably expect them to be there for all > > instances of that field. The problem is that earlier documents will > > have been added without storing term vectors. Now I don't know the > > exact thinking behind these rules but it seems to me that it would be > > better to just keep whatever rule you used when you first added the > > document. If you want to add term vectors later, then re-index. > > > > So here's my radical api change proposal. You set a fields properties > > when you create the index and Document becomes (almost) a simple Hash > > object. Actually, you may not have realized this, but you can almost > > do this currently in Ferret. Once you add the first instance of a > > field, that field's properties are set. From then on you and just add > > documents as Hash objects and each field will have the same properties > > as in that first document that was added. (This isn't true of the > > Store or boost properties. These are set on a per document basis.) > > > > So here is a possible way example of the way I'd implement this; > > > > # the following might even look better in a YAML file. > > field_props = { > > :default => {:store => :no, :index => :tokenized, :term_vector > > => :no}, > > :fields => { > > :id => {:store => :yes, :index => :no}, > > :title => {:store => :yes, :term_vector => > > :with_positions_offsets}, > > [:created_on, :updated_on] => {:store => :yes, :index => > > :untokenized} > > } > > } > > index = Index.new(:field_properties => field_props) > > > > # ... > > # And if later, you want to add a new field > > index.add_field(:image, {:store => :compressed, :index => :no}) > > > > Now you would just create Hashes instead of Documents. The only > > exception would be if you needed to set the boost for a particular > > field or document. 
So you would have this; > > > > index << {:title => "title", :data => "data..."} > > # boost a field > > index << {:title => Field.new("important title", 50.0), :data => > > "normal data"} > > # boost a document > > index << Document.new({:title => "important doc", :data => "data"}, > > 100.0) > > > > So what do you all think? These are just ideas at the moment and it'd > > be a while before I could actually implement them. And don't worry, > > I'll do my best to keep backwards compatibility. Please give me your > > feedback. > > > > Cheers, > > Dave > > > > [1] - http://wiki.apache.org/jakarta-lucene/LucyProposal > > _______________________________________________ > > Ferret-talk mailing list > > Ferret-talk at rubyforge.org > > http://rubyforge.org/mailman/listinfo/ferret-talk > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20060604/4aa7ff01/attachment-0001.htm From Neville.Burnell at bmsoft.com.au Sun Jun 4 22:33:13 2006 From: Neville.Burnell at bmsoft.com.au (Neville Burnell) Date: Mon, 5 Jun 2006 12:33:13 +1000 Subject: [Ferret-talk] Proposal of some radical changes to API Message-ID: <126EC586577FD611A28E00A0C9A037587E6BE5@maui.bmsoft.com.au> Hi Dave, Congrats on getting Lucy approved! WRT the proposed Ferret api changes, is there a good reason you chose :yes/:no as opposed to true/false for some of the boolean settings? Kind Regards Neville Burnell From Neville.Burnell at bmsoft.com.au Sun Jun 4 22:58:45 2006 From: Neville.Burnell at bmsoft.com.au (Neville Burnell) Date: Mon, 5 Jun 2006 12:58:45 +1000 Subject: [Ferret-talk] Ferret Win32 Gem for windows users ... Message-ID: <126EC586577FD611A28E00A0C9A037587E6BE6@maui.bmsoft.com.au> Hi and thanks for Ferret! I'm wondering if it would be possible to create a Ferret Win32 gem which includes the c performance code pre-compiled for those of us without a C compiler handy ? 
Zed Shaw seems to have cracked this particular nut with his Mongrel Win32 gem. Alternately, is there a zip of the Win32 .so Ferret needs that I could download and manually install? Kind Regards Neville Burnell -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20060605/84e05428/attachment.htm From dbalmain.ml at gmail.com Sun Jun 4 23:16:40 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Mon, 5 Jun 2006 12:16:40 +0900 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: <126EC586577FD611A28E00A0C9A037587E6BE5@maui.bmsoft.com.au> References: <126EC586577FD611A28E00A0C9A037587E6BE5@maui.bmsoft.com.au> Message-ID: On 6/5/06, Neville Burnell wrote: > Hi Dave, > > Congrats on getting Lucy approved! > > WRT the proposed Ferret api changes, is there a good reason you chose > :yes/:no as opposed to true/false for some of the boolean settings? Hi Neville, I don't know if it's a "good" reason. That's up to you to decide. The reason is that yes and no aren't the only options. For example, :store could be :yes, :no, :compressed and :term_vector could be :yes, :no, :with_positions, :with_offsets and :with_positions_and_offsets. It would seem strange to me to have the choices true, false and :compress. I hope that makes sense. Cheers, Dave > Kind Regards > > Neville Burnell > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From dbalmain.ml at gmail.com Sun Jun 4 23:22:19 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Mon, 5 Jun 2006 12:22:19 +0900 Subject: [Ferret-talk] Ferret Win32 Gem for windows users ... In-Reply-To: <126EC586577FD611A28E00A0C9A037587E6BE6@maui.bmsoft.com.au> References: <126EC586577FD611A28E00A0C9A037587E6BE6@maui.bmsoft.com.au> Message-ID: On 6/5/06, Neville Burnell wrote: > > > > Hi and thanks for Ferret! 
> > I'm wondering if it would be possible to create a Ferret Win32 gem which > includes the c performance code pre-compiled for those of us without a C > compiler handy ? Unfortunately not yet. Alex Young may be working on it. The problem is that Ferret currently doesn't compile under Visual C 6 so there is some porting that needs to be done. I've started an experimental version of Ferret fresh (with non-lucene-compatible changes) and am compiling it under VC6 as I go so once that is finished there will definitely be a windows version of Ferret. This could take a while. Also, the Lucy project will definitely be designed to work under VC6 so a windows version of Ferret is coming. It's just hard to say when. > Zed Shaw seems to have cracked this particular nut with his Mongrel Win32 > gem. > > Alternately, is there a zip of the Win32 .so Ferret needs that I could > download and manually install? > > Kind Regards > > Neville Burnell > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > > From Neville.Burnell at bmsoft.com.au Sun Jun 4 23:25:56 2006 From: Neville.Burnell at bmsoft.com.au (Neville Burnell) Date: Mon, 5 Jun 2006 13:25:56 +1000 Subject: [Ferret-talk] Proposal of some radical changes to API Message-ID: <126EC586577FD611A28E00A0C9A037587E6BEA@maui.bmsoft.com.au> >> The reason is that yes and no aren't the only options. Yep, that's a good reason I thought that was the case, but I wasn't sure. Cheers, Neville From marvin at rectangular.com Mon Jun 5 00:35:15 2006 From: marvin at rectangular.com (Marvin Humphrey) Date: Sun, 4 Jun 2006 21:35:15 -0700 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: References: Message-ID: On Jun 3, 2006, at 8:42 PM, David Balmain wrote: > Now that the Lucy[1] project has Apache approval and is about to > begin, the onus is no longer on Ferret to strive for Lucene > compatability. 
(We'll be doing that in Lucy). We'll take this up more aggressively once some under-appreciated volunteers at Apache create mailing lists and other infrastructure for Lucy, but I doubt we'll want to have Lucy's API mirror Lucene's 100%. Do we really want separate Hits and HitIterator classes, for instance? Other, more substantial issues are on the table too as far as I'm concerned, such as whether deletions should handled by the IndexReader rather than the IndexWriter. APIs are really hard to change once defined, and not taking hard-won lessons from Lucene into account would be a crime. Lucy will definitely need a define-fields-once interface, so although you're proposing stuff specifically for Ferret here, I'm studying it with an eye towards using it with Lucy. My inclination is to start with define-fields-once, then add dynamic field definitions later if we have to. > Of course, this raises the question, why do we need to specify that > field :name is tokenized every time we create a :name field? Isn't it > always going to be the same? The primary argument for allowing dynamic field definitions I've seen, is not that the definition might change, but that each document might contain previously undefined fields which are unknowable in advance. The CNET/Solr folks, Yonik and Hoss, really, really care about that. I think the idea of dynamic field definitions is weird. (A database that allows you to change the table definition with each INSERT? Huh?) I'm sure that CNet could have been done another way if dynamic field definitions hadn't been available, but they're committed now. :( > What if we use a different value the next > time we and a :name field? Well the answer to this last question is a > specific set of rules; > > 1. Once you choose to index a field, that field is > always indexed from that point forward. > 2. Once you store term vectors, always store term > vectors > 3. Once you store positions, always store positions > 4. 
Once you store offsets, always store offsets > 5. Once you store norms, always store norms It's actually messier than that, isn't it? Just because you've started marking a field as indexed doesn't mean that Lucene goes back to all the documents that you've already processed and indexes that field. Same deal with TermVectors, etc. At least in SQL, when you add a field to a table it goes and adds a default value for every row. > The problem is that earlier documents will > have been added without storing term vectors. Now I don't know the > exact thinking behind these rules but it seems to me that it would be > better to just keep whatever rule you used when you first added the > document. If you want to add term vectors later, then re-index. 'Zactly! > So here's my radical api change proposal. You set a fields properties > when you create the index and Document becomes (almost) a simple Hash > object. KinoSearch thinks of documents like hashes, too. Lucene, however, thinks of documents like arrays. > Actually, you may not have realized this, but you can almost > do this currently in Ferret. Once you add the first instance of a > field, that field's properties are set. From then on you and just add > documents as Hash objects and each field will have the same properties > as in that first document that was added. (This isn't true of the > Store or boost properties. These are set on a per document basis.) Why not set Store once and for all per-field? And heck, why not start with a default boost, but allow it to be overridden? > So here is a possible way example of the way I'd implement this; > > # the following might even look better in a YAML file. Ooo, nifty idea! How about a class whose sole purpose is to define fields and generate the YAML file? Or, if we're thinking future Lucene 2.1 file format, some Lucene-readable index definition file? 
> field_props = { > :default => {:store => :no, :index > => :tokenized, :term_vector => :no}, > :fields => { > :id => {:store => :yes, :index => :no}, > :title => {:store => :yes, :term_vector > => :with_positions_offsets}, > [:created_on, :updated_on] => {:store => :yes, :index => > :untokenized} > } > } > index = Index.new(:field_properties => field_props) This is nice and dense, but maybe a tad complicated. KinoSearch's take on doing field defs has some problems too. It was a mistake to make spec_field() a method of InvIndexer (KinoSearch's index writer/modifier class). The index writer and reader classes suffer from serious bloat no matter what, so anything that can be shunted somewhere else should be. Marvin Humphrey Rectangular Research http://www.rectangular.com/ From dbalmain.ml at gmail.com Mon Jun 5 01:46:23 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Mon, 5 Jun 2006 14:46:23 +0900 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: References: Message-ID: On 6/5/06, Marvin Humphrey wrote: > > On Jun 3, 2006, at 8:42 PM, David Balmain wrote: > > > Now that the Lucy[1] project has Apache approval and is about to > > begin, the onus is no longer on Ferret to strive for Lucene > > compatability. (We'll be doing that in Lucy). > > We'll take this up more aggressively once some under-appreciated > volunteers at Apache create mailing lists and other infrastructure > for Lucy, but I doubt we'll want to have Lucy's API mirror Lucene's > 100%. Do we really want separate Hits and HitIterator classes, for > instance? Other, more substantial issues are on the table too as far > as I'm concerned, such as whether deletions should handled by the > IndexReader rather than the IndexWriter. APIs are really hard to > change once defined, and not taking hard-won lessons from Lucene into > account would be a crime. Thanks for pointing that out. I couldn't agree with you more. 
What I meant was that Lucy would be striving to maintain "index file format" compatibility (which I believe was the plan). I didn't make this very clear, though, as I was talking about changes to the API but as I was writing this I was thinking about what changes to the index file format would allow. > Lucy will definitely need a define-fields-once interface, so although > you're proposing stuff specifically for Ferret here, I'm studying it > with an eye towards using it with Lucy. My inclination is to start > with define-fields-once, then add dynamic field definitions later if > we have to. This sounds good to me. > > Of course, this raises the question, why do we need to specify that > > field :name is tokenized every time we create a :name field? Isn't it > > always going to be the same? > > The primary argument for allowing dynamic field definitions I've > seen, is not that the definition might change, but that each document > might contain previously undefined fields which are unknowable in > advance. The CNET/Solr folks, Yonik and Hoss, really, really care > about that. > > I think the idea of dynamic field definitions is weird. (A database > that allows you to change the table definition with each INSERT? > Huh?) I'm sure that CNet could have been done another way if dynamic > field definitions hadn't been available, but they're committed now. :( Actually, I fall into the category of people who like dynamic field definitions. I agree that they are not necessary but it certainly makes some things easy. For instance, in a rails application you can add models to an index and you get to specify within the model itself which of its fields will be added to the index. The index itself doesn't need to know which models will be indexed or how they will be indexed it just needs to know to store the id field and the model name field and index everything else. It's all about keeping it DRY. 
The part I don't like about lucene is *sometimes* being able to change a fields properties. > > What if we use a different value the next > > time we and a :name field? Well the answer to this last question is a > > specific set of rules; > > > > 1. Once you choose to index a field, that field is > > always indexed from that point forward. > > 2. Once you store term vectors, always store term > > vectors > > 3. Once you store positions, always store positions > > 4. Once you store offsets, always store offsets > > 5. Once you store norms, always store norms > > It's actually messier than that, isn't it? Just because you've > started marking a field as indexed doesn't mean that Lucene goes back > to all the documents that you've already processed and indexes that > field. Same deal with TermVectors, etc. > > At least in SQL, when you add a field to a table it goes and adds a > default value for every row. > > > The problem is that earlier documents will > > have been added without storing term vectors. Now I don't know the > > exact thinking behind these rules but it seems to me that it would be > > better to just keep whatever rule you used when you first added the > > document. If you want to add term vectors later, then re-index. > > 'Zactly! > > > So here's my radical api change proposal. You set a fields properties > > when you create the index and Document becomes (almost) a simple Hash > > object. > > KinoSearch thinks of documents like hashes, too. Lucene, however, > thinks of documents like arrays. > > > Actually, you may not have realized this, but you can almost > > do this currently in Ferret. Once you add the first instance of a > > field, that field's properties are set. From then on you and just add > > documents as Hash objects and each field will have the same properties > > as in that first document that was added. (This isn't true of the > > Store or boost properties. These are set on a per document basis.) 
> > Why not set Store once and for all per-field? And heck, why not > start with a default boost, but allow it to be overridden? My plan exactly. In my experimental version of Ferret I have a fields file along with the segments file. The fields file stores all the field metadata such as store, index, term-vector and field boosts. That way there is no need to maintain a separate FieldInfos file per segment. (This will make merging a lot more difficult but I'm still thinking about that one.) > > So here is a possible way example of the way I'd implement this; > > > > # the following might even look better in a YAML file. > > Ooo, nifty idea! How about a class whose sole purpose is to define > fields and generate the YAML file? Or, if we're thinking future > Lucene 2.1 file format, some Lucene-readable index definition file? Now this idea I like. Perhaps even a simple question/answer app to generate the index definition file. I'd guess that Lucene will probably end up going with XML rather than YAML. Cheers, Dave From alex at blackkettle.org Mon Jun 5 04:23:46 2006 From: alex at blackkettle.org (Alex Young) Date: Mon, 05 Jun 2006 09:23:46 +0100 Subject: [Ferret-talk] Ferret Win32 Gem for windows users ... In-Reply-To: References: <126EC586577FD611A28E00A0C9A037587E6BE6@maui.bmsoft.com.au> Message-ID: <4483EA12.3020003@blackkettle.org> David Balmain wrote: > On 6/5/06, Neville Burnell wrote: > >> >> >>Hi and thanks for Ferret! >> >>I'm wondering if it would be possible to create a Ferret Win32 gem which >>includes the c performance code pre-compiled for those of us without a C >>compiler handy ? > > > Unfortunately not yet. Alex Young may be working on it. The problem is > that Ferret currently doesn't compile under Visual C 6 so there is > some porting that needs to be done. Indeed I am. Or shall be this week. I'll be trying for something on Friday at the latest, but I can't promise that what I come up with will be useful to anyone but me... 
I will report back either way, though. -- Alex From fcsmith at gmail.com Mon Jun 5 10:48:21 2006 From: fcsmith at gmail.com (Finn Smith) Date: Mon, 5 Jun 2006 10:48:21 -0400 Subject: [Ferret-talk] Windows progress In-Reply-To: <8FF53956-3E1A-497F-91CD-AA61A0885EE3@rectangular.com> References: <447E9B5E.9080009@blackkettle.org> <447ECF35.8070500@blackkettle.org> <763AC6A3-E8B7-4342-9498-22E146065CAF@rectangular.com> <8FF53956-3E1A-497F-91CD-AA61A0885EE3@rectangular.com> Message-ID: <6e72bbd70606050748g1ae03d58w24ea5415bc6a9068@mail.gmail.com> On 6/1/06, Marvin Humphrey wrote: > > On Jun 1, 2006, at 11:00 AM, Marvin Humphrey wrote: > > > IMO, it would be best for everybody > > if we did this within the Lucene family, > > ... and that what's going to happen. I just got an email from Doug. > We're good to go. > > Thank you, Lucene PMC. > > :) Other than the initial proposal, any pointers to websites or mailing lists where we can track the development of this project? Thanks. -F From marvin at rectangular.com Mon Jun 5 11:17:27 2006 From: marvin at rectangular.com (Marvin Humphrey) Date: Mon, 5 Jun 2006 08:17:27 -0700 Subject: [Ferret-talk] Windows progress In-Reply-To: <6e72bbd70606050748g1ae03d58w24ea5415bc6a9068@mail.gmail.com> References: <447E9B5E.9080009@blackkettle.org> <447ECF35.8070500@blackkettle.org> <763AC6A3-E8B7-4342-9498-22E146065CAF@rectangular.com> <8FF53956-3E1A-497F-91CD-AA61A0885EE3@rectangular.com> <6e72bbd70606050748g1ae03d58w24ea5415bc6a9068@mail.gmail.com> Message-ID: On Jun 5, 2006, at 7:48 AM, Finn Smith wrote: > Other than the initial proposal, any pointers to websites or mailing > lists where we can track the development of this project? We're now waiting for our Apache accounts to be set up, the mailing lists and the subversion repositories to be created, etc. If I'm not mistaken, all the infrastructure support work at Apache is done by volunteers, so patience is the watchword. 
Once there is a Lucy mailing list, we'll send a notification to this list. Not a lot is going on right now besides the occasional spasm of high- level planning on either the KinoSearch list or the Ferret list. That's because in order to avoid the Apache Incubator process, all development needs to take place "on the record" in Apache forums and repositories. Marvin Humphrey Rectangular Research http://www.rectangular.com/ From marvin at rectangular.com Mon Jun 5 12:48:05 2006 From: marvin at rectangular.com (Marvin Humphrey) Date: Mon, 5 Jun 2006 09:48:05 -0700 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: References: Message-ID: <3F5173D8-AB68-4D13-A194-D6CDB819D5B7@rectangular.com> On Jun 4, 2006, at 10:46 PM, David Balmain wrote: > What I > meant was that Lucy would be striving to maintain "index file format" > compatibility (which I believe was the plan). It's funny that we haven't actually settled that. I used to think index compatibility was really important, but I don't so much any more. Index compatibility is DOA unless Lucene adopts bytecounts as string headers, because it would be insanity for Lucy to deal with the current format. So we're talking compatibility no sooner than Lucene 2.1, and adapting Lucene will be a challenge. I think the only way to make up the lost speed is to pry in the KinoSearch merge model. I strongly suspect that that will prove to be a marked improvement over not just the patched version, but the current release. However... It's a lot of work, and I think I'm the only obvious candidate with both the expertise and (maybe) the desire to do it, unless you want to take it on. Two stages out of four are complete. The bytecounts patch was stage 1, and last night I supplied stage 2: a Java port of KinoSearch's external sorting module. 
Stage 3 is adapting Lucene's indexing apparatus to write indexes by the segment rather than the document -- porting KinoSearch's SegWriter module and eliminating DocumentWriter and SegmentMerger would be a start. The last stage is adapting everything to be backwards compatible with char-counts as string headers. I'm not sure that I want to dedicate that much of my time to Lucene, at least not right now. The changes outlined above are pretty major. It's likely that some bugs will get introduced simply because of the volume of code change, so that's an argument against making any change at all unless there's a real benefit. There would be -- the KinoSearch merge model is faster -- but politically speaking, selling the whole package to the Lucene community would be a PITA. Not only do I have to argue that the tangible benefits justify the disruption, I have to make the argument that it's not OK for compatibility to begin and end with Java[1][2], plus deal with outright hostility and abuse from extreme Java partisans[3]. I'd rather spend my time and energy contributing to Lucy. Besides, I think that ultimately, trying to be compatible with other ports would be as much of a drag on Lucy as Lucene, and I think it's advisable for both projects to declare their file formats private. The Lucene file format is just too complex and difficult to serve as a good interchange medium. The only major reason for Lucy to be file-format-compatible with Lucene is Luke. IMO, if we want Luke's benefits, we should be hacking Luke. 
Marvin Humphrey Rectangular Research http://www.rectangular.com/ [1] http://xrl.us/m2o3 (Link to mail-archives.apache.org) [2] http://xrl.us/m2o7 (Link to mail-archives.apache.org) [3] http://xrl.us/m2kp (Link to mail-archives.apache.org) From marvin at rectangular.com Mon Jun 5 13:28:36 2006 From: marvin at rectangular.com (Marvin Humphrey) Date: Mon, 5 Jun 2006 10:28:36 -0700 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: References: Message-ID: <0DCD9B50-667A-49C4-974B-B1513B69BFED@rectangular.com> On Jun 4, 2006, at 10:46 PM, David Balmain wrote: > In my experimental version of Ferret I have a fields > file along with the segments file. The fields file stores all the > field metadata such as store, index, term-vector and field boosts. > That way there is no need to maintain a separate FieldInfos file per > segment. (This will make merging a lot more difficult but I'm still > thinking about that one.) Robert Kirchgessner made a similar proposal: http://xrl.us/m2qq (Link to mail-archives.apache.org) Robert addresses the merging issue in a subsequent email, and I think his arguments are compelling. IMO, field defs should be immutable and consistent over the entire index. >>> So here is a possible way example of the way I'd implement this; >>> >>> # the following might even look better in a YAML file. >> >> Ooo, nifty idea! How about a class whose sole purpose is to define >> fields and generate the YAML file? Or, if we're thinking future >> Lucene 2.1 file format, some Lucene-readable index definition file? > > Now this idea I like. Perhaps even a simple question/answer app to > generate the index definition file. I'd guess that Lucene will > probably end up going with XML rather than YAML. I think it would be a binary file, using Lucene's standard writeString, writeVInt, etc. methods. A question/answer app could easily be built based around a module. How does "IndexCreator" sound? 
Take away the ability of the IndexWriter module to create or redefine indexes, and encapsulate that functionality within one module. Using Java as our lingua franca...

    IndexCreator creator = new IndexCreator(filePath);
    FieldDefinition titleDef = new FieldDefinition("title",
        Field.Store.YES, Field.Index.TOKENIZED);
    FieldDefinition bodyDef = new FieldDefinition("body",
        Field.Store.YES, Field.Index.TOKENIZED,
        Field.TermVector.YES);
    creator.addFieldDefinition(titleDef);
    creator.addFieldDefinition(bodyDef);
    creator.createIndex();

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

From mrvreddy at hotmail.com Tue Jun 6 13:04:30 2006
From: mrvreddy at hotmail.com (Raghuveer Mamilla)
Date: Tue, 6 Jun 2006 19:04:30 +0200
Subject: [Ferret-talk] stack level too deep
Message-ID: <66977560e9efaa88c6905b63fc5c506a@ruby-forum.com>

I am new to Ferret. I am trying to build a sample application on Ferret. This is my code:

    class SearchController < ApplicationController
      require 'ferret'
      include Ferret

      index = Index::Index.new(:path => '/path/to/index')

      def list
      end

      def index
        index << {:title => "Programming Ruby", :content => "blah blah blah"}
        index << {:title => "Programming Ruby", :content => "yada yada yada"}
        count = index.search_each('content:"blah"') do |doc, score|
          puts "Document #{doc} found with a score of #{score}"
        end
        puts count
      end
    end

This is what I ended up with:

    http://localhost:3000/search/index

    SystemStackError in SearchController#index
    stack level too deep
    RAILS_ROOT: ./script/../config/..
    Application Trace | Framework Trace | Full Trace
    #{RAILS_ROOT}/app/controllers/search_controller.rb:11:in `index'
    #{RAILS_ROOT}/app/controllers/search_controller.rb:11:in `index'
    . . .

Please reply if someone knows the answer.

--
Posted via http://www.ruby-forum.com/.
From lmarlow at yahoo.com Tue Jun 6 13:11:09 2006 From: lmarlow at yahoo.com (Lee Marlow) Date: Tue, 6 Jun 2006 11:11:09 -0600 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: <0DCD9B50-667A-49C4-974B-B1513B69BFED@rectangular.com> References: <0DCD9B50-667A-49C4-974B-B1513B69BFED@rectangular.com> Message-ID: <7968d7490606061011g21fd6118i2065c5b6d0187c78@mail.gmail.com> Do you mean that all fields would have to be known at index creation time or just that once a field is defined it properties are the same across all documents? Right now I'm indexing documents that create new fields as needed based on user defined properties, so we don't know all the fields initially. On 6/5/06, Marvin Humphrey wrote: > > On Jun 4, 2006, at 10:46 PM, David Balmain wrote: > > > In my experimental version of Ferret I have a fields > > file along with the segments file. The fields file stores all the > > field metadata such as store, index, term-vector and field boosts. > > That way there is no need to maintain a separate FieldInfos file per > > segment. (This will make merging a lot more difficult but I'm still > > thinking about that one.) > > Robert Kirchgessner made a similar proposal: > > http://xrl.us/m2qq (Link to mail-archives.apache.org) > > Robert addresses the merging issue in a subsequent email, and I think > his arguments are compelling. > > IMO, field defs should be immutable and consistent over the entire > index. > > >>> So here is a possible way example of the way I'd implement this; > >>> > >>> # the following might even look better in a YAML file. > >> > >> Ooo, nifty idea! How about a class whose sole purpose is to define > >> fields and generate the YAML file? Or, if we're thinking future > >> Lucene 2.1 file format, some Lucene-readable index definition file? > > > > Now this idea I like. Perhaps even a simple question/answer app to > > generate the index definition file. 
I'd guess that Lucene will > > probably end up going with XML rather than YAML. > > I think it would be a binary file, using Lucene's standard > writeString, writeVInt, etc. methods. > > A question/answer app could easily be built based around a module. > > How does "IndexCreator" sound? Take away the ability of the > IndexWriter module to create or redefine indexes, and encapsulate > that functionality within one module. Using Java as our lingua > franca... > > IndexCreator creator = new IndexCreator(filePath); > FieldDefinition titleDef = new FieldDefinition("title", > Field.Store.YES, Field.Index.TOKENIZED); > FieldDefinition bodyDef = new FieldDefinition("body", > Field.Store.YES, Field.Index.TOKENIZED > Field.TermVector.YES); > creator.addFieldDefinition(titleDef); > creator.addFieldDefinition(bodyDef); > creator.createIndex(); > > Marvin Humphrey > Rectangular Research > http://www.rectangular.com/ > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From marvin at rectangular.com Tue Jun 6 14:21:10 2006 From: marvin at rectangular.com (Marvin Humphrey) Date: Tue, 6 Jun 2006 11:21:10 -0700 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: <7968d7490606061011g21fd6118i2065c5b6d0187c78@mail.gmail.com> References: <0DCD9B50-667A-49C4-974B-B1513B69BFED@rectangular.com> <7968d7490606061011g21fd6118i2065c5b6d0187c78@mail.gmail.com> Message-ID: <0C08F38B-E26B-42CF-9F30-6200FD122FA9@rectangular.com> On Jun 6, 2006, at 10:11 AM, Lee Marlow wrote: > Do you mean that all fields would have to be known at index creation > time or just that once a field is defined it properties are the same > across all documents? Right now I'm indexing documents that create > new fields as needed based on user defined properties, so we don't > know all the fields initially. How would you handle this if you were using an SQL database rather than Ferret? 
Your app wouldn't be able to modify the table on the fly on that case, unless you did something insane like run a remote "ALTER TABLE" command. Marvin Humphrey Rectangular Research http://www.rectangular.com/ From JanPrill at blauton.de Tue Jun 6 14:37:41 2006 From: JanPrill at blauton.de (Jan Prill) Date: Tue, 6 Jun 2006 20:37:41 +0200 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: <0C08F38B-E26B-42CF-9F30-6200FD122FA9@rectangular.com> References: <0DCD9B50-667A-49C4-974B-B1513B69BFED@rectangular.com> <7968d7490606061011g21fd6118i2065c5b6d0187c78@mail.gmail.com> <0C08F38B-E26B-42CF-9F30-6200FD122FA9@rectangular.com> Message-ID: <562a35c10606061137m6f2325e2v923ad13547fb59c9@mail.gmail.com> Hi Marvin, this statement tempted me to jump in, even without using something like dynamic field creation myself __right now__. But I have been - especially on cms like projects badly in need for dynamic fields. That something isn't common in sql doesn't mean that there is no need for this "something". This limitation of sql is the reason for doing things like storing xml in relational dbs as well as the reason for people using object dbs. I don't know if you had a look at dabble db, but imagine something like this with a relational dbms. not funny! Because of this they haven't even thought about using sql for dabble db. So maybe it's just me but the argument: you can't do this in sql either doesn't sound too convincing... Cheers, Jan On 6/6/06, Marvin Humphrey wrote: > > > On Jun 6, 2006, at 10:11 AM, Lee Marlow wrote: > > > Do you mean that all fields would have to be known at index creation > > time or just that once a field is defined it properties are the same > > across all documents? Right now I'm indexing documents that create > > new fields as needed based on user defined properties, so we don't > > know all the fields initially. > > How would you handle this if you were using an SQL database rather > than Ferret? 
Your app wouldn't be able to modify the table on the
> fly on that case, unless you did something insane like run a remote
> "ALTER TABLE" command.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/

From M.B.Smillie at sms.ed.ac.uk Tue Jun 6 14:48:09 2006
From: M.B.Smillie at sms.ed.ac.uk (Matthew Smillie)
Date: Tue, 6 Jun 2006 19:48:09 +0100
Subject: [Ferret-talk] stack level too deep
In-Reply-To: <66977560e9efaa88c6905b63fc5c506a@ruby-forum.com>
References: <66977560e9efaa88c6905b63fc5c506a@ruby-forum.com>
Message-ID: <34F8BE13-F68B-45F7-8FAB-D8178B613566@sms.ed.ac.uk>

On Jun 6, 2006, at 18:04, Raghuveer Mamilla wrote:

>   def index
>
>     index << ...

I haven't tried it out, but I believe that ruby is interpreting this as a recursive call to the index method. At least, that's what usually creates that sort of error.

I'm not altogether familiar with rails conventions, but in the general case I would either make 'index' a class (@@index) or instance (@index) variable, which would avoid this problem. Is there some reason it's declared as local?

matthew smillie.
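
The diagnosis above can be sketched in plain Ruby. Inside `def index`, the bare name `index` refers to the method itself, because a local variable assigned in the class body is not visible inside method bodies. This is only a minimal sketch: an Array stands in for the Ferret index so that it runs without Ferret or Rails, and the class names `BrokenSearch` and `FixedSearch` are made up for illustration.

```ruby
# An Array stands in for the Ferret index; the class names are hypothetical.
class BrokenSearch
  index = []   # local to the class body, invisible inside method bodies

  def index
    # No local variable named `index` exists here, so this calls the
    # method again, recursing until the stack overflows.
    index << {:title => "Programming Ruby"}
  end
end

class FixedSearch
  def initialize
    @index = []   # instance variable, visible in every instance method
  end

  def index
    @index << {:title => "Programming Ruby", :content => "blah blah blah"}
    @index.size
  end
end

begin
  BrokenSearch.new.index
rescue SystemStackError => e
  puts "BrokenSearch raised #{e.class}"
end

puts "FixedSearch holds #{FixedSearch.new.index} document(s)"
```

Whether a Rails controller should hold the index in an instance variable, a class variable, or a constant is a separate design question that this sketch does not settle; it only shows why the original name lookup recurses.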
From eimorton at gmail.com Tue Jun 6 15:55:04 2006
From: eimorton at gmail.com (Erik)
Date: Tue, 6 Jun 2006 21:55:04 +0200
Subject: [Ferret-talk] Ampersand Crashes Ruby
Message-ID: <2820eef2669a4282422b1e98fdc5766b@ruby-forum.com>

I'm using acts_as_ferret and when I call Object.find_by_contents("A & B"), Ruby dies with the following message:

    ^Cruby(5014,0xa000cf60) malloc: *** vm_allocate(size=1069056) failed (error code=3)
    ruby(5014,0xa000cf60) malloc: *** error: can't allocate region
    ruby(5014,0xa000cf60) malloc: *** set a breakpoint in szone_error to debug
    ruby(5014,0xa000cf60) malloc: *** vm_allocate(size=1069056) failed (error code=3)
    ruby(5014,0xa000cf60) malloc: *** error: can't allocate region
    ruby(5014,0xa000cf60) malloc: *** set a breakpoint in szone_error to debug
    ruby(5014,0xa000cf60) malloc: *** vm_allocate(size=1069056) failed (error code=3)
    ruby(5014,0xa000cf60) malloc: *** error: can't allocate region
    ruby(5014,0xa000cf60) malloc: *** set a breakpoint in szone_error to debug
    ruby(5014,0xa000cf60) malloc: *** vm_allocate(size=1069056) failed (error code=3)
    ruby(5014,0xa000cf60) malloc: *** error: can't allocate region
    ruby(5014,0xa000cf60) malloc: *** set a breakpoint in szone_error to debug
    [FATAL] failed to allocate memory

Any thoughts?

Regards,
Erik

--
Posted via http://www.ruby-forum.com/.
From marvin at rectangular.com Tue Jun 6 17:07:53 2006 From: marvin at rectangular.com (Marvin Humphrey) Date: Tue, 6 Jun 2006 14:07:53 -0700 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: <562a35c10606061137m6f2325e2v923ad13547fb59c9@mail.gmail.com> References: <0DCD9B50-667A-49C4-974B-B1513B69BFED@rectangular.com> <7968d7490606061011g21fd6118i2065c5b6d0187c78@mail.gmail.com> <0C08F38B-E26B-42CF-9F30-6200FD122FA9@rectangular.com> <562a35c10606061137m6f2325e2v923ad13547fb59c9@mail.gmail.com> Message-ID: On Jun 6, 2006, at 11:37 AM, Jan Prill wrote: > this statement tempted me to jump in, even without using something > like dynamic field creation myself __right now__. But I have been - > especially on cms like projects badly in need for dynamic fields. > > That something isn't common in sql doesn't mean that there is no > need for this "something". This limitation of sql is the reason for > doing things like storing xml in relational dbs as well as the > reason for people using object dbs. I don't know if you had a look > at dabble db, but imagine something like this with a relational > dbms. not funny! Because of this they haven't even thought about > using sql for dabble db. So maybe it's just me but the argument: > you can't do this in sql either doesn't sound too convincing... Jan, I don't understand the requirement, and I'm not familiar with the either dabble db or Rails, so neither that example nor the "models" example Dave cited earlier has spoken to me. I asked the question because I honestly wanted to see a concrete example of an application that couldn't be handled within the constraint of pre- defined fields. Behind the scenes in Lucene is an elaborate, expensive apparatus for dealing with dynamic fields. Each document gets turned into its own miniature inverted index, complete with its own FieldInfos, FieldsWriter, DocumentWriter, TermInfosWriter, and so on. 
When these mini-indexes get merged, field definitions have to be reconciled. This merge stage is one of the bottlenecks which slow down interpreted-language ports of Lucene so severely, because there's a lot of object creation and destruction and a lot of method calls. KinoSearch uses a fixed-field-definition model. Before you add any documents to an index, you have to tell the index writer about all the possible fields you might use. When you add the first document, it creates the FieldInfos, FieldsWriter, etc, which persist throughout the life of the index writer. Instead of reconciling field definitions each time a document gets added, the field defs are defined as invariant for that indexing session. This is much faster, because there is far less object creation and destruction, and far less disk shuffling as well -- no segment merging, therefore no movement of stored fields, term vectors, etc. There are several possible ways to add dynamic fields back in to the fixed-field-def model. My main priority in doing so, if it proves to be necessary, is to keep table-alteration logic separate from insertion operations. Having the two conflated introduces needless complexity and computational expense at the back end. It's also just plain confusing -- if you accidentally forget to set OMIT_NORMS just once, all of a sudden that field is going to have norms for ever and ever amen. I think the user ought to have absolute control over field definitions. Inserting a field with a conflicting definition ought to be an error. Lucy is going to start with the KinoSearch merge model. I will do a better job of adding dynamic capabilities to it if you or someone else can articulate some specific examples of situations where static definitions would not suffice. I can think of a few tasks which would be slightly more convenient if new fields could be added on the fly, but maybe you can go one better and illustrate why dynamic field defs are essential. 
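
As a rough illustration of the fixed-field-definition idea described above (fields declared up front, and a conflicting redefinition treated as an error), here is a minimal Ruby sketch. The `FieldDefs` class, its `define` method, and the property names are hypothetical; none of them come from KinoSearch, Lucy, Lucene, or Ferret.

```ruby
# Hypothetical sketch: field definitions that are fixed once they are set.
# None of these names belong to KinoSearch, Lucy, Lucene, or Ferret.
class FieldDefs
  class ConflictError < StandardError; end

  def initialize
    @defs = {}   # field name => frozen hash of properties
  end

  # Define a field. Repeating an identical definition is harmless;
  # a different definition for an existing field raises ConflictError.
  def define(name, props)
    existing = @defs[name]
    if existing && existing != props
      raise ConflictError,
            "field #{name.inspect} is already defined as #{existing.inspect}"
    end
    @defs[name] ||= props.dup.freeze
  end

  def defined?(name)
    @defs.key?(name)
  end
end

defs = FieldDefs.new
defs.define(:title, :store => true, :index => :tokenized)
defs.define(:body,  :store => true, :index => :tokenized, :term_vector => true)

begin
  # Same field, different properties: rejected instead of silently merged.
  defs.define(:title, :store => false, :index => :tokenized)
rescue FieldDefs::ConflictError => e
  puts "rejected: #{e.message}"
end
```

Keeping definition separate from insertion in this way means the insert path never has to reconcile definitions; it only has to look them up.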
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

From Neville.Burnell at bmsoft.com.au Tue Jun 6 19:33:30 2006
From: Neville.Burnell at bmsoft.com.au (Neville Burnell)
Date: Wed, 7 Jun 2006 09:33:30 +1000
Subject: [Ferret-talk] Proposal of some radical changes to API
Message-ID: <126EC586577FD611A28E00A0C9A037587E6C02@maui.bmsoft.com.au>

>> I asked the question because I honestly wanted to see a concrete
>> example of an application that couldn't be handled within the
>> constraint of pre-defined fields.

My current application involves writing a web application which can search a Ferret index built from a SQL database. The idea is that the customer supplies SQL queries for, say, customers, suppliers, sales, purchases, etc. The app then retrieves the rows from the data source and indexes them using Ferret. The app provides both an HTML website as an interface to the index and an XML API which can be used by non-browser clients.

The field set is quite different for each SQL query [and is essentially out of our control].

HTH,

Neville

-----Original Message-----
From: ferret-talk-bounces at rubyforge.org [mailto:ferret-talk-bounces at rubyforge.org] On Behalf Of Marvin Humphrey
Sent: Wednesday, 7 June 2006 7:08 AM
To: ferret-talk at rubyforge.org
Subject: Re: [Ferret-talk] Proposal of some radical changes to API

On Jun 6, 2006, at 11:37 AM, Jan Prill wrote:

> this statement tempted me to jump in, even without using something
> like dynamic field creation myself __right now__. But I have been -
> especially on cms like projects badly in need for dynamic fields.
>
> That something isn't common in sql doesn't mean that there is no need
> for this "something". This limitation of sql is the reason for doing
> things like storing xml in relational dbs as well as the reason for
> people using object dbs. I don't know if you had a look at dabble db,
> but imagine something like this with a relational dbms. not funny!
> Because of this they haven't even thought about using sql for dabble > db. So maybe it's just me but the argument: > you can't do this in sql either doesn't sound too convincing... Jan, I don't understand the requirement, and I'm not familiar with the either dabble db or Rails, so neither that example nor the "models" example Dave cited earlier has spoken to me. I asked the question because I honestly wanted to see a concrete example of an application that couldn't be handled within the constraint of pre- defined fields. Behind the scenes in Lucene is an elaborate, expensive apparatus for dealing with dynamic fields. Each document gets turned into its own miniature inverted index, complete with its own FieldInfos, FieldsWriter, DocumentWriter, TermInfosWriter, and so on. When these mini-indexes get merged, field definitions have to be reconciled. This merge stage is one of the bottlenecks which slow down interpreted-language ports of Lucene so severely, because there's a lot of object creation and destruction and a lot of method calls. KinoSearch uses a fixed-field-definition model. Before you add any documents to an index, you have to tell the index writer about all the possible fields you might use. When you add the first document, it creates the FieldInfos, FieldsWriter, etc, which persist throughout the life of the index writer. Instead of reconciling field definitions each time a document gets added, the field defs are defined as invariant for that indexing session. This is much faster, because there is far less object creation and destruction, and far less disk shuffling as well -- no segment merging, therefore no movement of stored fields, term vectors, etc. There are several possible ways to add dynamic fields back in to the fixed-field-def model. My main priority in doing so, if it proves to be necessary, is to keep table-alteration logic separate from insertion operations. 
Having the two conflated introduces needless complexity and computational expense at the back end. It's also just plain confusing -- if you accidentally forget to set OMIT_NORMS just once, all of a sudden that field is going to have norms for ever and ever amen. I think the user ought to have absolute control over field definitions. Inserting a field with a conflicting definition ought to be an error. Lucy is going to start with the KinoSearch merge model. I will do a better job of adding dynamic capabilities to it if you or someone else can articulate some specific examples of situations where static definitions would not suffice. I can think of a few tasks which would be slightly more convenient if new fields could be added on the fly, but maybe you can go one better and illustrate why dynamic field defs are essential. Marvin Humphrey Rectangular Research http://www.rectangular.com/ _______________________________________________ Ferret-talk mailing list Ferret-talk at rubyforge.org http://rubyforge.org/mailman/listinfo/ferret-talk From Neville.Burnell at bmsoft.com.au Tue Jun 6 19:35:19 2006 From: Neville.Burnell at bmsoft.com.au (Neville Burnell) Date: Wed, 7 Jun 2006 09:35:19 +1000 Subject: [Ferret-talk] Ampersand Crashes Ruby Message-ID: <126EC586577FD611A28E00A0C9A037587E6C03@maui.bmsoft.com.au> Perhaps try "A && B" -----Original Message----- From: ferret-talk-bounces at rubyforge.org [mailto:ferret-talk-bounces at rubyforge.org] On Behalf Of Erik Sent: Wednesday, 7 June 2006 5:55 AM To: ferret-talk at rubyforge.org Subject: [Ferret-talk] Ampersand Crashes Ruby I'm using acts_as_ferret and when I call Object.find_by_contents("A & B"), Ruby dies with the following message: ^Cruby(5014,0xa000cf60) malloc: *** vm_allocate(size=1069056) failed (error code=3) ruby(5014,0xa000cf60) malloc: *** error: can't allocate region ruby(5014,0xa000cf60) malloc: *** set a breakpoint in szone_error to debug ruby(5014,0xa000cf60) malloc: *** vm_allocate(size=1069056) failed 
(error code=3) ruby(5014,0xa000cf60) malloc: *** error: can't allocate region ruby(5014,0xa000cf60) malloc: *** set a breakpoint in szone_error to debug ruby(5014,0xa000cf60) malloc: *** vm_allocate(size=1069056) failed (error code=3) ruby(5014,0xa000cf60) malloc: *** error: can't allocate region ruby(5014,0xa000cf60) malloc: *** set a breakpoint in szone_error to debug ruby(5014,0xa000cf60) malloc: *** vm_allocate(size=1069056) failed (error code=3) ruby(5014,0xa000cf60) malloc: *** error: can't allocate region ruby(5014,0xa000cf60) malloc: *** set a breakpoint in szone_error to debug [FATAL] failed to allocate memory Any thoughts? Regards, Erik -- Posted via http://www.ruby-forum.com/. _______________________________________________ Ferret-talk mailing list Ferret-talk at rubyforge.org http://rubyforge.org/mailman/listinfo/ferret-talk From dbalmain.ml at gmail.com Tue Jun 6 19:39:41 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 7 Jun 2006 08:39:41 +0900 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: <7968d7490606061011g21fd6118i2065c5b6d0187c78@mail.gmail.com> References: <0DCD9B50-667A-49C4-974B-B1513B69BFED@rectangular.com> <7968d7490606061011g21fd6118i2065c5b6d0187c78@mail.gmail.com> Message-ID: On 6/7/06, Lee Marlow wrote: > Do you mean that all fields would have to be known at index creation > time or just that once a field is defined it properties are the same > across all documents? Right now I'm indexing documents that create > new fields as needed based on user defined properties, so we don't > know all the fields initially. Hi Lee, Dynamic fields will definitely be remaining in Ferret. But, as you said, once a field is defined its properties are set for all documents. So in your case, you would set the default properties for a field to match those that you use for your user defined field. Otherwise you could use Index#add_field() to add a field with whatever properties you need. 
This functionality is going to exist in Ferret but not necessarily in Lucy. Could you describe in more detail what kind of user-defined properties you are indexing, to help convince Marvin that dynamic fields are a good thing?

Cheers,
Dave

> On 6/5/06, Marvin Humphrey wrote:
> >
> > On Jun 4, 2006, at 10:46 PM, David Balmain wrote:
> >
> > > In my experimental version of Ferret I have a fields
> > > file along with the segments file. The fields file stores all the
> > > field metadata such as store, index, term-vector and field boosts.
> > > That way there is no need to maintain a separate FieldInfos file per
> > > segment. (This will make merging a lot more difficult but I'm still
> > > thinking about that one.)
> >
> > Robert Kirchgessner made a similar proposal:
> >
> > http://xrl.us/m2qq (Link to mail-archives.apache.org)
> >
> > Robert addresses the merging issue in a subsequent email, and I think
> > his arguments are compelling.
> >
> > IMO, field defs should be immutable and consistent over the entire
> > index.
> >
> > >>> So here is a possible way example of the way I'd implement this;
> > >>>
> > >>> # the following might even look better in a YAML file.
> > >>
> > >> Ooo, nifty idea! How about a class whose sole purpose is to define
> > >> fields and generate the YAML file? Or, if we're thinking future
> > >> Lucene 2.1 file format, some Lucene-readable index definition file?
> > >
> > > Now this idea I like. Perhaps even a simple question/answer app to
> > > generate the index definition file. I'd guess that Lucene will
> > > probably end up going with XML rather than YAML.
> >
> > I think it would be a binary file, using Lucene's standard
> > writeString, writeVInt, etc. methods.
> >
> > A question/answer app could easily be built based around a module.
> >
> > How does "IndexCreator" sound? Take away the ability of the
> > IndexWriter module to create or redefine indexes, and encapsulate
> > that functionality within one module.
Using Java as our lingua
> > franca...
> >
> >    IndexCreator creator = new IndexCreator(filePath);
> >    FieldDefinition titleDef = new FieldDefinition("title",
> >        Field.Store.YES, Field.Index.TOKENIZED);
> >    FieldDefinition bodyDef = new FieldDefinition("body",
> >        Field.Store.YES, Field.Index.TOKENIZED,
> >        Field.TermVector.YES);
> >    creator.addFieldDefinition(titleDef);
> >    creator.addFieldDefinition(bodyDef);
> >    creator.createIndex();
> >
> > Marvin Humphrey
> > Rectangular Research
> > http://www.rectangular.com/

From dbalmain.ml at gmail.com Tue Jun 6 21:08:43 2006
From: dbalmain.ml at gmail.com (David Balmain)
Date: Wed, 7 Jun 2006 10:08:43 +0900
Subject: [Ferret-talk] Proposal of some radical changes to API
In-Reply-To:
References: <0DCD9B50-667A-49C4-974B-B1513B69BFED@rectangular.com> <7968d7490606061011g21fd6118i2065c5b6d0187c78@mail.gmail.com> <0C08F38B-E26B-42CF-9F30-6200FD122FA9@rectangular.com> <562a35c10606061137m6f2325e2v923ad13547fb59c9@mail.gmail.com>
Message-ID:

On 6/7/06, Marvin Humphrey wrote:
>
> On Jun 6, 2006, at 11:37 AM, Jan Prill wrote:
>
> > this statement tempted me to jump in, even without using something like dynamic field creation myself __right now__. But I have been - especially on cms like projects badly in need for dynamic fields.
> >
> > That something isn't common in sql doesn't mean that there is no need for this "something". This limitation of sql is the reason for doing things like storing xml in relational dbs as well as the reason for people using object dbs. I don't know if you had a look at dabble db, but imagine something like this with a relational dbms.
not funny! Because of this they haven't even thought about > > using sql for dabble db. So maybe it's just me but the argument: > > you can't do this in sql either doesn't sound too convincing... > > Jan, I don't understand the requirement, and I'm not familiar with > the either dabble db or Rails, so neither that example nor the > "models" example Dave cited earlier has spoken to me. I asked the > question because I honestly wanted to see a concrete example of an > application that couldn't be handled within the constraint of pre- > defined fields. > > Behind the scenes in Lucene is an elaborate, expensive apparatus for > dealing with dynamic fields. Each document gets turned into its own > miniature inverted index, complete with its own FieldInfos, > FieldsWriter, DocumentWriter, TermInfosWriter, and so on. When these > mini-indexes get merged, field definitions have to be reconciled. > This merge stage is one of the bottlenecks which slow down > interpreted-language ports of Lucene so severely, because there's a > lot of object creation and destruction and a lot of method calls. The way I'm dealing with this now is by having all the field definitions in a single file. When a field is defined it gets assigned a field number which is set for the life of the index. Hence, dynamic fields without the expense. > KinoSearch uses a fixed-field-definition model. Before you add any > documents to an index, you have to tell the index writer about all > the possible fields you might use. When you add the first document, > it creates the FieldInfos, FieldsWriter, etc, which persist > throughout the life of the index writer. Instead of reconciling > field definitions each time a document gets added, the field defs are > defined as invariant for that indexing session. This is much faster, > because there is far less object creation and destruction, and far > less disk shuffling as well -- no segment merging, therefore no > movement of stored fields, term vectors, etc. 
What happens when there are deletes? Which files should I look in to see how this works? I really need to get my head around the KinoSearch merge model.

> There are several possible ways to add dynamic fields back in to the fixed-field-def model. My main priority in doing so, if it proves to be necessary, is to keep table-alteration logic separate from insertion operations. Having the two conflated introduces needless complexity and computational expense at the back end. It's also just plain confusing -- if you accidentally forget to set OMIT_NORMS just once, all of a sudden that field is going to have norms for ever and ever amen. I think the user ought to have absolute control over field definitions. Inserting a field with a conflicting definition ought to be an error.

I mostly agree but I don't think it is too expensive (computationally or with regard to complexity) to dynamically add unknown fields with default properties.

> Lucy is going to start with the KinoSearch merge model. I will do a better job of adding dynamic capabilities to it if you or someone else can articulate some specific examples of situations where static definitions would not suffice. I can think of a few tasks which would be slightly more convenient if new fields could be added on the fly, but maybe you can go one better and illustrate why dynamic field defs are essential.

Hopefully Lee will be able to describe his needs in a little more detail. I must admit that in most cases dynamic fields just make things a little easier, but you could do without them. Having said that, I don't think Ferret would be a very Ruby-like search library if it didn't allow dynamic fields. Ruby allows me to add methods not only to the core classes but also to already instantiated objects. Coming from a language that didn't allow you to do things like this, you'd probably think this feature is totally unnecessary.

Earlier I said I'd be using Hashes as documents.
Here is an example of how I could add lazy loading to documents in Ferret: def get_doc(doc_num) doc = {} class < References: <0DCD9B50-667A-49C4-974B-B1513B69BFED@rectangular.com> <7968d7490606061011g21fd6118i2065c5b6d0187c78@mail.gmail.com> <0C08F38B-E26B-42CF-9F30-6200FD122FA9@rectangular.com> Message-ID: <7968d7490606062051i2c7ed5fatf2d5c410ac00aa78@mail.gmail.com> We index properties for products that vary from product to product. For instance, a shoe could have a color field with values of red, blue and green. It would also have a size field with 3,4,5,6,7,8,9,10 for values. Another product could be a car with transmission field with values automatic and manual. I index all the properties into their own field as well as dump them into another generic field for searching. In the database we have a property_types table where size, color, and transmission go. Then there is a many to many table from that to the products table that holds the acutal values of those properties (e.g. automatic, manual, red, green, 8, 9, etc.) I hope that helps explain it. -Lee On 6/6/06, Marvin Humphrey wrote: > > On Jun 6, 2006, at 10:11 AM, Lee Marlow wrote: > > > Do you mean that all fields would have to be known at index creation > > time or just that once a field is defined it properties are the same > > across all documents? Right now I'm indexing documents that create > > new fields as needed based on user defined properties, so we don't > > know all the fields initially. > > How would you handle this if you were using an SQL database rather > than Ferret? Your app wouldn't be able to modify the table on the > fly on that case, unless you did something insane like run a remote > "ALTER TABLE" command. 
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/

From marvin at rectangular.com Wed Jun 7 00:17:16 2006
From: marvin at rectangular.com (Marvin Humphrey)
Date: Tue, 6 Jun 2006 21:17:16 -0700
Subject: [Ferret-talk] Proposal of some radical changes to API
In-Reply-To:
References: <0DCD9B50-667A-49C4-974B-B1513B69BFED@rectangular.com> <7968d7490606061011g21fd6118i2065c5b6d0187c78@mail.gmail.com> <0C08F38B-E26B-42CF-9F30-6200FD122FA9@rectangular.com> <562a35c10606061137m6f2325e2v923ad13547fb59c9@mail.gmail.com>
Message-ID: <4A8FDAD8-D186-4901-B323-6DDAEED96B90@rectangular.com>

On Jun 6, 2006, at 6:08 PM, David Balmain wrote:

> What happens when there are deletes? Which files should I look in to see how this works? I really need to get my head around the KinoSearch merge model.

Let's say we're indexing a book. It has three pages.

    page 1 => "peas porridge hot"
    page 2 => "peas porridge cold"
    page 3 => "peas porridge in the pot, nine days old"

Here's what Lucene does:

First, create a mini-inverted index for each page...

    hot      => 1
    peas     => 1
    porridge => 1

    cold     => 2
    peas     => 2
    porridge => 2

    days     => 3
    in       => 3
    nine     => 3
    old      => 3
    peas     => 3
    porridge => 3
    pot      => 3

Then combine the indexes...

    cold     => 2
    days     => 3
    hot      => 1
    in       => 3
    nine     => 3
    old      => 3
    peas     => 1, 2, 3
    porridge => 1, 2, 3
    pot      => 3

... and here's what KinoSearch does:

First, dump everything into one giant pool...

    peas     => 1
    porridge => 1
    hot      => 1
    peas     => 2
    porridge => 2
    cold     => 2
    peas     => 3
    porridge => 3
    in       => 3
    pot      => 3
    the      => 3
    nine     => 3
    days     => 3
    old      => 3

...then sort the whole thing in one go.

Make sense?

The big problem with the KinoSearch method is that you can't just keep dumping stuff into an array indefinitely -- you'll run out of memory, duh!
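As an aside, the two inversion strategies described above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: `tokenize`, `invert_by_merging` and `invert_by_pooling` are made-up names, no stop words are removed, and none of it is Lucene or KinoSearch code.

```ruby
# Illustration only: hypothetical helpers, not Lucene or KinoSearch code.
PAGES = {
  1 => "peas porridge hot",
  2 => "peas porridge cold",
  3 => "peas porridge in the pot, nine days old"
}

# Split text into lowercase word tokens (a stand-in for a real analyzer).
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Lucene-style: invert each page on its own, then merge the mini-indexes.
def invert_by_merging(pages)
  mini_indexes = pages.map do |doc_num, text|
    tokenize(text).uniq.sort.map { |term| [term, [doc_num]] }.to_h
  end
  merged = Hash.new { |hash, term| hash[term] = [] }
  mini_indexes.each do |mini|
    mini.each { |term, docs| merged[term].concat(docs) }
  end
  merged.sort.to_h
end

# KinoSearch-style: dump every (term, doc) posting into one pool, sort once.
def invert_by_pooling(pages)
  pool = []
  pages.each do |doc_num, text|
    tokenize(text).each { |term| pool << [term, doc_num] }
  end
  index = {}
  pool.sort.uniq.each { |term, doc_num| (index[term] ||= []) << doc_num }
  index
end

merged_index = invert_by_merging(PAGES)
pooled_index = invert_by_pooling(PAGES)
```

Both functions produce the same term-to-pages mapping; they differ only in when the sorting work happens. The `pool` array is held entirely in memory here, which is exactly the limitation raised above.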
So what you need is an object that looks like an array that you can keep dumping stuff into forever. Then you "sort" that "array". That's where the external sort algorithm comes in. The sortex object is basically a PriorityQueue of unlimited size, but which never occupies more than 20 or 30 megs of RAM because it periodically sorts and flushes its payload to disk. It recovers that stuff from disk later -- in sorted order -- when it's in fetching mode. If you want to spelunk KinoSearch to see how this happens, start with Invindexer::add_doc(). After some minor fiddling, it feeds SegWriter::add_doc(). SegWriter goes through each field, having TokenBatch invert the field's contents, feeding the inverted and serialized but unordered postings into PostingsWriter (which is where the external sort object lives), and writing the norms byte. Last, SegWriter hands the Doc object to FieldsWriter so that it can write the stored fields. The most important part of the previous chain is the step that never happened: nobody ever invoked SegmentMerger by calling the equivalent of Lucene's maybeMergeSegments(). There IS no SegmentMerger in KinoSearch. The rest of the process takes place when InvIndexer::finish() gets called. This time, InvIndexer has a lot to do. First, InvIndexer has to decide which segments need to be merged, if any, which it does using an algorithm based on the fibonacci series. If there are segments that need mergin', InvIndexer feeds each one of them to SegWriter::add_segment(). SegWriter has DelDocs generate a doc map which maps around deleted documents (just like Lucene). Next it has FieldInfos reconcile the field defs and create a field number map, which maps field numbers from the segment that's about to get merged away to field numbers for the current segment. SegWriter merges the norms itself. 
Then it calls FieldsWriter::add_segment(), which reads fields off disk (without decompressing compressed fields, or creating document objects, or doing anything important except mapping to new field numbers) and writes them into their new home in the current segment. Last, SegWriter arranges for PostingsWriter::add_segment to dump all the postings from the old segment into the current sort pool -- which *still* hasn't been sorted -- mapping to new field and document numbers as it goes. (Think of add_segment as add_doc on steroids.) Now that all documents and all merge-worthy segments have been processed, it's finally time to deal with the sort pool. InvIndexer calls SegWriter::finish(), which calls PostingsWriter::finish(). PostingsWriter::finish() does a little bit in Perl, then hands off to a heavy-duty C routine that goes through the sort pool one posting at a time, writing the .frq and .prx files itself, and feeding TermInfosWriter so that it can write the .tis and .tii files. SegWriter::finish() also invokes closing routines for the FieldsWriter, the norms filehandles, and so on. Last, it writes the compound file. (For simplicity's sake, and because there isn't much benefit to using the non-compound format under the KinoSearch merge model, KinoSearch always uses the compound format). Now that all the writing is complete, InvIndexer has to commit the changes by rewriting the 'segments' file. One interesting aspect of the KinoSearch merge model is that no matter how many documents you add or segments you merge, if the process gets interrupted at any time up till that single commit, the index remains unchanged. In KinoSearch, InvIndexer handles deletions too (IndexReader isn't even a public class), and deletions -- at least those deletions which affect segments that haven't been optimized away -- are committed during at the same moment. Deletable files are deleted if possible, the write lock is released... TADA! We're done. ... 
and since I spent so much time writing this up, I don't have time to respond to the other points. Check y'all later... Marvin Humphrey Rectangular Research http://www.rectangular.com/ From blee at alumni.caltech.edu Wed Jun 7 17:13:46 2006 From: blee at alumni.caltech.edu (Ben Lee) Date: Wed, 7 Jun 2006 23:13:46 +0200 Subject: [Ferret-talk] Acts_as_ferret - phrase query Message-ID: Sorry if this is another basic question, I was wondering if anyone could point me to an example (or provide) one of using acts_as_ferret with a more complicated query? My understanding is that by default, if I pass a string to find_by_contents, it's an AND'd keyword query. In particular, I'm interested in phrase queries. Thanks, Ben -- Posted via http://www.ruby-forum.com/. From blee at alumni.caltech.edu Wed Jun 7 17:32:05 2006 From: blee at alumni.caltech.edu (Ben Lee) Date: Wed, 7 Jun 2006 23:32:05 +0200 Subject: [Ferret-talk] Acts_as_ferret - phrase query In-Reply-To: References: Message-ID: <94f8c5388917478438cd5eb01f514906@ruby-forum.com> Nevermind... didn't get that they were just getting passed to the QueryParser. Ben Ben Lee wrote: > Sorry if this is another basic question, I was wondering if anyone could > point me to an example (or provide) one of using acts_as_ferret with a > more complicated query? My understanding is that by default, if I pass > a string to find_by_contents, it's an AND'd keyword query. In > particular, I'm interested in phrase queries. > > Thanks, > Ben -- Posted via http://www.ruby-forum.com/. 
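
For readers who land on this question later: since the string is handed to the QueryParser, a phrase query is a matter of quoting the phrase inside the query string. The sketch below is hedged accordingly. `phrase_query` is a hypothetical helper, not part of Ferret or acts_as_ferret, the `Book` model in the comment is invented, and backslash-escaping of embedded quotes is an assumption about the parser's escape rules.

```ruby
# Hypothetical helper, not part of Ferret or acts_as_ferret.
# Builds a query string of the form  field:"some phrase".
def phrase_query(field, phrase)
  # Assumption: the query parser accepts backslash-escaped quotes.
  escaped = phrase.gsub(/(["\\])/) { "\\#{$1}" }
  %(#{field}:"#{escaped}")
end

query = phrase_query("title", "peas porridge")

# In a model that uses acts_as_ferret the string would be passed through
# unchanged, for example (Book is an invented model name):
#   Book.find_by_contents(query)
```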
From marvin at rectangular.com Thu Jun 8 01:06:36 2006
From: marvin at rectangular.com (Marvin Humphrey)
Date: Wed, 7 Jun 2006 22:06:36 -0700
Subject: [Ferret-talk] Proposal of some radical changes to API
In-Reply-To: <126EC586577FD611A28E00A0C9A037587E6C02@maui.bmsoft.com.au>
References: <126EC586577FD611A28E00A0C9A037587E6C02@maui.bmsoft.com.au>
Message-ID: <025EC0E9-48CE-4DBC-9196-D534B02C639D@rectangular.com>

On Jun 6, 2006, at 4:33 PM, Neville Burnell wrote:

>>> I asked the question because I honestly wanted to see a concrete example of an application that couldn't be handled within the constraint of pre-defined fields.
>
> My current application involves writing a web application which can search a ferret index built from a SQL database.
>
> The idea is that the customer supplies SQLs for say customers, suppliers, sales and purchases etc. The app then retrieves the rows from the datasource and indexes using Ferret. The app provides both a html website as an interface to the index, and also an XML api which can be used by non browser clients.
>
> The field set is quite different for each SQL [and is essentially out of our control].

So at what point does your app learn the structure of the SQL table? Would it work if you were to start each session by telling the index writer about the fields that were coming?

    def connect(field_names)
      field_names.each do |field_name|
        index.spec_field(field_name) # use default properties
      end
    end

    def add_to_index(submission)
      index.add_hash_as_doc(submission)
    end

I can imagine a scenario where that's not possible, and the fields may change up on each insert. In that case, under the interface I envision, you'd have to do something like...

    def add_to_index(submission)
      submission.each do |field_name, value|
        index.spec_field(field_name) # use default properties
      end
      index.add_hash_as_doc(submission)
    end

FWIW, this stuff is happening anyway, behind the scenes.
Essentially, every time you add a field to an index, Ferret asks, "Say, is this field indexed? And how about TermVectors, you want those?" The 10_000th time you add the field, Ferret asks, "This field wasn't indexed before -- have you changed your mind? OK, I'll check back again later."... 1_000_000th doc: "You sure? How about I make it indexed? Awwwww, c'mon... Hey, could you use some TermVectors?" When it makes sense, of course you want to simplify the interface and hide the complexity inside the library. However, given that it's not possible to make coherent updates to existing data within a Lucene- esque file format, my argument is that field definitions should never change. So the repeated calls to spec_field above would be completely redundant -- you'd get an error if you ever tried to change the field def. Your app would be a little less elegant, it's true (performance impact would be somewhere between insignificant and tiny unless you had a zillion very short fields). However, I think the use case where the fields are not known in advance is the exception rather than the rule. It would also be possible to use Dave's polymorphic hash-as-doc technique, where if the hash value is a Field object, you spec out the field definition using that Field object's properties -- you would just use full-on Field objects for each field. My argument would be, again, that the field definitions should not change. If you don't agree with that and the definition has to be modifiable (within the current constraints), then that single-method technique is probably better. However, if the definition is not modifiable, then I'd argue it's cleaner to separate the two functions. 
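
The behaviour argued for here, where repeated identical definitions are harmless and a conflicting one is an error, can be sketched roughly as follows. Everything in it is an assumption made for illustration: `FieldRegistry`, `ConflictingDefinition`, `DEFAULTS` and the property names are invented, and only the method name `spec_field` is borrowed from the example above. It is not Ferret, Lucy or KinoSearch API.

```ruby
# Hypothetical sketch; FieldRegistry is not a Ferret, Lucy or KinoSearch class.
class FieldRegistry
  class ConflictingDefinition < StandardError; end

  # Invented default properties, used when none are given.
  DEFAULTS = { store: true, index: true, term_vector: false }.freeze

  def initialize
    @defs = {}     # field name => frozen property hash
    @numbers = {}  # field name => field number, fixed once assigned
  end

  # Define a field and return its field number. Repeating an identical
  # definition is a no-op; a different definition for a known field raises.
  def spec_field(name, props = {})
    name = name.to_s
    wanted = DEFAULTS.merge(props).freeze
    if @defs.key?(name)
      unless @defs[name] == wanted
        raise ConflictingDefinition,
              "field #{name} is already defined as #{@defs[name]}"
      end
    else
      @numbers[name] = @numbers.size
      @defs[name] = wanted
    end
    @numbers[name]
  end

  # Look up the immutable properties of a known field.
  def properties(name)
    @defs.fetch(name.to_s)
  end
end

registry = FieldRegistry.new
title_num = registry.spec_field("title")                    # first definition
body_num  = registry.spec_field("body", term_vector: true)  # non-default props
again_num = registry.spec_field("title")                    # redundant, allowed
```

One consequence of this design is that calling `spec_field("body")` with default properties after the definition above would raise, so an application that relies on defaults has to use them consistently.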
Marvin Humphrey Rectangular Research http://www.rectangular.com/ From dbalmain.ml at gmail.com Thu Jun 8 02:04:16 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Thu, 8 Jun 2006 15:04:16 +0900 Subject: [Ferret-talk] Proposal of some radical changes to API In-Reply-To: <025EC0E9-48CE-4DBC-9196-D534B02C639D@rectangular.com> References: <126EC586577FD611A28E00A0C9A037587E6C02@maui.bmsoft.com.au> <025EC0E9-48CE-4DBC-9196-D534B02C639D@rectangular.com> Message-ID: On 6/8/06, Marvin Humphrey wrote: > > On Jun 6, 2006, at 4:33 PM, Neville Burnell wrote: > > >>> I asked the question because I honestly wanted to see a concrete > >>> example of an application that couldn't be handled within the > >>> constraint of pre- defined fields. > > > > My current application involves writing a web application which can > > seach a ferret index built from a SQL database. > > > > The idea is that the customer supplies SQLs for say customers, > > suppliers, sales and puchases etc. The app then retrieves the rows > > from > > the datasource and indexes using Ferret. The app provides both a html > > website as an interface to the index, and also an XML api which can be > > used by non browser clients. > > > > The field set is quite different for each SQL [and is essentially > > out of > > our control]. > > So at what point does your app learn the structure of the SQL table? > Would it work if you were to start each session by telling the index > writer about the fields that were coming? > > def connect(field_names) > field_names.each do |field_name| > index.spec_field(field_name) # use default properties > end > end > > def add_to_index(submission) > index.add_hash_as_doc(submission) > end > > I can imagine a scenario where that's not possible, and the fields > may change up on each insert. In that case, under the interface I > envision, you'd have to do something like... 
> > def add_to_index(submission) > submission.each do |field_name, value| > index.spec_field(field_name) # use default properties > end > index.add_hash_as_doc(submission) > end > > FWIW, this stuff is happening anyway, behind the scenes. > Essentially, every time you add a field to an index, Ferret asks, > "Say, is this field indexed? And how about TermVectors, you want > those?" The 10_000th time you add the field, Ferret asks, "This > field wasn't indexed before -- have you changed your mind? OK, I'll > check back again later."... 1_000_000th doc: "You sure? How about I > make it indexed? Awwwww, c'mon... Hey, could you use some TermVectors?" > > When it makes sense, of course you want to simplify the interface and > hide the complexity inside the library. However, given that it's not > possible to make coherent updates to existing data within a Lucene- > esque file format, my argument is that field definitions should never > change. So the repeated calls to spec_field above would be > completely redundant -- you'd get an error if you ever tried to > change the field def. > > Your app would be a little less elegant, it's true (performance > impact would be somewhere between insignificant and tiny unless you > had a zillion very short fields). However, I think the use case > where the fields are not known in advance is the exception rather > than the rule. > > It would also be possible to use Dave's polymorphic hash-as-doc > technique, where if the hash value is a Field object, you spec out > the field definition using that Field object's properties -- you > would just use full-on Field objects for each field. My argument > would be, again, that the field definitions should not change. If > you don't agree with that and the definition has to be modifiable > (within the current constraints), then that single-method technique > is probably better. However, if the definition is not modifiable, > then I'd argue it's cleaner to separate the two functions. 
I completely agree with you that field definitions should not change once they are set. However, I don't think having the library add missing fields with a default set of values (which would be set when you create the index) adds too much complexity. You simply need to check whether the field already exists. You already have to look up the field number anyway. So, to add dynamic fields, simply check to make sure a valid field number was found and add the field if it wasn't.

Of course this is just as easy to implement in the binding code so I don't mind whether it gets into Lucy core or not. As long as you can add new fields to an index after documents have been added, I'm happy, and it seems from your example (nice ruby code by the way) that that is your plan.

Dave

From Neville.Burnell at bmsoft.com.au Thu Jun 8 03:03:14 2006
From: Neville.Burnell at bmsoft.com.au (Neville Burnell)
Date: Thu, 8 Jun 2006 17:03:14 +1000
Subject: [Ferret-talk] Proposal of some radical changes to API
Message-ID: <126EC586577FD611A28E00A0C9A037587E6C17@maui.bmsoft.com.au>

>> So at what point does your app learn the structure of the SQL table?

At the moment I know the structure after executing the SQL and fetching the first row [a ruby hash]. But the field set will change from SQL to SQL, and Ferret is doing all the field specification for me via hash-as-doc, ala.

    def create
      @index = Ferret::Index::Index.new()

      conn = ODBC.connect(@odbc[:dsn], @odbc[:uid], @odbc[:pwd])
      @sqls.each do |sql|
        stmt = conn.prepare(sql)
        stmt.execute.each_hash{ |row| @index << row }
        stmt.close
        stmt.drop
      end
      conn.disconnect
    end

The field definitions do not change though, so I'm happy as long as the hash-as-doc support remains in Ferret.
Cheers, Neville -----Original Message----- From: ferret-talk-bounces at rubyforge.org [mailto:ferret-talk-bounces at rubyforge.org] On Behalf Of Marvin Humphrey Sent: Thursday, 8 June 2006 3:07 PM To: ferret-talk at rubyforge.org Subject: Re: [Ferret-talk] Proposal of some radical changes to API On Jun 6, 2006, at 4:33 PM, Neville Burnell wrote: >>> I asked the question because I honestly wanted to see a concrete >>> example of an application that couldn't be handled within the >>> constraint of pre- defined fields. > > My current application involves writing a web application which can > seach a ferret index built from a SQL database. > > The idea is that the customer supplies SQLs for say customers, > suppliers, sales and puchases etc. The app then retrieves the rows > from the datasource and indexes using Ferret. The app provides both a > html website as an interface to the index, and also an XML api which > can be used by non browser clients. > > The field set is quite different for each SQL [and is essentially out > of our control]. So at what point does your app learn the structure of the SQL table? Would it work if you were to start each session by telling the index writer about the fields that were coming? def connect(field_names) field_names.each do |field_name| index.spec_field(field_name) # use default properties end end def add_to_index(submission) index.add_hash_as_doc(submission) end I can imagine a scenario where that's not possible, and the fields may change up on each insert. In that case, under the interface I envision, you'd have to do something like... def add_to_index(submission) submission.each do |field_name, value| index.spec_field(field_name) # use default properties end index.add_hash_as_doc(submission) end FWIW, this stuff is happening anyway, behind the scenes. Essentially, every time you add a field to an index, Ferret asks, "Say, is this field indexed? And how about TermVectors, you want those?" 
The 10_000th time you add the field, Ferret asks, "This field wasn't indexed before -- have you changed your mind? OK, I'll check back again later."... 1_000_000th doc: "You sure? How about I make it indexed? Awwwww, c'mon... Hey, could you use some TermVectors?" When it makes sense, of course you want to simplify the interface and hide the complexity inside the library. However, given that it's not possible to make coherent updates to existing data within a Lucene- esque file format, my argument is that field definitions should never change. So the repeated calls to spec_field above would be completely redundant -- you'd get an error if you ever tried to change the field def. Your app would be a little less elegant, it's true (performance impact would be somewhere between insignificant and tiny unless you had a zillion very short fields). However, I think the use case where the fields are not known in advance is the exception rather than the rule. It would also be possible to use Dave's polymorphic hash-as-doc technique, where if the hash value is a Field object, you spec out the field definition using that Field object's properties -- you would just use full-on Field objects for each field. My argument would be, again, that the field definitions should not change. If you don't agree with that and the definition has to be modifiable (within the current constraints), then that single-method technique is probably better. However, if the definition is not modifiable, then I'd argue it's cleaner to separate the two functions. 
Marvin Humphrey Rectangular Research http://www.rectangular.com/ _______________________________________________ Ferret-talk mailing list Ferret-talk at rubyforge.org http://rubyforge.org/mailman/listinfo/ferret-talk From michael at koziarski.com Sun Jun 11 06:08:02 2006 From: michael at koziarski.com (Michael Koziarski) Date: Sun, 11 Jun 2006 12:08:02 +0200 Subject: [Ferret-talk] Bus Error with Ferret 0.9.3 using the BooleanQuery api Message-ID: <590e5e196dc8248a49d05ee5e480ecf6@ruby-forum.com> Hey guys, I've been trying out ferret 0.9.3 on my powerbook this weekend and I've been triggering 'bus errors' when using the Query API. If I programmatically build up strings, it works just fine. There's some more information available in the trac ticket http://ferret.davebalmain.com/trac/ticket/62 Is anyone successfully using the Query API on mac os x? Anything I can do to help debug this? -- Cheers Koz -- Posted via http://www.ruby-forum.com/. From alex at blackkettle.org Mon Jun 12 07:07:16 2006 From: alex at blackkettle.org (Alex Young) Date: Mon, 12 Jun 2006 12:07:16 +0100 Subject: [Ferret-talk] Ferret Win32 Gem for windows users ... In-Reply-To: <4483EA12.3020003@blackkettle.org> References: <126EC586577FD611A28E00A0C9A037587E6BE6@maui.bmsoft.com.au> <4483EA12.3020003@blackkettle.org> Message-ID: <448D4AE4.8040603@blackkettle.org> Alex Young wrote: > David Balmain wrote: >> On 6/5/06, Neville Burnell wrote: >> >>> >>> Hi and thanks for Ferret! >>> >>> I'm wondering if it would be possible to create a Ferret Win32 gem which >>> includes the c performance code pre-compiled for those of us without a C >>> compiler handy ? >> >> Unfortunately not yet. Alex Young may be working on it. The problem is >> that Ferret currently doesn't compile under Visual C 6 so there is >> some porting that needs to be done. > > Indeed I am. Or shall be this week. 
I'll be trying for something on > Friday at the latest, but I can't promise that what I come up with will > be useful to anyone but me... I will report back either way, though. > Bah. Last week ran away with me, and I didn't get a chance to finish off what I was doing in the end. Watch this space... -- Alex From crafterm at gmail.com Mon Jun 12 22:29:44 2006 From: crafterm at gmail.com (Marcus Crafter) Date: Tue, 13 Jun 2006 04:29:44 +0200 Subject: [Ferret-talk] Location of match? In-Reply-To: References: <8E50D95E-0C46-4901-B461-63A03AE6FE8C@likealightbulb.com> Message-ID: Hi David, David Balmain wrote: > A search result highlighter is coming in a future version of Ferret. > This will enable you to find the position of the match in a document. > I can't say when. This would be awesome and also what I'm looking for too - has there been any progress on this at all since you're last post we might be able to take a look at? Can we help in any way? Cheers, Marcus -- Posted via http://www.ruby-forum.com/. From crafterm at gmail.com Tue Jun 13 01:49:30 2006 From: crafterm at gmail.com (Marcus Crafter) Date: Tue, 13 Jun 2006 07:49:30 +0200 Subject: [Ferret-talk] Grep style output? Message-ID: <4f1a5f63529e061bf952382b805a23cd@ruby-forum.com> Hi All, Hope all is going well. Was just wondering if anyone has implemented a grep style output page of hits using Ferret as the index/query engine? Any thoughts about how best to implement it? The previous thread discussess highlighting - would that be the best approach to follow or is there a better way? Cheers, Marcus -- Posted via http://www.ruby-forum.com/. From moedusa at gmail.com Tue Jun 13 08:41:41 2006 From: moedusa at gmail.com (Philipp Chudinov) Date: Tue, 13 Jun 2006 18:41:41 +0600 Subject: [Ferret-talk] Obtaining write lock when trying to write index error Message-ID: <6f5413930606130541h79e4f583w5f80f235f35c1583@mail.gmail.com> Haloo. 
I've noticed some weird behaviour while trying Ferret with Rails *without* the acts_as_ferret plugin: when I start the application under lighttpd, proxying requests to spawned fcgi processes, I see this:

    : Error occured at :703
    Error: exception 6 not handled: Could not obtain write lock when trying to write index

At the same time, everything works fine with WEBrick. The indexing part of the model is as simple as

    def index
      index = get_index
      index << to_ferret_document # [RAILS_ROOT]/lib/searchable_model.rb:5:in `<<'
      index.flush
      index.optimize
      index.close
    end

    def get_index
      Ferret::Index::Index.new(:path => APP::CONFIG[:store_indexes_directory],
                               :analyzer => FerretConfig::CONTENT_ANALYZER,
                               :create_if_missing => true)
    end

I supposed that working with the index this way would be okay (according to http://ferret.davebalmain.com/trac/wiki/HowTos), but it looks like I've missed some kind of top-secret Ferret config parameter or something, so couldn't you share that trick with me? (I'm sure everyone knows it.)

Thanks.
moe.

From anatol.pomozov at gmail.com Tue Jun 13 08:50:10 2006
From: anatol.pomozov at gmail.com (Anatol Pomozov)
Date: Tue, 13 Jun 2006 14:50:10 +0200
Subject: [Ferret-talk] Ferret Win32 Gem for windows users ...
In-Reply-To: <448D4AE4.8040603@blackkettle.org>
References: <126EC586577FD611A28E00A0C9A037587E6BE6@maui.bmsoft.com.au> <4483EA12.3020003@blackkettle.org> <448D4AE4.8040603@blackkettle.org>
Message-ID: <0cc93aa5b293ee16fff9e44b63401803@ruby-forum.com>

Alex Young wrote:
> Alex Young wrote:
>>> Unfortunately not yet. Alex Young may be working on it. The problem is that Ferret currently doesn't compile under Visual C 6 so there is some porting that needs to be done.
>>
>> Indeed I am. Or shall be this week. I'll be trying for something on Friday at the latest, but I can't promise that what I come up with will be useful to anyone but me... I will report back either way, though.
>>
> Bah.
Last week ran away with me, and I didn't get a chance to finish > off what I was doing in the end. Watch this space... Probably compiling using MinGW would be choise?? AFAIK it is much more compatible with gcc (it uses ported gcc). I am not sure how it would work with ruby-core that compiled using MSVC, but.. have a try... -- Posted via http://www.ruby-forum.com/. From moedusa at gmail.com Tue Jun 13 09:08:19 2006 From: moedusa at gmail.com (Philipp Chudinov) Date: Tue, 13 Jun 2006 19:08:19 +0600 Subject: [Ferret-talk] Obtaining write lock when trying to write index error Message-ID: <6f5413930606130608r746bf77ib5fa904b3dee6501@mail.gmail.com> Well, the problem was that there was an old lock file in the index directory. The trick was to cleanup index from any locks if application crashes or smth. :) From alex at blackkettle.org Tue Jun 13 09:33:36 2006 From: alex at blackkettle.org (Alex Young) Date: Tue, 13 Jun 2006 14:33:36 +0100 Subject: [Ferret-talk] Ferret Win32 Gem for windows users ... In-Reply-To: <0cc93aa5b293ee16fff9e44b63401803@ruby-forum.com> References: <126EC586577FD611A28E00A0C9A037587E6BE6@maui.bmsoft.com.au> <4483EA12.3020003@blackkettle.org> <448D4AE4.8040603@blackkettle.org> <0cc93aa5b293ee16fff9e44b63401803@ruby-forum.com> Message-ID: <448EBEB0.2010401@blackkettle.org> Anatol Pomozov wrote: > Alex Young wrote: > >>Alex Young wrote: >> >>>>Unfortunately not yet. Alex Young may be working on it. The problem is >>>>that Ferret currently doesn't compile under Visual C 6 so there is >>>>some porting that needs to be done. >>> >>>Indeed I am. Or shall be this week. I'll be trying for something on >>>Friday at the latest, but I can't promise that what I come up with will >>>be useful to anyone but me... I will report back either way, though. >>> >> >>Bah. Last week ran away with me, and I didn't get a chance to finish >>off what I was doing in the end. Watch this space... > > > Probably compiling using MinGW would be choise?? 
AFAIK it is much more
> compatible with gcc (it uses ported gcc). I am not sure how it would
> work with ruby-core that compiled using MSVC, but.. have a try...
>
That's the problem - Ferret compiled under MinGW isn't compatible with
the OCI ruby. As it stands, I'm recompiling ruby under MinGW, and then
attacking the extensions. The medium-term goal is to be able to give
some pointers to Curt Hibbs et al (should they need them - I'm sure
they've got it in hand) to be able to replace the MSVC build with a
MinGW build in the One-Click Installer.

--
Alex

From crafterm at gmail.com Wed Jun 14 03:22:34 2006
From: crafterm at gmail.com (Marcus Crafter)
Date: Wed, 14 Jun 2006 09:22:34 +0200
Subject: [Ferret-talk] In memory IndexReader bug?
Message-ID: <71f43b0170e73d2fbb415d49d6cdcb11@ruby-forum.com>

Hi All,

Hope all is going well.

I'm having trouble with the following code creating an in memory index
reader - it seems to be attempting to read from a file regardless.
Here's the simple code:

    require 'rubygems'
    require 'ferret'

    a = Ferret::Index::Index.new
    r = Ferret::Index::IndexReader.new(nil)

Running the code on my OS X machine gives:

    marcus-crafters-powerbook-g4-17:/tmp crafterm$ ruby t.rb
    t.rb:5:in `initialize': : Error occured at :318 (Exception)
    Error: exception 2 not handled: Couldn't open the file to read
            from t.rb:5

The IndexReader API says pass nil in for an in memory directory, so I'm
not sure what's wrong.

Is this a bug - any ideas at all? This is ferret 0.9.3 for reference.

Cheers,

Marcus

--
Posted via http://www.ruby-forum.com/.

From shingler at gmail.com Thu Jun 15 11:28:44 2006
From: shingler at gmail.com (steven)
Date: Thu, 15 Jun 2006 17:28:44 +0200
Subject: [Ferret-talk] best updating method
Message-ID: <9725b8adcd24a46db77c015c9017cf77@ruby-forum.com>

Hi All,

I have a Ferret index containing some cached RSS feeds.

I have a nightly cron script to cache the feeds, and I'd like to update
the index with the latest feeds.
I see the Index class has an update method, but I can't work out how to get the id of the relevant document to pass in. Lets say I have a file called "google_news.xml" I want to go: my_index.update(google_id, google_doc) I'm sure this is way too easy and I'm being massively dumb, but - - any hints/advice gratefully received. Many Thanks, Steven -- Posted via http://www.ruby-forum.com/. From jbensley.ng at gmail.com Thu Jun 15 11:36:28 2006 From: jbensley.ng at gmail.com (Jeremy Bensley) Date: Thu, 15 Jun 2006 10:36:28 -0500 Subject: [Ferret-talk] best updating method In-Reply-To: <9725b8adcd24a46db77c015c9017cf77@ruby-forum.com> References: <9725b8adcd24a46db77c015c9017cf77@ruby-forum.com> Message-ID: The way I usually handle updates like this is to store the filename in the index as a different field in the document. You can then search the index for that filename, get the index for that entry, and update accordingly. On 6/15/06, steven wrote: > > Hi All, > > I have a Ferret index containing some cached RSS feeds. > > I have a nightly cron script to cache the feeds, and I'd like to update > the index with the latest feeds. > > I see the Index class has an update method, but I can't work out how to > get the id of the relevant document to pass in. > > Lets say I have a file called "google_news.xml" > > I want to go: > my_index.update(google_id, google_doc) > > I'm sure this is way too easy and I'm being massively dumb, but - - any > hints/advice gratefully received. > > Many Thanks, > Steven > > -- > Posted via http://www.ruby-forum.com/. > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20060615/2f4f5e00/attachment.htm

From sergei at redleafsoft.com Thu Jun 15 14:05:05 2006
From: sergei at redleafsoft.com (Sergei Serdyuk)
Date: Thu, 15 Jun 2006 20:05:05 +0200
Subject: [Ferret-talk] Ferret::Analysis::PerFieldAnalyzerWrapper is not exported
Message-ID: <1c64d8a934570a7782a6418f8193ea6c@ruby-forum.com>

Hi,

I am on Ferret 0.9.3 and it seems to me that
Ferret::Analysis::PerFieldAnalyzerWrapper is not available in
ferret_ext.

--
Sergei Serdyuk
Red Leaf Software LLC
web: http://redleafsoft.com

--
Posted via http://www.ruby-forum.com/.

From sergei at redleafsoft.com Thu Jun 15 16:17:44 2006
From: sergei at redleafsoft.com (Sergei Serdyuk)
Date: Thu, 15 Jun 2006 22:17:44 +0200
Subject: [Ferret-talk] Finding out all terms from search results. How?
Message-ID:

Hi everybody,

I need to find out all terms (field values) from one of the fields from
a set of documents returned by a search. In other words, I have indexed
documents with two fields. I search on one field and then want to know
all of the other field's values from the found documents. How?

--
Sergei Serdyuk
Red Leaf Software LLC
web: http://redleafsoft.com

--
Posted via http://www.ruby-forum.com/.

From Neville.Burnell at bmsoft.com.au Thu Jun 15 19:48:49 2006
From: Neville.Burnell at bmsoft.com.au (Neville Burnell)
Date: Fri, 16 Jun 2006 09:48:49 +1000
Subject: [Ferret-talk] Finding out all terms from search results. How?
Message-ID: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au>

How about something like this, where "field2" is the field you want to
collect:

    values = []
    index.search_each(query) do |doc, score|
      values.push index[doc]["field2"]
    end

-----Original Message-----
From: ferret-talk-bounces at rubyforge.org
[mailto:ferret-talk-bounces at rubyforge.org] On Behalf Of Sergei Serdyuk
Sent: Friday, 16 June 2006 6:18 AM
To: ferret-talk at rubyforge.org
Subject: [Ferret-talk] Finding out all terms from search results. How?
Hi everybody, I need to find out all terms (field values) from one of the fields from a set of documents returned by search. In other words, I have indexed documents with two fields. I do search on one field and then want to know all other field's values from fount documents. How? -- Sergei Serdyuk Red Leaf Software LLC web: http://redleafsoft.com -- Posted via http://www.ruby-forum.com/. _______________________________________________ Ferret-talk mailing list Ferret-talk at rubyforge.org http://rubyforge.org/mailman/listinfo/ferret-talk From sergei at redleafsoft.com Fri Jun 16 09:59:35 2006 From: sergei at redleafsoft.com (Sergei Serdyuk) Date: Fri, 16 Jun 2006 15:59:35 +0200 Subject: [Ferret-talk] Finding out all terms from search results. How? In-Reply-To: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au> References: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au> Message-ID: Hi Neville, It would work for a small resultset, but that is not an assumption I would want to make. I hope there is a way to get this info from Ferret directly. Sergei. Neville Burnell wrote: > How about something like this, where "field2" is the field you want to > collect > > values = [] > index.search_each(query) do |doc, score| > values.push index[doc]["field2"] > end -- Posted via http://www.ruby-forum.com/. From lmarlow at yahoo.com Fri Jun 16 10:46:05 2006 From: lmarlow at yahoo.com (Lee Marlow) Date: Fri, 16 Jun 2006 08:46:05 -0600 Subject: [Ferret-talk] Finding out all terms from search results. How? In-Reply-To: References: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au> Message-ID: <7968d7490606160746h6a0150d8ofcc4ee1e22f3107@mail.gmail.com> Why would this only work for a small resultset? Are you looking for a list of terms from the other field as tokenized by ferret or for just the value you put in that field during indexing? 
-Lee

On 6/16/06, Sergei Serdyuk wrote:
> Hi Neville,
>
> It would work for a small resultset, but that is not an assumption I
> would want to make. I hope there is a way to get this info from Ferret
> directly.
>
> Sergei.
>
>
>
>
> Neville Burnell wrote:
> > How about something like this, where "field2" is the field you want to
> > collect
> >
> > values = []
> > index.search_each(query) do |doc, score|
> > values.push index[doc]["field2"]
> > end
>
>
> --
> Posted via http://www.ruby-forum.com/.
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
>

From jbensley.ng at gmail.com Fri Jun 16 11:11:16 2006
From: jbensley.ng at gmail.com (Jeremy Bensley)
Date: Fri, 16 Jun 2006 10:11:16 -0500
Subject: [Ferret-talk] Finding out all terms from search results. How?
In-Reply-To: <7968d7490606160746h6a0150d8ofcc4ee1e22f3107@mail.gmail.com>
References: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au>
 <7968d7490606160746h6a0150d8ofcc4ee1e22f3107@mail.gmail.com>
Message-ID:

While I don't completely understand all the constraints, it seems as
though a generalized version of Neville's solution that goes through all
fields in the document would work just fine, i.e.

    fields = []
    index.search_each(query) do |doc, score|
      # doc is a document number here, so fetch the document first
      fields += index[doc].all_fields
    end
    values = fields.collect { |f| f.string_value }

I don't really know what part of 'Ferret doing this' would be ... the
information would have to be stored and retrieved from the index. Please
elaborate if we do not seem to completely understand the problem.

On 6/16/06, Lee Marlow wrote:
>
> Why would this only work for a small resultset? Are you looking for a
> list of terms from the other field as tokenized by ferret or for just
> the value you put in that field during indexing?
> > -Lee > > On 6/16/06, Sergei Serdyuk wrote: > > Hi Neville, > > > > It would work for a small resultset, but that is not an assumption I > > would want to make. I hope there is a way to get this info from Ferret > > directly. > > > > Sergei. > > > > > > > > > > Neville Burnell wrote: > > > How about something like this, where "field2" is the field you want to > > > collect > > > > > > values = [] > > > index.search_each(query) do |doc, score| > > > values.push index[doc]["field2"] > > > end > > > > > > -- > > Posted via http://www.ruby-forum.com/. > > _______________________________________________ > > Ferret-talk mailing list > > Ferret-talk at rubyforge.org > > http://rubyforge.org/mailman/listinfo/ferret-talk > > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20060616/8cbefdeb/attachment.htm From sergei at redleafsoft.com Fri Jun 16 12:45:23 2006 From: sergei at redleafsoft.com (Sergei Serdyuk) Date: Fri, 16 Jun 2006 18:45:23 +0200 Subject: [Ferret-talk] Finding out all terms from search results. How? In-Reply-To: References: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au> <7968d7490606160746h6a0150d8ofcc4ee1e22f3107@mail.gmail.com> Message-ID: <81cb07fc49a20f16ab846c3a297e6b7b@ruby-forum.com> Let me illustrate my problem a bit more. There is an index with 1.2M books in it. Every book has category field and every book can be currently in stock, which is stored in stock field. Now, I generally expect to have 50-60% of books to be stocked. So it leaves me with 600,000 books I would need to iterate to find out what categories are currently stocked. 
It sounds like borderline task where one would think a database would be more appropriate, but ability to do advanced search over this collection of books is a top priority and database would not provide that. -- Sergei Serdyuk Red Leaf Software LLC web: http://redleafsoft.com -- Posted via http://www.ruby-forum.com/. From sergei at redleafsoft.com Fri Jun 16 12:51:09 2006 From: sergei at redleafsoft.com (Sergei Serdyuk) Date: Fri, 16 Jun 2006 18:51:09 +0200 Subject: [Ferret-talk] Finding out all terms from search results. How? In-Reply-To: References: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au> <7968d7490606160746h6a0150d8ofcc4ee1e22f3107@mail.gmail.com> Message-ID: I would think that it can provide a set of terms that are connected to a set of documents without pulling out those documents one by one. -- Sergei Serdyuk Red Leaf Software LLC web: http://redleafsoft.com > Jeremy Bensley wrote: > I don't really know what part of 'Ferret doing this' would be ... the > information would have to be stored and retrieved from the index. Please > elaborate if we do not seem to completely understand the problem. -- Posted via http://www.ruby-forum.com/. From erik at ehatchersolutions.com Fri Jun 16 12:54:38 2006 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Fri, 16 Jun 2006 12:54:38 -0400 Subject: [Ferret-talk] Finding out all terms from search results. How? In-Reply-To: <81cb07fc49a20f16ab846c3a297e6b7b@ruby-forum.com> References: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au> <7968d7490606160746h6a0150d8ofcc4ee1e22f3107@mail.gmail.com> <81cb07fc49a20f16ab846c3a297e6b7b@ruby-forum.com> Message-ID: <54D3B299-8C98-413C-98D6-70FF7CDDD257@ehatchersolutions.com> I'm not familiar enough with Ferret, but I do this sort filtering and set intersections with Java Lucene, primarily using Solr, from a Ruby on Rails front-end. 
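A minimal sketch of the set-intersection idea in plain Ruby, offered as an
illustration rather than as Ferret or Solr code. Integers stand in for bit
sets, and the sample data, the constants and the method names are all made
up for this example; how the per-category sets would actually be filled
from an index is left open here.

```ruby
# Plain-Ruby sketch of facet counting with bit sets. No Ferret or Solr
# calls are used; everything below is hypothetical sample data.

# Hypothetical sample: document number => [category, in_stock]
BOOKS = {
  0 => ['fiction', true],
  1 => ['fiction', false],
  2 => ['poetry',  true],
  3 => ['history', false],
  4 => ['poetry',  true]
}

# Build one bit set (an Integer used as a bit field) per category, plus
# one for "in stock". Bit n is set when document n belongs to the set.
def build_bitsets(books)
  by_category = Hash.new(0)
  in_stock = 0
  books.each do |doc_id, (category, stocked)|
    by_category[category] |= (1 << doc_id)
    in_stock |= (1 << doc_id) if stocked
  end
  [by_category, in_stock]
end

# Number of set bits in a non-negative Integer.
def bit_count(bits)
  bits.to_s(2).count('1')
end

# AND every category set with the filter set and keep the non-empty
# results, without touching any document.
def facet_counts(by_category, filter)
  counts = {}
  by_category.each do |category, bits|
    n = bit_count(bits & filter)
    counts[category] = n if n > 0
  end
  counts
end

by_category, in_stock = build_bitsets(BOOKS)
STOCKED_COUNTS = facet_counts(by_category, in_stock)
puts STOCKED_COUNTS.inspect
```

The point of the design is that the per-category sets are built once and
cached; answering "which categories are stocked" is then one AND per
category rather than one lookup per matching book.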
I build up bit sets (using Solr's new OpenBitSet class) that represent "all items collected" and apply that filter to searches and also intersect (using bit set ANDing) with other sets such as "all objects from 1861" and "all poetry genre objects", and so on. I've also customized Solr to return back facet counts, so given your example it could show how many books were in stock in each category and allow you to filter to see all those books easily too. Using these types of set intersection operations even bypasses the traditional Lucene search by simply dealing with efficiently structure sets of document id's. Erik On Jun 16, 2006, at 12:45 PM, Sergei Serdyuk wrote: > Let me illustrate my problem a bit more. > > There is an index with 1.2M books in it. Every book has category field > and every book can be currently in stock, which is stored in stock > field. Now, I generally expect to have 50-60% of books to be > stocked. So > it leaves me with 600,000 books I would need to iterate to find out > what > categories are currently stocked. > > It sounds like borderline task where one would think a database > would be > more appropriate, but ability to do advanced search over this > collection > of books is a top priority and database would not provide that. > > -- > Sergei Serdyuk > Red Leaf Software LLC > web: http://redleafsoft.com > > -- > Posted via http://www.ruby-forum.com/. > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk From sergei at redleafsoft.com Fri Jun 16 15:03:26 2006 From: sergei at redleafsoft.com (Sergei Serdyuk) Date: Fri, 16 Jun 2006 21:03:26 +0200 Subject: [Ferret-talk] best updating method In-Reply-To: <9725b8adcd24a46db77c015c9017cf77@ruby-forum.com> References: <9725b8adcd24a46db77c015c9017cf77@ruby-forum.com> Message-ID: <02ab204b28868f301019b4b6614e3b4e@ruby-forum.com> In this case sounds like RSS feed URL is your natural primary key. 
You could add an untokenized 'id' field to your documents and then
retrieve and update them using URLs as keys. You could even have a more
natural field name if you create the index with some optional params.

Example:

    url = 'http://feeds.feedburner.com/RidingRails'
    index = Ferret::Index::Index.new(:path => "#{RAILS_ROOT}/db/ferret",
                                     :id_field => 'url')

    document = Ferret::Document::Document.new
    document << Ferret::Document::Field.new('url', url,
                  Ferret::Document::Field::Store::YES,
                  Ferret::Document::Field::Index::UNTOKENIZED)
    document << Ferret::Document::Field.new('content', 'Rails are great!',
                  Ferret::Document::Field::Store::YES,
                  Ferret::Document::Field::Index::TOKENIZED)
    index << document

    document = index[url]
    puts document['url'] == url           # true

    document['content'] = 'I agree'
    index.update(url, document)

    index[url]['content'] == 'I agree'    # true
    index.size == 1                       # true

--
Sergei Serdyuk
Red Leaf Software LLC
web: http://redleafsoft.com

> Hi All,
>
> I have a Ferret index containing some cached RSS feeds.
>
> I have a nightly cron script to cache the feeds, and I'd like to update
> the index with the latest feeds.
>
> I see the Index class has an update method, but I can't work out how to
> get the id of the relevant document to pass in.
>
> Lets say I have a file called "google_news.xml"
>
> I want to go:
> my_index.update(google_id, google_doc)
>
> I'm sure this is way too easy and I'm being massively dumb, but - - any
> hints/advice gratefully received.
>
> Many Thanks,
> Steven

--
Posted via http://www.ruby-forum.com/.

From sergei at redleafsoft.com Fri Jun 16 15:08:07 2006
From: sergei at redleafsoft.com (Sergei Serdyuk)
Date: Fri, 16 Jun 2006 21:08:07 +0200
Subject: [Ferret-talk] Finding out all terms from search results. How?
In-Reply-To: <54D3B299-8C98-413C-98D6-70FF7CDDD257@ehatchersolutions.com> References: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au> <7968d7490606160746h6a0150d8ofcc4ee1e22f3107@mail.gmail.com> <81cb07fc49a20f16ab846c3a297e6b7b@ruby-forum.com> <54D3B299-8C98-413C-98D6-70FF7CDDD257@ehatchersolutions.com> Message-ID: <0f7b55db5e2cc5f5fd0792f79f773cc9@ruby-forum.com> Thank you Erik. It is not clear to me what it would look like in Ferret, but it sounds like a good direction to dig in. > Erik Hatcher wrote: > I'm not familiar enough with Ferret, but I do this sort filtering and > set intersections with Java Lucene, primarily using Solr, from a Ruby > on Rails front-end. > > I build up bit sets (using Solr's new OpenBitSet class) that > represent "all items collected" and apply that filter to searches and > also intersect (using bit set ANDing) with other sets such as "all > objects from 1861" and "all poetry genre objects", and so on. I've > also customized Solr to return back facet counts, so given your > example it could show how many books were in stock in each category > and allow you to filter to see all those books easily too. Using > these types of set intersection operations even bypasses the > traditional Lucene search by simply dealing with efficiently > structure sets of document id's. > > Erik -- Posted via http://www.ruby-forum.com/. From erik at ehatchersolutions.com Fri Jun 16 16:45:28 2006 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Fri, 16 Jun 2006 16:45:28 -0400 Subject: [Ferret-talk] Finding out all terms from search results. How? 
In-Reply-To: <0f7b55db5e2cc5f5fd0792f79f773cc9@ruby-forum.com>
References: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au>
 <7968d7490606160746h6a0150d8ofcc4ee1e22f3107@mail.gmail.com>
 <81cb07fc49a20f16ab846c3a297e6b7b@ruby-forum.com>
 <54D3B299-8C98-413C-98D6-70FF7CDDD257@ehatchersolutions.com>
 <0f7b55db5e2cc5f5fd0792f79f773cc9@ruby-forum.com>
Message-ID: <4BA63E76-FBB2-42E1-B90C-06379A7E75C1@ehatchersolutions.com>

On Jun 16, 2006, at 3:08 PM, Sergei Serdyuk wrote:
> Thank you Erik. It is not clear to me what it would look like in
> Ferret,
> but it sounds like a good direction to dig in.

In Java, building up such filters is done with code like this:

    // assumes an open IndexReader named reader and a String named field
    TermDocs termDocs = reader.termDocs();
    TermEnum termEnum = reader.terms(new Term(field, ""));
    while (true) {
      Term term = termEnum.term();
      if (term == null || !term.field().equals(field)) break;

      termDocs.seek(term);
      OpenBitSet bitSet = new OpenBitSet(reader.numDocs());
      while (termDocs.next()) {
        bitSet.set(termDocs.doc());
      }

      // ... cache bitSet for future use ...

      if (! termEnum.next()) break;
    }

Ferret has a comparable API underneath that should make this sort of
thing feasible in pure Ruby somehow.

Erik

>
>> Erik Hatcher wrote:
>> I'm not familiar enough with Ferret, but I do this sort filtering and
>> set intersections with Java Lucene, primarily using Solr, from a Ruby
>> on Rails front-end.
>>
>> I build up bit sets (using Solr's new OpenBitSet class) that
>> represent "all items collected" and apply that filter to searches and
>> also intersect (using bit set ANDing) with other sets such as "all
>> objects from 1861" and "all poetry genre objects", and so on. I've
>> also customized Solr to return back facet counts, so given your
>> example it could show how many books were in stock in each category
>> and allow you to filter to see all those books easily too. Using
>> these types of set intersection operations even bypasses the
>> traditional Lucene search by simply dealing with efficiently
>> structure sets of document id's.
>>
>> Erik
>
>
> --
> Posted via http://www.ruby-forum.com/.
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

From justin.kan at gmail.com Fri Jun 16 19:14:38 2006
From: justin.kan at gmail.com (Justin Kan)
Date: Fri, 16 Jun 2006 19:14:38 -0400
Subject: [Ferret-talk] indexing large tokens
Message-ID:

Hi,

I'm using the StandardAnalyzer to build an index, and passing in
Documents that have Fields that contain large tokens (22+ characters)
interspersed with normal English words. This seems to cause the
IndexWriter to slow to a crawl. Is this a known issue, or am I doing
something wrong?

If this is a known issue I don't have any problem just not indexing
tokens longer than a certain length, but what's the best way to
eliminate them? Using a TokenFilter on my own Analyzer? Sorry for the
newbish questions, I'm new to Ferret, having never used Lucene. Thanks
in advance,

Justin

From dbalmain.ml at gmail.com Fri Jun 16 20:07:09 2006
From: dbalmain.ml at gmail.com (David Balmain)
Date: Sat, 17 Jun 2006 09:07:09 +0900
Subject: [Ferret-talk] indexing large tokens
In-Reply-To:
References:
Message-ID:

On 6/17/06, Justin Kan wrote:
> Hi,
>
> I'm using the StandardAnalyzer to build an index, and passing in Documents
> that have Fields that contain large tokens (22+ characters) interspersed with
> normal English words. This seems to cause the IndexWriter to slow to a
> crawl. Is this a known issue, or am I doing something wrong?

Hi Justin,

I haven't come across this problem. Are you on Windows by any chance?
Currently Ferret is just generally slow on Windows because it runs as
pure Ruby code there.
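
On the question of dropping over-long tokens, here is a rough plain-Ruby
sketch of a length-limiting filter. It is only an illustration: it does
not subclass Ferret's own TokenFilter, the class and method names are
invented for this example, and it simply assumes a token stream is any
object whose next method returns the next token or nil at the end.

```ruby
# Plain-Ruby sketch of a length-limiting token filter. Tokens are plain
# Strings here; nothing below is Ferret API.

# Wraps another stream and silently skips tokens longer than max_length.
class MaxLengthFilter
  def initialize(token_stream, max_length)
    @token_stream = token_stream  # any object responding to #next
    @max_length = max_length
  end

  # Returns the next token that is short enough, or nil when the
  # underlying stream is exhausted.
  def next
    while (token = @token_stream.next)
      return token if token.length <= @max_length
    end
    nil
  end
end

# Hypothetical stand-in for a tokenizer: hands out words from an Array.
class ArrayTokenStream
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def next
    @tokens.shift
  end
end

# Convenience helper: run the words through the filter and collect
# whatever survives.
def filtered_tokens(words, max_length)
  stream = MaxLengthFilter.new(ArrayTokenStream.new(words), max_length)
  result = []
  while (token = stream.next)
    result << token
  end
  result
end

puts filtered_tokens(%w[normal 0123456789abcdef0123456789 words], 21).inspect
```

In a real analyzer the same wrapping would sit between the tokenizer and
the index writer, so the long tokens never reach the index at all.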
One problem large tokens may cause is the general increase in the number of terms in the index which can slow down indexing a little but it would surprise me if it was making a huge difference unless there was a particularly large number of them. > If this is a known issue I don't have any problem just not indexing tokens > longer than a certain length, but what's the best way to eliminate them? > Using a TokenFilter on my own Analyzer? Sorry for the newbish questions, I'm > new to ferret having never used lucene. Thanks in advance, Yes, using a token filter will do the job. Have a look in the analysis module of Ferret for some examples. I'd be interested to hear if it makes any difference. Cheers, Dave From dbalmain.ml at gmail.com Fri Jun 16 20:27:18 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 17 Jun 2006 09:27:18 +0900 Subject: [Ferret-talk] Finding out all terms from search results. How? In-Reply-To: <4BA63E76-FBB2-42E1-B90C-06379A7E75C1@ehatchersolutions.com> References: <126EC586577FD611A28E00A0C9A037587E6C5C@maui.bmsoft.com.au> <7968d7490606160746h6a0150d8ofcc4ee1e22f3107@mail.gmail.com> <81cb07fc49a20f16ab846c3a297e6b7b@ruby-forum.com> <54D3B299-8C98-413C-98D6-70FF7CDDD257@ehatchersolutions.com> <0f7b55db5e2cc5f5fd0792f79f773cc9@ruby-forum.com> <4BA63E76-FBB2-42E1-B90C-06379A7E75C1@ehatchersolutions.com> Message-ID: On 6/17/06, Erik Hatcher wrote: > On Jun 16, 2006, at 3:08 PM, Sergei Serdyuk wrote: > > Thank you Erik. It is not clear to me what it would look like in > > Ferret, > > but it sounds like a good direction to dig in. > > In Java, building up such filters is done with code like this: > > TermEnum termEnum = reader.terms(new Term(field, "")); > while (true) { > Term term = termEnum.term(); > if (term == null || !term.field().equals(field)) break; > > termDocs.seek(term); > OpenBitSet bitSet = new OpenBitSet(reader.numDocs()); > while (termDocs.next()) { > bitSet.set(termDocs.doc()); > } > > // ... cache bitSet for future use ... 
> > if (! termEnum.next()) break; > } > > Ferret has a comparable API underneath that should make this sort of > thing feasible in pure Ruby somehow. It is similar in Ferret. Have a look here to see the solution to a similar problem; http://www.ruby-forum.com/topic/56232#40931 Hope that helps. Cheers, Dave From dbalmain.ml at gmail.com Fri Jun 16 20:40:13 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 17 Jun 2006 09:40:13 +0900 Subject: [Ferret-talk] Ferret::Analysis::PerFieldAnalyzerWrapper is not exported In-Reply-To: <1c64d8a934570a7782a6418f8193ea6c@ruby-forum.com> References: <1c64d8a934570a7782a6418f8193ea6c@ruby-forum.com> Message-ID: On 6/16/06, Sergei Serdyuk wrote: > Hi, > > I am on Ferret 0.9.3 and it seems to me that > Ferret::Analysis::PerFieldAnalyzerWrapper is not available in > ferret_ext. Sorry, this was a naming error. In 0.9.3 it is called PerFieldAnalyzer. This has been fixed in subversion so that both both class names will work. From dbalmain.ml at gmail.com Fri Jun 16 21:15:58 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 17 Jun 2006 10:15:58 +0900 Subject: [Ferret-talk] In memory IndexReader bug? In-Reply-To: <71f43b0170e73d2fbb415d49d6cdcb11@ruby-forum.com> References: <71f43b0170e73d2fbb415d49d6cdcb11@ruby-forum.com> Message-ID: On 6/14/06, Marcus Crafter wrote: > Hi All, > > Hope all is going well. > > I'm having trouble with the following code creating an in memory index > reader - it seems to be attempting to read from a file regardless. 
> Here's the simple code: > > require 'rubygems' > require 'ferret' > > a = Ferret::Index::Index.new > r = Ferret::Index::IndexReader.new(nil) > > > Running the code on my OS X machine gives: > > marcus-crafters-powerbook-g4-17:/tmp crafterm$ ruby t.rb > t.rb:5:in `initialize': : Error occured at :318 (Exception) > Error: exception 2 not handled: Couldn't open the file to read > from t.rb:5 > > The IndexReader API says pass nil in for an in memory directory, so I'm > not sure what's wrong. > > Is this a bug - any ideas at all? This is ferret 0.9.3 for reference. Hi Marcus, Sorry, this is a mistake in the docs. It doesn't make sense to open an IndexReader with an anonymous RAMDirectory as it obviously won't contain any index yet. The problem is that the IndexReader is trying to read the segments file which it expects to be there but, since no index has been written, there is no segments file. If you pass a RAMDirectory that actually contains an index written by an IndexWriter or Index class then it should work. Cheers, Dave > Cheers, > > Marcus > > -- > Posted via http://www.ruby-forum.com/. > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From dbalmain.ml at gmail.com Fri Jun 16 21:26:15 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 17 Jun 2006 10:26:15 +0900 Subject: [Ferret-talk] Grep style output? In-Reply-To: <4f1a5f63529e061bf952382b805a23cd@ruby-forum.com> References: <4f1a5f63529e061bf952382b805a23cd@ruby-forum.com> Message-ID: On 6/13/06, Marcus Crafter wrote: > Hi All, > > Hope all is going well. Was just wondering if anyone has implemented a > grep style output page of hits using Ferret as the index/query engine? > > Any thoughts about how best to implement it? The previous thread > discussess highlighting - would that be the best approach to follow or > is there a better way? 
> > Cheers, > > Marcus Hi Marcus, If you can read java the best way would be to check out the highlighter in Apache Lucene and porting that code to Ruby. You can see the highlighter module here; http://svn.apache.org/viewvc/lucene/java/trunk/contrib/ I'm going to do this myself eventually but you'll have to do it yourself if you need it soon. Before you put too much work into it though, be warned that there are possible major Ferret API changes ahead. Cheers, Dave From dbalmain.ml at gmail.com Fri Jun 16 21:31:51 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 17 Jun 2006 10:31:51 +0900 Subject: [Ferret-talk] Bus Error with Ferret 0.9.3 using the BooleanQuery api In-Reply-To: <590e5e196dc8248a49d05ee5e480ecf6@ruby-forum.com> References: <590e5e196dc8248a49d05ee5e480ecf6@ruby-forum.com> Message-ID: On 6/11/06, Michael Koziarski wrote: > Hey guys, > > I've been trying out ferret 0.9.3 on my powerbook this weekend and I've > been triggering 'bus errors' when using the Query API. If I > programmatically build up strings, it works just fine. > > There's some more information available in the trac ticket > > http://ferret.davebalmain.com/trac/ticket/62 > > Is anyone successfully using the Query API on mac os x? Anything I can > do to help debug this? Hi Michael, I noticed in your code you are using BooleanQuery#add_clause method but you are adding a query. Can you try BooleanQuery#add_query instead? Let me know if that helps. Cheers, Dave From jmcgrath at fryolator.com Sat Jun 17 17:48:23 2006 From: jmcgrath at fryolator.com (John McGrath) Date: Sat, 17 Jun 2006 17:48:23 -0400 Subject: [Ferret-talk] preventing indexing of an acts_as_ferret'd model? Message-ID: <1150580903.449478a73aec4@webmail.whoi.edu> Hi, this is hopefully an easy one, but I've gone through the api and searched past forum entries, and am drawing a blank. I have a model that with acts_as_ferret mixed in to it, which is working fine. 
But I want users to be able to set a 'private' attribute on the model,
and when it's set to true, create and update methods would skip
indexing. So, how can I prevent indexing of a usually-indexed model?

Any help greatly appreciated.

John

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

From JanPrill at blauton.de Sun Jun 18 03:23:11 2006
From: JanPrill at blauton.de (Jan Prill)
Date: Sun, 18 Jun 2006 09:23:11 +0200
Subject: [Ferret-talk] preventing indexing of an acts_as_ferret'd model?
In-Reply-To: <562a35c10606180020v5f1e051cq45387dd333d6e266@mail.gmail.com>
References: <1150580903.449478a73aec4@webmail.whoi.edu>
 <562a35c10606180020v5f1e051cq45387dd333d6e266@mail.gmail.com>
Message-ID: <562a35c10606180023r2de756cfue3f33416a57125f6@mail.gmail.com>

Hi, John,

this might be wrong since I'm no expert in acts_as_ferret. But I've
scrolled through
http://projects.jkraemer.net/acts_as_ferret/browser/trunk/plugin/acts_as_ferret/lib/acts_as_ferret.rb
and maybe it's only me and you but I don't see a "keep_private" kind of
flag in acts_as_ferret.rb either.

So IMHO you've got two choices:

1. Without changes to acts_as_ferret.rb you might add a field
"keep_private" to your model, which you set to the user_id (or even
better a salted hash) of your users if they set it to true, or leave at
the default '0'. Then in your queries you issue a query that returns
only results where keep_private is set to '0' OR your user_id (or the
salted hash). This should return only documents that are either public
or created by the current user.

2.
If this is too insecure in your opinion (because you don't trust that no documents would leak out) and you definitly want to keep private documents from being indexed (which of course means that they aren't searchable for the original creator either) then IMHO your best bet would be to fiddle around with acts_as_ferret.rb yourself (it's quite good documented) or overwrite it's instance methods in your model. You might then add a small check on the ferret_create for example if the model is private and therefore needs to skip the indexing. Cheers, Jan > > > On 6/17/06, John McGrath wrote: > > > > Hi, this is hopefully an easy one, but I've gone through the api and > > searched > > past forum entries, and am drawing a blank. > > > > I have a model that with acts_as_ferret mixed in to it, which is working > > fine. > > But I want users to be able to set a 'private' attribute on the model, > > and when > > it's set to true, create and update methods would skip indexing. So, how > > can I > > prevent indexing of a usuaully-indexed model? > > > > Any help greatly appreciated. > > > > John > > > > ---------------------------------------------------------------- > > This message was sent using IMP, the Internet Messaging Program. > > > > _______________________________________________ > > Ferret-talk mailing list > > Ferret-talk at rubyforge.org > > http://rubyforge.org/mailman/listinfo/ferret-talk > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20060618/683f6df4/attachment.htm From jmcgrath at fryolator.com Sun Jun 18 04:59:10 2006 From: jmcgrath at fryolator.com (John McGrath) Date: Sun, 18 Jun 2006 04:59:10 -0400 Subject: [Ferret-talk] preventing indexing of an acts_as_ferret'd model? 
In-Reply-To: <562a35c10606180023r2de756cfue3f33416a57125f6@mail.gmail.com> References: <1150580903.449478a73aec4@webmail.whoi.edu> <562a35c10606180020v5f1e051cq45387dd333d6e266@mail.gmail.com> <562a35c10606180023r2de756cfue3f33416a57125f6@mail.gmail.com> Message-ID: <1150621150.449515de9f719@webmail.whoi.edu> thanks jan, both good suggestions. when time permits i might try the second. for now i'm using a ridiculous hack, but it's working: i let the create method do it's thing, then if the 'private' flag is set, i call ferret_destroy for it. a lot of useless overhead, to add to the index then immediately remove from it, but since this db is read much more than it's written to, this will at least hold us over for a while. best, john Quoting Jan Prill : > Hi, John, > > this might be wrong since I'm no expert in acts_as_ferret. But I've scrolled > through > http://projects.jkraemer.net/acts_as_ferret/browser/trunk/plugin/acts_as_ferret/lib/acts_as_ferret.rb > and maybe it's only me and you but I don't see a "keep_private" kind of flag > in acts_as_ferret.rb either. > > So IMHO you've got two choices: > > 1. Without changes on acts_as_ferret.rb you might add a field "keep_private" > to your model which you set to the user_id (or even better a salted hash) of > your users if they set it to true or leave it to the default '0'. Then in > your queries you issue a query that returns only results where keep_private > is set to '0' OR your user_id (or the salted hash). This should return only > documents that are either public or created by the current user. > > 2. 
If this is too insecure in your opinion (because you don't trust that no > documents would leak out) and you definitly want to keep private documents > from being indexed (which of course means that they aren't searchable for > the original creator either) then IMHO your best bet would be to fiddle > around with acts_as_ferret.rb yourself (it's quite good documented) or > overwrite it's instance methods in your model. You might then add a small > check on the ferret_create for example if the model is private and therefore > needs to skip the indexing. > > Cheers, > Jan > > > > > > > On 6/17/06, John McGrath wrote: > > > > > > Hi, this is hopefully an easy one, but I've gone through the api and > > > searched > > > past forum entries, and am drawing a blank. > > > > > > I have a model that with acts_as_ferret mixed in to it, which is working > > > fine. > > > But I want users to be able to set a 'private' attribute on the model, > > > and when > > > it's set to true, create and update methods would skip indexing. So, how > > > can I > > > prevent indexing of a usuaully-indexed model? > > > > > > Any help greatly appreciated. > > > > > > John > > > > > > ---------------------------------------------------------------- > > > This message was sent using IMP, the Internet Messaging Program. > > > > > > _______________________________________________ > > > Ferret-talk mailing list > > > Ferret-talk at rubyforge.org > > > http://rubyforge.org/mailman/listinfo/ferret-talk > > > > > > > > ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From justin.kan at gmail.com Mon Jun 19 13:53:21 2006 From: justin.kan at gmail.com (Justin Kan) Date: Mon, 19 Jun 2006 13:53:21 -0400 Subject: [Ferret-talk] indexing large tokens In-Reply-To: References: Message-ID: David, I was running on Windows, and when I moved to linux the problem disappeared (I'm assuming because linux automatically uses cferret?). 
Thanks for the help! Justin On 6/16/06, David Balmain < dbalmain.ml at gmail.com> wrote: > > On 6/17/06, Justin Kan < justin.kan at gmail.com> wrote: > > Hi, > > > > I'm using the StandardAnalyzer to build an index, and passing in > Documents > > that have Fields that contain large tokens (22+ characters) interpersed > with > > normal English words. This seems to cause the IndexWriter to slow to a > > crawl. Is this a known issue, or am I doing something wrong? > > Hi Justin, > > I haven't come accross this problem? Are you on Windows by any chance? > Currently Ferret is just generally slow on Windows because it is pure > Ruby code. One problem large tokens may cause is the general increase > in the number of terms in the index which can slow down indexing a > little but it would surprise me if it was making a huge difference > unless there was a particularly large number of them. > > > If this is a known issue I don't have any problem just not indexing > tokens > > longer than a certain length, but what's the best way to eliminate them? > > > Using a TokenFilter on my own Analyzer? Sorry for the newbish questions, > I'm > > new to ferret having never used lucene. Thanks in advance, > > Yes, using a token filter will do the job. Have a look in the analysis > module of Ferret for some examples. I'd be interested to hear if it > makes any difference. > > Cheers, > Dave > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20060619/f2f7932c/attachment.htm

From garypelliott at gmail.com Mon Jun 19 14:03:32 2006
From: garypelliott at gmail.com (Gary Elliott)
Date: Mon, 19 Jun 2006 14:03:32 -0400
Subject: [Ferret-talk] Ferret on a production site
Message-ID:

Over the weekend we launched http://tourb.us, a social search site for
live music that makes heavy use of ferret. We're using version 0.3.2
and have had very few problems so far. You can read about some of our
ferret usage here:

http://blog.tourb.us/archives/ferret-and-location-based-searches

I'd be very interested in hearing from other people who have ferret
running on a production site. How big are your indexes? Have you run
into any performance problems?

Thanks for some great software, David.

-gary

From shingler at gmail.com Tue Jun 20 13:20:46 2006
From: shingler at gmail.com (Steven Shingler)
Date: Tue, 20 Jun 2006 19:20:46 +0200
Subject: [Ferret-talk] best updating method
In-Reply-To: <02ab204b28868f301019b4b6614e3b4e@ruby-forum.com>
References: <9725b8adcd24a46db77c015c9017cf77@ruby-forum.com> <02ab204b28868f301019b4b6614e3b4e@ruby-forum.com>
Message-ID: <25ee47703a369bb25792b3ec05baa69f@ruby-forum.com>

Many Thanks - very helpful :)

--
Posted via http://www.ruby-forum.com/.

From sergei at redleafsoft.com Tue Jun 20 11:34:13 2006
From: sergei at redleafsoft.com (Sergei Serdyuk)
Date: Tue, 20 Jun 2006 17:34:13 +0200
Subject: [Ferret-talk] Any fast way to update non-indexed fields?
Message-ID: <35fb43248589cfd29fa45a3dc40f58bb@ruby-forum.com>

Hi,

From looking at the Ruby sources it seems that every update method
deletes and reinserts documents. It makes sense if indexed fields are
changed, but what if that is not the case? It would speed up updates a
lot if indexes did not have to be updated twice for nothing. Any quick
way to do it?

--
Sergei Serdyuk
Red Leaf Software LLC
web: http://redleafsoft.com

--
Posted via http://www.ruby-forum.com/.
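One way around the delete-and-reinsert cost, for data that is never searched, may be to keep it out of the index entirely and look it up by id. The sketch below is plain Ruby with no Ferret calls; `TinyIndex`, `stock` and the sample titles are hypothetical names used only to illustrate the idea, not anything from Ferret's API.

```ruby
# Toy inverted index (hypothetical, not Ferret). Updating a document means
# removing its old postings and adding the new ones, i.e. two index writes.
class TinyIndex
  attr_reader :writes

  def initialize
    @postings = Hash.new { |h, term| h[term] = [] }  # term => [doc ids]
    @docs = {}                                       # doc id => indexed text
    @writes = 0                                      # number of index rewrites
  end

  def add(id, text)
    @docs[id] = text
    text.downcase.split.each { |term| @postings[term] << id }
    @writes += 1
  end

  def delete(id)
    text = @docs.delete(id)
    return if text.nil?
    text.downcase.split.each { |term| @postings[term].delete(id) }
    @writes += 1
  end

  # No in-place update: delete the old document, then add the new one.
  def update(id, text)
    delete(id)
    add(id, text)
  end

  def search(term)
    @postings.fetch(term.downcase, []).uniq
  end
end

index = TinyIndex.new
stock = {}                        # side store for unindexed, fast-changing data

index.add(1, "Programming Ruby")
stock[1] = 12

stock[1] = 11                     # a stock change touches only the side store
hits = index.search("ruby").map { |id| [id, stock[id]] }

puts hits.inspect                 # [[1, 11]]
puts index.writes                 # 1

index.update(1, "Programming Ruby Second Edition")
puts index.writes                 # 3 (one add, then a delete plus an add)
```

The trade-off is an extra lookup per hit, in exchange for leaving the index untouched whenever only the side data changes.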
From ryan at theryanking.com Tue Jun 20 14:09:30 2006
From: ryan at theryanking.com (ryan king)
Date: Tue, 20 Jun 2006 20:09:30 +0200
Subject: [Ferret-talk] Any fast way to update non-indexed fields?
In-Reply-To: <35fb43248589cfd29fa45a3dc40f58bb@ruby-forum.com>
References: <35fb43248589cfd29fa45a3dc40f58bb@ruby-forum.com>
Message-ID:

Sergei Serdyuk wrote:
> Hi,
>
> From looking at Ruby sources it seems that every update method deletes
> and reinserts documents. It makes sense if indexed fields are changed
> but what if it is not the case? It would speed up update a lot indexes
> did not have to be updated twice for nothing. Any quick way to do it?

I'm not an expert with Lucene, but I believe that's how Lucene indexes
work - there is no update, only create and delete.

--
Posted via http://www.ruby-forum.com/.

From dbalmain.ml at gmail.com Tue Jun 20 21:30:00 2006
From: dbalmain.ml at gmail.com (David Balmain)
Date: Wed, 21 Jun 2006 10:30:00 +0900
Subject: [Ferret-talk] Any fast way to update non-indexed fields?
In-Reply-To:
References: <35fb43248589cfd29fa45a3dc40f58bb@ruby-forum.com>
Message-ID:

On 6/21/06, ryan king wrote:
> Sergei Serdyuk wrote:
> > Hi,
> >
> > From looking at Ruby sources it seems that every update method deletes
> > and reinserts documents. It makes sense if indexed fields are changed
> > but what if it is not the case? It would speed up update a lot indexes
> > did not have to be updated twice for nothing. Any quick way to do it?
>
> I'm not an expert with Lucene, but I believe that's how Lucene indexes
> work - there is no update, only create and delete.

It is in fact the way Lucene works. The main problem with the update
method in Ferret is that for each update it needs to open an
IndexReader to read and delete the old doc, then close it and open an
IndexWriter to add the new doc. In the version of Ferret I'm working on
now you'll be able to do updates directly on the IndexWriter so it
should be a lot faster.
As for just updating the stored-unindexed fields, I'll have to think
about it. It'll add a bit of complexity to the merge process which I'm
not too keen on. But it is certainly possible. Sergei, what type of
field is it that you need to update? And to everyone else on the list,
is this a common action? That is, do you often need to update
non-indexed fields?

Cheers,
Dave

From crafterm at gmail.com Wed Jun 21 02:00:50 2006
From: crafterm at gmail.com (Marcus Crafter)
Date: Wed, 21 Jun 2006 08:00:50 +0200
Subject: [Ferret-talk] Grep style output?
In-Reply-To:
References: <4f1a5f63529e061bf952382b805a23cd@ruby-forum.com>
Message-ID:

David Balmain wrote:
> On 6/13/06, Marcus Crafter wrote:
> Hi Marcus,
>
> If you can read java the best way would be to check out the
> highlighter in Apache Lucene and porting that code to Ruby. You can
> see the highlighter module here;
>
> http://svn.apache.org/viewvc/lucene/java/trunk/contrib/
>
> I'm going to do this myself eventually but you'll have to do it
> yourself if you need it soon. Before you put too much work into it
> though, be warned that there are possible major Ferret API changes
> ahead.

Hi David,

Thanks for your response.

I noticed in a previous post you referenced the Lucene highlighter and
have already started porting it to Ferret. I'm already quite a ways
along and have got the first 3 test cases passing properly (i.e. simple
and fuzzy fragments) and will continue with getting the rest of the
test cases to work.

Hopefully the API changes don't break too much then :)

I'll post the code once it's all working, hopefully within the next
few days.

Cheers,

Marcus

--
Posted via http://www.ruby-forum.com/.

From crafterm at gmail.com Wed Jun 21 04:25:48 2006
From: crafterm at gmail.com (Marcus Crafter)
Date: Wed, 21 Jun 2006 10:25:48 +0200
Subject: [Ferret-talk] In memory IndexReader bug?
In-Reply-To: References: <71f43b0170e73d2fbb415d49d6cdcb11@ruby-forum.com> Message-ID: David Balmain wrote: > On 6/14/06, Marcus Crafter wrote: > Hi Marcus, > > Sorry, this is a mistake in the docs. It doesn't make sense to open an > IndexReader with an anonymous RAMDirectory as it obviously won't > contain any index yet. The problem is that the IndexReader is trying > to read the segments file which it expects to be there but, since no > index has been written, there is no segments file. If you pass a > RAMDirectory that actually contains an index written by an IndexWriter > or Index class then it should work. Hi David, Thanks mate for the information. I actually get the same problem when attempting to use a writer: @ramDIr = RAMDirectory.new @writer = IndexWriter.new(@ramDir) That gives me: IOError: No file segments /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/store/ram_store.rb:79:in `open_input' /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/index/segment_infos.rb:70:in `read' /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/index/index_writer.rb:108:in `initialize' /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/store/directory.rb:135:in `while_locked' /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/index/index_writer.rb:103:in `initialize' /sw/lib/ruby/1.8/monitor.rb:229:in `synchronize' /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/index/index_writer.rb:102:in `initialize' I didn't think that was expected. Essentially I'm trying to the following equivalent Lucene code: protected void setUp() throws Exception { ramDir = new RAMDirectory(); IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true); for (int i = 0; i < texts.length; i++) { addDoc(writer, texts[i]); } writer.optimize(); writer.close(); reader = IndexReader.open(ramDir); numHighlights = 0; } The only way I've been able to get it to work is by using an on disk index rather than a in-memory based one. Any thoughts? 
Cheers, Marcus -- Posted via http://www.ruby-forum.com/. From dbalmain.ml at gmail.com Wed Jun 21 06:23:55 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 21 Jun 2006 19:23:55 +0900 Subject: [Ferret-talk] In memory IndexReader bug? In-Reply-To: References: <71f43b0170e73d2fbb415d49d6cdcb11@ruby-forum.com> Message-ID: On 6/21/06, Marcus Crafter wrote: > David Balmain wrote: > > On 6/14/06, Marcus Crafter wrote: > > Hi Marcus, > > > > Sorry, this is a mistake in the docs. It doesn't make sense to open an > > IndexReader with an anonymous RAMDirectory as it obviously won't > > contain any index yet. The problem is that the IndexReader is trying > > to read the segments file which it expects to be there but, since no > > index has been written, there is no segments file. If you pass a > > RAMDirectory that actually contains an index written by an IndexWriter > > or Index class then it should work. > > Hi David, > > Thanks mate for the information. I actually get the same problem when > attempting to use a writer: > > @ramDIr = RAMDirectory.new > @writer = IndexWriter.new(@ramDir) > > That gives me: > > IOError: No file segments > /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/store/ram_store.rb:79:in > `open_input' > /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/index/segment_infos.rb:70:in > `read' > /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/index/index_writer.rb:108:in > `initialize' > /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/store/directory.rb:135:in > `while_locked' > /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/index/index_writer.rb:103:in > `initialize' > /sw/lib/ruby/1.8/monitor.rb:229:in `synchronize' > /sw/lib/ruby/gems/1.8/gems/ferret-0.9.3/lib/ferret/index/index_writer.rb:102:in > `initialize' > > I didn't think that was expected. 
Essentially I'm trying to the
> following equivalent Lucene code:
>
>   protected void setUp() throws Exception
>   {
>     ramDir = new RAMDirectory();
>     IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(),
> true);
>     for (int i = 0; i < texts.length; i++)
>     {
>       addDoc(writer, texts[i]);
>     }
>
>     writer.optimize();
>     writer.close();
>     reader = IndexReader.open(ramDir);
>     numHighlights = 0;
>   }
>
> The only way I've been able to get it to work is by using an on disk
> index rather than a in-memory based one.
>
> Any thoughts?

You need to set the :create option to true. Or you could use the
:create_if_missing option; it doesn't really make any difference.
Personally, I'd just use the Index class. You should only need to fall
back to the IndexWriter and IndexReader classes if you are doing more
advanced stuff with the index, like writing your own filters. Anyway,
here is the code for your example:

    def setup
      ram_dir = RAMDirectory.new
      # the writer must get the same directory the reader opens below
      writer = IndexWriter.new(ram_dir, :create => true)
      texts.each {|text| writer << text}
      writer.optimize
      writer.close
      # instance variable, so the tests can use the reader after setup
      @reader = IndexReader.open(ram_dir)
    end

If you are having trouble debugging something you can try require
'rferret' instead of 'ferret'. This will use the pure Ruby version of
Ferret and you should be able to find the problem more easily.

Hope that helps,
Dave

From dbalmain.ml at gmail.com Wed Jun 21 06:32:38 2006
From: dbalmain.ml at gmail.com (David Balmain)
Date: Wed, 21 Jun 2006 19:32:38 +0900
Subject: [Ferret-talk] Grep style output?
In-Reply-To:
References: <4f1a5f63529e061bf952382b805a23cd@ruby-forum.com>
Message-ID:

On 6/21/06, Marcus Crafter wrote:
> David Balmain wrote:
> > On 6/13/06, Marcus Crafter wrote:
> > Hi Marcus,
> >
> > If you can read java the best way would be to check out the
> > highlighter in Apache Lucene and porting that code to Ruby.
You can > > see the highlighter module here; > > > > http://svn.apache.org/viewvc/lucene/java/trunk/contrib/ > > > > I'm going to do this myself eventually but you'll have to do it > > yourself if you need it soon. Before you put too much work into it > > though, be warned that there are possible major Ferret API changes > > ahead. > > Hi David, > > Thanks for your response. > > I noticed in a previous post you referenced the lucene highlighter and > have already started porting it to Ferret. I'm already quite a ways > along and have got the first 3 test cases passing properly (ie. simple > and fuzzy fragments) and will continue with getting the rest of the test > cases to work. > > Hopefully the API changes don't break too much then :) > > I'll post the code once it's all working, hopefully within the next > days. > > Cheers, > > Marcus That'd be great. The new API shouldn't be too hard to adjust to. I'll be implementing the highlighter in C rather than in Ruby so I'll be interested to see how you go with it. The main difference in the API is that you won't specify the store, index and term_vector parameters per document field any more. This option will still be available but the behaviour will be slightly different. I'll go into more detail later. Cheers, Dave From marvin at rectangular.com Wed Jun 21 09:51:51 2006 From: marvin at rectangular.com (Marvin Humphrey) Date: Wed, 21 Jun 2006 06:51:51 -0700 Subject: [Ferret-talk] Grep style output? In-Reply-To: References: <4f1a5f63529e061bf952382b805a23cd@ruby-forum.com> Message-ID: On Jun 21, 2006, at 3:32 AM, David Balmain wrote: > I'll > be implementing the highlighter in C rather than in Ruby so I'll be > interested to see how you go with it. > > The main difference in the API is that you won't specify the store, > index and term_vector parameters per document field any more. This > option will still be available but the behaviour will be slightly > different. I'll go into more detail later. 
How close is what you're going to be doing to the Lucene contrib highlighter? FWIW, the KinoSearch Highlighter uses similar techniques for adding tags and encoding, but the excerpt selection is pretty different. No TokenStream required, it uses a heat map. Right now it requires that the field have term vectors stored with positions and offsets, but it could be adapted to generate the vectors by re-analyzing. The principle advantage it has over the Lucene Highlighter in that it handles phrases properly: http://xrl.us/nm2z (Link to www.lucenebook.com) http://xrl.us/nm25 (Link to www.rectangular.com) Whatever algorithm we choose for Lucy, I hope it will meet that constraint. Higlighter.pm isn't that long (384 lines including docs) and if I didn't have an serious deadlines bearing down doing a Ruby version would be a great exercise for me. If you or Marcus want to check it out, the new version's only in subversion: http://xrl.us/nm28 (Link to www.rectangular.com) Marvin Humphrey Rectangular Research http://www.rectangular.com/ From dbalmain.ml at gmail.com Wed Jun 21 12:06:35 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Thu, 22 Jun 2006 01:06:35 +0900 Subject: [Ferret-talk] Grep style output? In-Reply-To: References: <4f1a5f63529e061bf952382b805a23cd@ruby-forum.com> Message-ID: On 6/21/06, Marvin Humphrey wrote: > > On Jun 21, 2006, at 3:32 AM, David Balmain wrote: > > > I'll > > be implementing the highlighter in C rather than in Ruby so I'll be > > interested to see how you go with it. > > > > The main difference in the API is that you won't specify the store, > > index and term_vector parameters per document field any more. This > > option will still be available but the behaviour will be slightly > > different. I'll go into more detail later. > > How close is what you're going to be doing to the Lucene contrib > highlighter? Well I haven't actually started it yet so we'll see. 
> FWIW, the KinoSearch Highlighter uses similar techniques for adding > tags and encoding, but the excerpt selection is pretty different. No > TokenStream required, it uses a heat map. Right now it requires that > the field have term vectors stored with positions and offsets, but it > could be adapted to generate the vectors by re-analyzing. > > The principle advantage it has over the Lucene Highlighter in that it > handles phrases properly: > > http://xrl.us/nm2z (Link to www.lucenebook.com) > http://xrl.us/nm25 (Link to www.rectangular.com) > > Whatever algorithm we choose for Lucy, I hope it will meet that > constraint. > > Higlighter.pm isn't that long (384 lines including docs) and if I > didn't have an serious deadlines bearing down doing a Ruby version > would be a great exercise for me. If you or Marcus want to check it > out, the new version's only in subversion: > > http://xrl.us/nm28 (Link to www.rectangular.com) Cool, I'll definitely check this out. Thanks Marvin. From sergei at redleafsoft.com Thu Jun 22 10:45:12 2006 From: sergei at redleafsoft.com (Sergei Serdyuk) Date: Thu, 22 Jun 2006 16:45:12 +0200 Subject: [Ferret-talk] Any fast way to update non-indexed fields? In-Reply-To: References: <35fb43248589cfd29fa45a3dc40f58bb@ruby-forum.com> Message-ID: <1fca4495d014627f8f807ad00247cffc@ruby-forum.com> Reported as a bug: http://ferret.davebalmain.com/trac/ticket/69 > If I were to wish for something in coming Ferret, I'd wish "stability". > I am getting seg_faults every other time I am doing this: -- Posted via http://www.ruby-forum.com/. From tbone at horetore.com Thu Jun 22 15:48:37 2006 From: tbone at horetore.com (Trent Steele) Date: Thu, 22 Jun 2006 21:48:37 +0200 Subject: [Ferret-talk] Partition results based on field Message-ID: <44af620d578e923f7693f6077898324f@ruby-forum.com> Hello all I'm using Ferret for a site wide search where I have several kinds of (similar) objects in a central index (using a "type" field containing the class name). 
This works great, and I can search all objects with one query. What I'd like to do now is to limit the results so that there will be a maximum of 10 (or 5 or whatever) results for each type.. I can't figure out how to do this, so I thought maybe someone brighter than me has done this before or knows how to do it? :) Trent Steele -- Posted via http://www.ruby-forum.com/. From dbalmain.ml at gmail.com Thu Jun 22 19:25:20 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Fri, 23 Jun 2006 08:25:20 +0900 Subject: [Ferret-talk] Partition results based on field In-Reply-To: <44af620d578e923f7693f6077898324f@ruby-forum.com> References: <44af620d578e923f7693f6077898324f@ruby-forum.com> Message-ID: On 6/23/06, Trent Steele wrote: > Hello all > > I'm using Ferret for a site wide search where I have several kinds of > (similar) objects in a central index (using a "type" field containing > the class name). This works great, and I can search all objects with one > query. > > What I'd like to do now is to limit the results so that there will be a > maximum of 10 (or 5 or whatever) results for each type.. I can't figure > out how to do this, so I thought maybe someone brighter than me has done > this before or knows how to do it? :) > > Trent Steele Hi Trent, The way to do this is to search for more than you need and then actually go through each search result and count the types in a hash, only adding a doc if it's type count is under the threshold. If you failed to retrieve enough results then search again and repeat until you get the required number of results. For those of you who know the Lucene API, this is where a Hits class comes in handy. It'll be coming in a future version. 
For now I'll show you the easiest way, by doing a search and setting
:num_docs to max_doc, thereby getting all search results in one go:

    def get_results(search_str, max_type = 5, num_required = 10)
      type_counter = Hash.new(0)   # results kept so far, per type
      results = []
      index.search_each(search_str, :num_docs => index.size) do |doc_id, score|
        doc = index[doc_id]
        # only keep the doc if its type is still under the threshold
        if type_counter[doc[:type]] < max_type
          results << doc
          type_counter[doc[:type]] += 1
        end
        break if results.size >= num_required
      end
      return results
    end

Hope that helps,
Dave

From sergei at redleafsoft.com Fri Jun 23 11:23:53 2006
From: sergei at redleafsoft.com (Sergei Serdyuk)
Date: Fri, 23 Jun 2006 17:23:53 +0200
Subject: [Ferret-talk] Can not rescue ferret exception. What is wrong?
Message-ID: <356df6aa5b2cb5beec5da78e8b3bfdee@ruby-forum.com>

Hi,

I have a big index and a wildcard query raises an exception. That is
all right. The problem is that I cannot rescue this exception and it
bombs right through to the user page. Why? I am on Linux, Ferret 0.9.3
with C extensions.

>> class A
>>   def self.b
>>     Book.index.search('isbn:00*')
>>   rescue
>>     puts 'ok'
>>   end
>> end
=> nil
>> A.b
(irb):3:in `search': : Error occured at :54 (Exception)
Error: exception 6 not handled: Too many clauses
        from (irb):3:in `b'
        from (irb):8:in `irb_binding'
        from /usr/lib/ruby/1.8/irb/workspace.rb:52:in `irb_binding'
        from /usr/lib/ruby/1.8/irb/workspace.rb:52

--
Sergei Serdyuk
Red Leaf Software LLC
web: http://redleafsoft.com

--
Posted via http://www.ruby-forum.com/.
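A possible explanation, offered as a guess from the trace alone: the error is reported with class `Exception`, and in Ruby a bare `rescue` only catches `StandardError` and its subclasses, so anything raised as a plain `Exception` passes straight through it. The sketch below reproduces that behaviour without Ferret; `fake_search` is a hypothetical stand-in for the search call and is not part of any library.

```ruby
# Hypothetical stand-in for the failing search call. It raises a plain
# Exception, which is the class shown in the backtrace above.
def fake_search
  raise Exception, "Too many clauses"
end

# A bare `rescue` is shorthand for `rescue StandardError`, so it does not
# catch a plain Exception.
def bare_rescue
  fake_search
  :no_error
rescue
  :rescued
end

# Naming the class explicitly does catch it.
def explicit_rescue
  fake_search
  :no_error
rescue Exception => e
  "rescued: #{e.message}"
end

outcome = begin
  bare_rescue
rescue Exception
  :escaped   # the bare rescue inside bare_rescue let the error through
end

puts outcome           # escaped
puts explicit_rescue   # rescued: Too many clauses
```

Rescuing `Exception` also swallows things like `Interrupt` and `SystemExit`, so it is probably best kept as narrow as possible, for example wrapped around the search call only.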
From mrvreddy at hotmail.com Fri Jun 23 11:50:05 2006
From: mrvreddy at hotmail.com (Raghuveer Mamilla)
Date: Fri, 23 Jun 2006 17:50:05 +0200
Subject: [Ferret-talk] Problem with search
Message-ID: <824ca6a9bb81d34e88d9a5144d063442@ruby-forum.com>

Hi,

I am working on search and using ferret for that. I am able to search
properly, but the problem is that when I search using the keyword
"doctor" it gives me output, but not when I search using "doctors"
(plural).

If any of you know about this problem, please let me know.

--
Posted via http://www.ruby-forum.com/.

From charlie.hubbard at gmail.com Sun Jun 25 13:37:52 2006
From: charlie.hubbard at gmail.com (Charlie)
Date: Sun, 25 Jun 2006 19:37:52 +0200
Subject: [Ferret-talk] acts_as_ferret with existing data (Building an index?)
Message-ID: <9548e6f67e0c2062518f7b0ba17fbe86@ruby-forum.com>

Hi,

I'm trying to get the acts_as_ferret plugin to work with my rails
application, but it barfs with this error:

No such file or directory - ./index/development/Book/segments

I have existing data in my database, but it was added prior to me using
the ferret plugin. How do I get it to index that data, or when does it
index that data?

Charlie

--
Posted via http://www.ruby-forum.com/.

From jasbur at gmail.com Sun Jun 25 16:28:41 2006
From: jasbur at gmail.com (Jasbur)
Date: Sun, 25 Jun 2006 22:28:41 +0200
Subject: [Ferret-talk] Sorting results by column
Message-ID: <1ab7c1cb86a668451c85963fd3d25211@ruby-forum.com>

I have the acts_as_ferret plugin installed. Everything searches great,
but I would like to limit the results (i.e. by 'end_date') and sort
them (by 'end_date'). 'end_date' is a valid column in my "posts" table.
Here's the code I have already:

@posts = Post.find_by_contents(params[:query])

params[:query] comes from a form.
I am replacing less efficient code that has the restrictions working:

@posts = Post.find(:all,
  :conditions => [
    '(title LIKE :search_query OR body LIKE :search_query) AND end_date >= :enddate',
    {:search_query => '%' + params[:query] + '%', :enddate => Time.now}],
  :order => 'end_date')

I realize I have to pass something similar to :conditions and :order,
but I can't seem to get ferret to obey the options.

--
Posted via http://www.ruby-forum.com/.

From marciorf at gmail.com Mon Jun 26 13:59:22 2006
From: marciorf at gmail.com (Marcio)
Date: Mon, 26 Jun 2006 19:59:22 +0200
Subject: [Ferret-talk] Installing ferret on windows
Message-ID: <58f5ef49ecc9010691f6c569aecba978@ruby-forum.com>

Hello,

I'm running rails 1.1.2 and ferret 0.9.3 and can't install
acts_as_ferret. I tried the following instruction from the
acts_as_ferret wiki:

"Installation
Please use
script/plugin install svn://projects.jkraemer.net/acts_as_ferret/tags/plugin/stable/acts_as_ferret
for easy installation of the current stable version of the plugin. At
the moment this is version 0.2.1 based on Subversion Rev. 51. This is
supposed to work with Ferret 0.9.3 and Rails >= 1.0. "

(ruby script/plugin install in my case, as I'm on a win box)

I did what it says. I also tried changing the svn url from /tags/ to
/trunk/ but it does not work. And it doesn't leave any messages on the
command prompt...

Does anyone know how I can install it?

Thanks in advance

--
Posted via http://www.ruby-forum.com/.

From marvin at rectangular.com Mon Jun 26 13:02:52 2006
From: marvin at rectangular.com (Marvin Humphrey)
Date: Mon, 26 Jun 2006 10:02:52 -0700
Subject: [Ferret-talk] Any fast way to update non-indexed fields?
In-Reply-To: <1fca4495d014627f8f807ad00247cffc@ruby-forum.com>
References: <35fb43248589cfd29fa45a3dc40f58bb@ruby-forum.com> <1fca4495d014627f8f807ad00247cffc@ruby-forum.com>
Message-ID:

[resending... for some reason, this didn't go through earlier...]
On Jun 22, 2006, at 7:45 AM, Sergei Serdyuk wrote: >> If I were to wish for something in coming Ferret, I'd wish >> "stability". >> I am getting seg_faults every other time I am doing this: Dave, I see you've done some work with Valgrind, but I'm not sure how much. To catch errors and memory leaks with KinoSearch, I wrote up a simple script that runs the whole test suite under Valgrind. The test suite takes around 15 minutes to run that way instead of 9 seconds (on the one box where I have Valgrind available), so I only run it rarely -- always when preparing a release, and sometimes when debugging new or refactored C code. Some of the code in KinoSearch's test suite doesn't even produce output; it's just there to exercise an area where there might be memory problems. Do you have something like that going on with Ferret? It's been extremely helpful for me. I don't think I've seen a single segfault bug report since KinoSearch was released, though I have missed a couple memory leaks because the Valgrind output can be a little hard to interpret (there are a few harmless items in Perl that look like memory leaks to Valgrind, which makes real leaks harder to spot). Marvin Humphrey Rectangular Research http://www.rectangular.com/
From sergei at redleafsoft.com Thu Jun 22 10:35:25 2006 From: sergei at redleafsoft.com (Sergei Serdyuk) Date: Thu, 22 Jun 2006 16:35:25 +0200 Subject: [Ferret-talk] Any fast way to update non-indexed fields? In-Reply-To: References: <35fb43248589cfd29fa45a3dc40f58bb@ruby-forum.com> Message-ID: They are stored non-indexed fields. In my case I wanted to have some stock data in searchable index. This is not top priority, as I can really have a second index or a database and do lookups by :id. If I were to wish for something in coming Ferret, I'd wish "stability". I am getting seg_faults every other time I am doing this: def self.internal_field_values(fieldname) term_enum = @@reader.terms_from(Ferret::Index::Term.new(fieldname, "")); out = [] while term_enum.term and (term_enum.term.field == fieldname) # seg faults here out << term_enum.term.text break unless term_enum.next?
end out end > As for just updating the stored-unindexed fields, I'll have to think > about it. It'll add a bit of complexity to the merge process which I'm > not to keen on. But it is certainly possible. Sergei, what type of > field is it that you need to update? And to everyone else on the list, > is this a common action? That is, do you often need to update > non-indexed fields? > > Cheers, > Dave -- Posted via http://www.ruby-forum.com/. From JanPrill at blauton.de Tue Jun 27 02:04:57 2006 From: JanPrill at blauton.de (Jan Prill) Date: Tue, 27 Jun 2006 08:04:57 +0200 Subject: [Ferret-talk] acts_as_ferret with existing data (Building an index?) In-Reply-To: <562a35c10606262253y4c9ba0f5m2d9d5c7892534bd0@mail.gmail.com> References: <9548e6f67e0c2062518f7b0ba17fbe86@ruby-forum.com> <562a35c10606262253y4c9ba0f5m2d9d5c7892534bd0@mail.gmail.com> Message-ID: <562a35c10606262304n73ad3c77w26485d78631e4b00@mail.gmail.com> Hi Charlie, the most simple method for doing this should be the rebuild_index() method of acts_as_ferret. From the API docs (http://projects.jkraemer.net/acts_as_ferret/rdoc/ ): rebuild_index() rebuild the index from all data stored for this model. This is called automatically when no index exists yet. TODO: the automatic index initialization only works if every model class has it's own index, otherwise the index will get populated only with instances from the first model loaded If this isn't working for you, you might (as it seems) 1. Post a ticket to acts_as_ferret trac 2. 
Write some code to build up an initial index yourself: Therefore you only need to find :all models you want to index, iterate over them and index them - calling save on them to trigger acts_as_ferret or bypassing acts_as_ferret by using ferret directly as described in http://ferret.davebalmain.com/trac, beginning with http://ferret.davebalmain.com/api/files/TUTORIAL.html Cheers, Jan > > > > On 6/25/06, Charlie wrote: > > > > Hi, > > > > I'm trying to get the acts_as_ferret plugin to work with my rails > > application, but it barfs with this error: > > > > No such file or directory - ./index/development/Book/segments > > > > I have existing data in my database, but it was added prior to me using > > the ferret plugin. How do I get it to index that data, or when does it > > index that data? > > > > Charlie > > > > -- > > Posted via http://www.ruby-forum.com/. > > _______________________________________________ > > Ferret-talk mailing list > > Ferret-talk at rubyforge.org > > http://rubyforge.org/mailman/listinfo/ferret-talk > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20060627/56a84994/attachment.html From kraemer at webit.de Tue Jun 27 02:32:50 2006 From: kraemer at webit.de (Jens Kraemer) Date: Tue, 27 Jun 2006 08:32:50 +0200 Subject: [Ferret-talk] Can not rescue ferret exception. What is wrong? In-Reply-To: <356df6aa5b2cb5beec5da78e8b3bfdee@ruby-forum.com> References: <356df6aa5b2cb5beec5da78e8b3bfdee@ruby-forum.com> Message-ID: <20060627063250.GI15787@cordoba.webit.de> On Fri, Jun 23, 2006 at 05:23:53PM +0200, Sergei Serdyuk wrote: > Hi, > > I have a big index and wildcard query raises an exception. That is all > right. The problem is I can not rescue this exception and it bombs right > to user page. Why? 
The errors raised by Ferret aren't subclasses of StandardError, and 'rescue' without arguments defaults to 'rescue StandardError'. 'rescue Exception' will rescue from all possible errors, including those from Ferret. regards, Jens -- webit! Gesellschaft für neue Medien mbH www.webit.de Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de Schnorrstraße 76 Tel +49 351 46766 0 D-01069 Dresden Fax +49 351 46766 66 From tbone at horetore.com Tue Jun 27 05:26:00 2006 From: tbone at horetore.com (Trent Steele) Date: Tue, 27 Jun 2006 11:26:00 +0200 Subject: [Ferret-talk] Partition results based on field In-Reply-To: References: <44af620d578e923f7693f6077898324f@ruby-forum.com> Message-ID: David Balmain wrote: > Hi Trent, > > The way to do this is to search for more than you need and then > actually go through each search result and count the types in a hash, > only adding a doc if its type count is under the threshold. If you > failed to retrieve enough results then search again and repeat until > you get the required number of results. For those of you who know the > Lucene API, this is where a Hits class comes in handy. It'll be coming > in a future version. For now I'll show you the easiest way by doing a > search and setting :num_docs to max_doc, thereby getting all search > results in one go; > > def get_results(search_str, max_type = 5, num_required = 10) > type_counter = Hash.new(0) > results = [] > index.search_each(search_str, :num_docs => index.size) do > |doc_id, score| > doc = index[doc_id] > if type_counter[doc[:type]] < max_type > results << doc > type_counter[doc[:type]] += 1 > end > break if results.size >= num_required > end > return results > end > > Hope that helps, > Dave Hi, I suspected I'd have to do something like this. Thanks for putting me on the right path. Are there any concerns about scalability/speed when the index grows larger regarding searching the whole index like this? T -- Posted via http://www.ruby-forum.com/.
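The per-type capping in the get_results method quoted above can be tried out without an index. What follows is only a minimal sketch, assuming the hits arrive sorted best-first as plain hashes with a :type key; cap_per_type and the sample data are hypothetical names for illustration and are not part of Ferret.

```ruby
# Keep at most max_type hits per :type and stop once num_required
# hits have been collected. `hits` is assumed to be sorted best-first.
def cap_per_type(hits, max_type = 5, num_required = 10)
  type_counter = Hash.new(0)   # types not seen yet count as 0
  results = []
  hits.each do |doc|
    # Skip this hit if its type has already reached the cap.
    next if type_counter[doc[:type]] >= max_type
    results << doc
    type_counter[doc[:type]] += 1
    break if results.size >= num_required
  end
  results
end

# Hypothetical sample data standing in for scored search results.
hits = [
  { :id => 1, :type => "book" },
  { :id => 2, :type => "book" },
  { :id => 3, :type => "book" },
  { :id => 4, :type => "dvd" },
  { :id => 5, :type => "book" },
  { :id => 6, :type => "cd" }
]

capped = cap_per_type(hits, 2, 3)
puts capped.map { |doc| doc[:id] }.inspect   # prints [1, 2, 4]
```

With a cap of 2 per type and 3 results required, the third "book" hit is skipped and the loop stops as soon as the "dvd" hit fills the quota, so the later hits are never examined.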
From toredarell at gmail.com Tue Jun 27 05:51:38 2006 From: toredarell at gmail.com (Tore Darell) Date: Tue, 27 Jun 2006 11:51:38 +0200 Subject: [Ferret-talk] Using QueryParser vs building my own query Message-ID: Hello all I finally caved in and decided I should build my own query instead of relying on QueryParser to do the job for me, but I've hit a strange problem.. Here's how I build my query: #Main query query = Ferret::Search::BooleanQuery.new #Build query to match types typesquery = Ferret::Search::BooleanQuery.new @selected_types.each{|type| typesquery.add_query( Ferret::Search::TermQuery.new(Ferret::Index::Term.new('type', type)), Ferret::Search::BooleanClause::Occur::SHOULD ) } #Add types query to main query query.add_query(typesquery, Ferret::Search::BooleanClause::Occur::MUST) #Build query to match content and title contenttitlequery = Ferret::Search::BooleanQuery.new contenttitlequery.add_query( Ferret::Search::TermQuery.new(Ferret::Index::Term.new('content', params[:query])), Ferret::Search::BooleanClause::Occur::SHOULD ) contenttitlequery.add_query( Ferret::Search::TermQuery.new(Ferret::Index::Term.new('title', params[:query])), Ferret::Search::BooleanClause::Occur::SHOULD ) #Add content+title query to main query query.add_query(contenttitlequery, Ferret::Search::BooleanClause::Occur::MUST) The problem is that index.search(query) always gives me 0 results. However, if I do index.search(query.to_s) it returns the results I expect.. Here's an example output of query.to_s: +(type:Foo type:Bar) +(content:baz title:baz) What I'm trying to do is search for items that is of type Foo or Bar and has "baz" in either the content or the title. Does anyone have an idea of what's going on here? Thanks, Tore -- Posted via http://www.ruby-forum.com/. 
From guest at guest.com Tue Jun 27 14:43:16 2006 From: guest at guest.com (guest) Date: Tue, 27 Jun 2006 20:43:16 +0200 Subject: [Ferret-talk] Can't run WEBRick with Plugin Message-ID: <4594716165d5e9ae8b72f042dfb98393@ruby-forum.com> After copying the acts_as_ferret pluin to my rails folder, I can't boot WEBrick. Has anyone run into this before? -- Posted via http://www.ruby-forum.com/. From lmarlow at yahoo.com Tue Jun 27 16:57:02 2006 From: lmarlow at yahoo.com (Lee Marlow) Date: Tue, 27 Jun 2006 14:57:02 -0600 Subject: [Ferret-talk] Bus Error with Ferret 0.9.3 using the BooleanQuery api In-Reply-To: References: <590e5e196dc8248a49d05ee5e480ecf6@ruby-forum.com> Message-ID: <7968d7490606271357t2e5b6ce3ub63a02b7843b3f11@mail.gmail.com> I attached a test case to this ticket that reproduces a bus error and a segmentation fault on my macbook pro and another linux machine. http://ferret.davebalmain.com/trac/attachment/ticket/62/bus_error_and_segmentation_fault_test_cast.diff We were using ferret to help with our site navigation since the info was already in the index and we wouldn't need to maintain a denormalized table for the hierarchical data. Ferret was called on every page, sometimes multiple times per page. This instability has caused us to pull back on our use of ferret on the site. Let me know if there's more information I can provide that would help. Thanks -Lee On 6/16/06, David Balmain wrote: > On 6/11/06, Michael Koziarski wrote: > > Hey guys, > > > > I've been trying out ferret 0.9.3 on my powerbook this weekend and I've > > been triggering 'bus errors' when using the Query API. If I > > programmatically build up strings, it works just fine. > > > > There's some more information available in the trac ticket > > > > http://ferret.davebalmain.com/trac/ticket/62 > > > > Is anyone successfully using the Query API on mac os x? Anything I can > > do to help debug this? 
> > Hi Michael, > > I noticed in your code you are using BooleanQuery#add_clause method > but you are adding a query. Can you try BooleanQuery#add_query > instead? Let me know if that helps. > > Cheers, > Dave > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From dbalmain.ml at gmail.com Tue Jun 27 20:05:26 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 28 Jun 2006 09:05:26 +0900 Subject: [Ferret-talk] Any fast way to update non-indexed fields? In-Reply-To: <5253EAF7-F689-4EA5-B344-157A489AC3C7@rectangular.com> References: <35fb43248589cfd29fa45a3dc40f58bb@ruby-forum.com> <1fca4495d014627f8f807ad00247cffc@ruby-forum.com> <5253EAF7-F689-4EA5-B344-157A489AC3C7@rectangular.com> Message-ID: On 6/23/06, Marvin Humphrey wrote: > [resending... for some reason, this didn't go through this morning...] > > On Jun 22, 2006, at 7:45 AM, Sergei Serdyuk wrote: > > >> If I were to wish for something in coming Ferret, I'd wish > >> "stability". > >> I am getting seg_faults every other time I am doing this: > > Dave, I see you've done some work with Valgrind, but I'm not sure how > much. To catch errors and memory leaks with KinoSearch, I wrote up a > simple script that runs the whole test suite under Valgrind. The > test suite takes around 15 minutes to run that way instead of 9 > seconds (on the one box where I have Valgrind available), so I only > run it rarely -- always when preparing a release, and sometimes when > debugging new or refactored C code. Some of the code in KinoSearch's > test suite doesn't even produce output; it's just there to exercise > an area where there might be memory problems. > > Do you have something like that going on with Ferret? It's been > extremely helpful for me. 
I don't think I've seen a single segfault > bug report since KinoSearch was released, though I have missed a > couple memory leaks because the Valgrind output can be a little hard > to interpret (there are a few harmless items in Perl that look like > memory leaks to Valgrind, which makes real leaks harder to spot). > Hi Marvin, I do use Valgrind. In fact the reason I have been so quiet on the list lately is I've been working really hard on cleaning up the code in Ferret so that I can realease a more stable version. The tool I need to make more use of is gcov. The problem is that some areas of the code just aren't getting exercised enough. Cheers, Dave From dbalmain.ml at gmail.com Tue Jun 27 20:24:28 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 28 Jun 2006 09:24:28 +0900 Subject: [Ferret-talk] Bus Error with Ferret 0.9.3 using the BooleanQuery api In-Reply-To: <7968d7490606271357t2e5b6ce3ub63a02b7843b3f11@mail.gmail.com> References: <590e5e196dc8248a49d05ee5e480ecf6@ruby-forum.com> <7968d7490606271357t2e5b6ce3ub63a02b7843b3f11@mail.gmail.com> Message-ID: Thanks Lee, I'm working on it. On 6/28/06, Lee Marlow wrote: > I attached a test case to this ticket that reproduces a bus error and > a segmentation fault on my macbook pro and another linux machine. > http://ferret.davebalmain.com/trac/attachment/ticket/62/bus_error_and_segmentation_fault_test_cast.diff > > We were using ferret to help with our site navigation since the info > was already in the index and we wouldn't need to maintain a > denormalized table for the hierarchical data. Ferret was called on > every page, sometimes multiple times per page. This instability has > caused us to pull back on our use of ferret on the site. > > Let me know if there's more information I can provide that would help. 
> > Thanks > > -Lee > > On 6/16/06, David Balmain wrote: > > On 6/11/06, Michael Koziarski wrote: > > > Hey guys, > > > > > > I've been trying out ferret 0.9.3 on my powerbook this weekend and I've > > > been triggering 'bus errors' when using the Query API. If I > > > programmatically build up strings, it works just fine. > > > > > > There's some more information available in the trac ticket > > > > > > http://ferret.davebalmain.com/trac/ticket/62 > > > > > > Is anyone successfully using the Query API on mac os x? Anything I can > > > do to help debug this? > > > > Hi Michael, > > > > I noticed in your code you are using BooleanQuery#add_clause method > > but you are adding a query. Can you try BooleanQuery#add_query > > instead? Let me know if that helps. > > > > Cheers, > > Dave > > _______________________________________________ > > Ferret-talk mailing list > > Ferret-talk at rubyforge.org > > http://rubyforge.org/mailman/listinfo/ferret-talk > > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From dbalmain.ml at gmail.com Tue Jun 27 20:36:06 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 28 Jun 2006 09:36:06 +0900 Subject: [Ferret-talk] Partition results based on field In-Reply-To: References: <44af620d578e923f7693f6077898324f@ruby-forum.com> Message-ID: On 6/27/06, Trent Steele wrote: > David Balmain wrote: > > Hi Trent, > > > > The way to do this is to search for more than you need and then > > actually go through each search result and count the types in a hash, > > only adding a doc if it's type count is under the threshold. If you > > failed to retrieve enough results then search again and repeat until > > you get the required number of results. For those of you who know the > > Lucene API, this is where a Hits class comes in handy. It'll be coming > > in a future version. 
For now I'll show you the easiest way by doing a > > search and setting :num_docs to max_doc, thereby getting all search > > results in one go; > > > > def get_results(search_str, max_type = 5, num_required = 10) > > type_counter = Hash.new(0) > > results = [] > > index.search_each(search_str, :num_docs => index.size) do > > |doc_id, score| > > doc = index[doc_id] > > if type_counter[doc[:type]] < max_type > > results << doc > > type_counter[doc[:type]] += 1 > > end > > break if results.size >= num_required > > end > > return results > > end > > > > Hope that helps, > > Dave > > Hi, > > I suspected I'd have to do something like this. Thanks for putting me on > the right path. Are there any concerns about scalability/speed when the > index grows larger regarding searching the whole index like this? As long as you're using the C-backed version of Ferret, the index would have to grow very large before speed becomes a concern in this case. Note that Ferret actually has to go through every single search result anyway to check its score, no matter what you have num_docs set to. The only thing that you are using more of with a high value of num_docs is memory (approximately 12 bytes per hit). Cheers, Dave From kraemer at webit.de Wed Jun 28 04:27:22 2006 From: kraemer at webit.de (Jens Kraemer) Date: Wed, 28 Jun 2006 10:27:22 +0200 Subject: [Ferret-talk] Can't run WEBRick with Plugin In-Reply-To: <4594716165d5e9ae8b72f042dfb98393@ruby-forum.com> References: <4594716165d5e9ae8b72f042dfb98393@ruby-forum.com> Message-ID: <20060628082722.GN15787@cordoba.webit.de> On Tue, Jun 27, 2006 at 08:43:16PM +0200, guest wrote: > After copying the acts_as_ferret plugin to my rails folder, I can't boot > WEBrick. Has anyone run into this before? No, but if you gave some more info (what WEBrick prints out when it refuses to start, for example), we could probably help. regards, Jens -- webit!
Gesellschaft für neue Medien mbH www.webit.de Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de Schnorrstraße 76 Tel +49 351 46766 0 D-01069 Dresden Fax +49 351 46766 66 From david.wennergren at gmail.com Wed Jun 28 06:42:41 2006 From: david.wennergren at gmail.com (David Wennergren) Date: Wed, 28 Jun 2006 12:42:41 +0200 Subject: [Ferret-talk] Problem searching with special characters Message-ID: <8f10bea7d388a58c7a7222950cb41fa4@ruby-forum.com> I'm using Ferret on a Swedish website and I get some unexpected behaviour on searches containing the Swedish characters åäö. For example, if I index the string "Varför fungerar det inte" ("Why doesn't it work" in Swedish) and search for "för", I get one (1) match. The expected behaviour would be no matches, since 'för' is only part of the word 'varför'. And if I search for "varfö*" it returns no matches, where I expect one. My guess is that it has something to do with the UTF-8 encoding, but I can't figure out exactly what it is... I'm using the StandardAnalyzer, by the way. Any ideas? Thanks /David -- Posted via http://www.ruby-forum.com/. From dbalmain.ml at gmail.com Wed Jun 28 10:51:14 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 28 Jun 2006 23:51:14 +0900 Subject: [Ferret-talk] Using QueryParser vs building my own query In-Reply-To: References: Message-ID: On 6/27/06, Tore Darell wrote: > Hello all > > I finally caved in and decided I should build my own query instead of > relying on QueryParser to do the job for me, but I've hit a strange > problem.. > > Here's how I build my query:
> > Here's how I build my query: > > #Main query > query = Ferret::Search::BooleanQuery.new > > #Build query to match types > typesquery = Ferret::Search::BooleanQuery.new > @selected_types.each{|type| > typesquery.add_query( > Ferret::Search::TermQuery.new(Ferret::Index::Term.new('type', > type)), > Ferret::Search::BooleanClause::Occur::SHOULD > ) > } > > #Add types query to main query > query.add_query(typesquery, Ferret::Search::BooleanClause::Occur::MUST) > > #Build query to match content and title > contenttitlequery = Ferret::Search::BooleanQuery.new > contenttitlequery.add_query( > Ferret::Search::TermQuery.new(Ferret::Index::Term.new('content', > params[:query])), > Ferret::Search::BooleanClause::Occur::SHOULD > ) > contenttitlequery.add_query( > Ferret::Search::TermQuery.new(Ferret::Index::Term.new('title', > params[:query])), > Ferret::Search::BooleanClause::Occur::SHOULD > ) > > #Add content+title query to main query > query.add_query(contenttitlequery, > Ferret::Search::BooleanClause::Occur::MUST) > > > > The problem is that index.search(query) always gives me 0 results. > However, if I do index.search(query.to_s) it returns the results I > expect.. > > Here's an example output of query.to_s: > > +(type:Foo type:Bar) +(content:baz title:baz) > > What I'm trying to do is search for items that is of type Foo or Bar and > has "baz" in either the content or the title. > > Does anyone have an idea of what's going on here? You must have used a lowercasing analyzer. You'll need to lowercase the "Foo" and "Bar". QueryParser will do that for you. An easy fix. Cheers, Dave From dbalmain.ml at gmail.com Wed Jun 28 11:03:43 2006 From: dbalmain.ml at gmail.com (David Balmain) Date: Thu, 29 Jun 2006 00:03:43 +0900 Subject: [Ferret-talk] Problem searching with special characters In-Reply-To: <8f10bea7d388a58c7a7222950cb41fa4@ruby-forum.com> References: <8f10bea7d388a58c7a7222950cb41fa4@ruby-forum.com> Message-ID: Hi David, Are you using Windows? 
Ferret on Windows won't handle UTF-8, unfortunately. If you are not on Windows, could you check your locale with "puts Ferret.locale"? You can try setting the locale too. Let me know if you are still having problems. Cheers, Dave On 6/28/06, David Wennergren wrote: > I'm using Ferret on a Swedish website and I get some unexpected > behaviour on searches containing the Swedish characters åäö. > > For example, if I index the string "Varför fungerar det inte" ("Why doesn't > it work" in Swedish) and search for "för", I get one (1) match. The > expected behaviour would be no matches, since 'för' is only part of the word > 'varför'. > > And if I search for "varfö*" it returns no matches, where I expect one. > > My guess is that it has something to do with the UTF-8 encoding, but I > can't figure out exactly what it is... > > I'm using the StandardAnalyzer, by the way. > > Any ideas? > > Thanks > > /David > > -- > Posted via http://www.ruby-forum.com/. From far_neil at yahoo.ca Wed Jun 28 20:20:47 2006 From: far_neil at yahoo.ca (Neil Brandt) Date: Thu, 29 Jun 2006 02:20:47 +0200 Subject: [Ferret-talk] Possibly same issue as 'duplicate search results' topic? Message-ID: <65035c5b03ad8f5702afd5a59a8c5d79@ruby-forum.com> Unfortunately I'm a newbie to Ruby, Rails, and acts_as_ferret. Also, I'm working in code I mostly didn't write, so I'm kind of unsure of things. So this may be a dumb question. It also may be the same issue as the topic 'Duplicate search results', but I'm not sure. When I update column values with ApplicationController's update_attribute, I get additional Ferret index entries instead of the old one being replaced. As a result, searches get hits when you search on a historical value that is no longer the value of the field. I assume that I am somehow misusing acts_as_ferret?
My call to add the mixin to the model class looks like this: acts_as_ferret :store_class_name => true, :fields => ['id', 'code', 'description'] As a test (and a workaround), in acts_as_ferret.rb module InstanceMethods, I replaced: alias :ferret_update :ferret_create with: def ferret_update self.ferret_destroy self.ferret_create end This ensures unique index entries for me. But I'm guessing it's redundant and something else is not used or working as intended. Thanks for reading, Neil -- Posted via http://www.ruby-forum.com/. From anatol.pomozov at gmail.com Thu Jun 29 00:10:12 2006 From: anatol.pomozov at gmail.com (Anatol Pomozov) Date: Thu, 29 Jun 2006 06:10:12 +0200 Subject: [Ferret-talk] Ferret Win32 Gem for windows users ... In-Reply-To: <448EBEB0.2010401@blackkettle.org> References: <126EC586577FD611A28E00A0C9A037587E6BE6@maui.bmsoft.com.au> <4483EA12.3020003@blackkettle.org> <448D4AE4.8040603@blackkettle.org> <0cc93aa5b293ee16fff9e44b63401803@ruby-forum.com> <448EBEB0.2010401@blackkettle.org> Message-ID: <632829a0387d8308e76245634f109563@ruby-forum.com> +1 I always wonder why MSVC is used for Ruby and not MinGW. MinGW is much better for multiplatform apps. For example, the Postgres project uses MinGW to build the Windows version of its product, and they have no problems. Why is the Ruby community stuck on MSVC, then? Alex Young wrote: > As it stands, I'm recompiling ruby under MinGW, and then attacking the > extensions. The medium-term goal is to be able to give some pointers to > Curt Hibbs et al (should they need them - I'm sure they've got it in > hand) to be able to replace the MSVC build with a MinGW build in the > One-Click Installer. -- Posted via http://www.ruby-forum.com/. From kraemer at webit.de Thu Jun 29 03:18:19 2006 From: kraemer at webit.de (Jens Kraemer) Date: Thu, 29 Jun 2006 09:18:19 +0200 Subject: [Ferret-talk] Possibly same issue as 'duplicate search results' topic?
In-Reply-To: <65035c5b03ad8f5702afd5a59a8c5d79@ruby-forum.com> References: <65035c5b03ad8f5702afd5a59a8c5d79@ruby-forum.com> Message-ID: <20060629071818.GP15787@cordoba.webit.de> Hi, On Thu, Jun 29, 2006 at 02:20:47AM +0200, Neil Brandt wrote: > Unfortunately I'm a newbie to ruby, rails, and acts_as_ferret. Also, > I'm working in code I mostly didn't write, so I'm kind of unsure of > things. So this may be a dumb question. It also may be the same issue > as the topic 'Duplicate search results', but I'm not sure. > > When I update column values with ApplicationController's > update_attribute, I am getting additional ferret index entries rather > than replacing the old one. As a result, searches get hits when you > search on a historical value that is no longer the value of the field. > > I assume that I am somehow misusing acts_as_ferret? > > > My call to add the mixin to the model class looks like this: > > acts_as_ferret :store_class_name => true, :fields => ['id', 'code', > 'description'] You should not list the 'id' field in the list of fields; it's always added to the index automatically. That will probably solve your problem. regards, Jens -- webit! Gesellschaft für neue Medien mbH www.webit.de Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de Schnorrstraße 76 Tel +49 351 46766 0 D-01069 Dresden Fax +49 351 46766 66 From guest at guest.com Thu Jun 29 05:04:36 2006 From: guest at guest.com (Guest) Date: Thu, 29 Jun 2006 11:04:36 +0200 Subject: [Ferret-talk] Installing ferret on windows In-Reply-To: <58f5ef49ecc9010691f6c569aecba978@ruby-forum.com> References: <58f5ef49ecc9010691f6c569aecba978@ruby-forum.com> Message-ID: Hey Marcio, I ran into the same thing just yesterday! Ughh! I don't have any insight into this, but just wanted to let you know you're not the only one. Marcio wrote: > Hello, I'm running Rails 1.1.2 and Ferret 0.9.3 and can't install > acts_as_ferret.
> > I tried the following instructions from the acts_as_ferret wiki: > "Installation > > Please use > > script/plugin install > svn://projects.jkraemer.net/acts_as_ferret/tags/plugin/stable/acts_as_ferret > > for easy installation of the current stable version of the plugin. At > the moment this is version 0.2.1 based on Subversion Rev. 51. This is > supposed to work with Ferret 0.9.3 and Rails >= 1.0. " > > (ruby script/plugin install in my case, as I'm on a Windows box) > > I did what it says. I also tried changing the svn URL from /tags/ to > /trunk/, but it does not work, and it doesn't print any messages at the > command prompt. > > Does anyone know how I can install it? > > Thanks in advance -- Posted via http://www.ruby-forum.com/. From JanPrill at blauton.de Thu Jun 29 05:10:34 2006 From: JanPrill at blauton.de (Jan Prill) Date: Thu, 29 Jun 2006 11:10:34 +0200 Subject: [Ferret-talk] Installing ferret on windows In-Reply-To: References: <58f5ef49ecc9010691f6c569aecba978@ruby-forum.com> Message-ID: <562a35c10606290210l33e70789t59d180f6f350a9dc@mail.gmail.com> Hi you two, I've run into some issues with the script/plugin script on Windows over the last few days myself, and they have nothing to do with acts_as_ferret. You don't have to rely on script/plugin to install acts_as_ferret. With a Subversion client such as TortoiseSVN on Windows, or whatever you like, you can check out the acts_as_ferret repository and simply copy the checked-out version to RAILS_ROOT/vendor/plugins, and you're ready to go... Cheers, Jan On 6/29/06, Guest wrote: > > > Hey Marcio, > > I ran into the same thing just yesterday! Ughh! I don't have any insight > into this, but just wanted to let you know you're not the only one. > > Marcio wrote: > > Hello, I'm running Rails 1.1.2 and Ferret 0.9.3 and can't install > > acts_as_ferret.
> > > > I tried the following instructions from the acts_as_ferret wiki: > > "Installation > > > > Please use > > > > script/plugin install > > > svn://projects.jkraemer.net/acts_as_ferret/tags/plugin/stable/acts_as_ferret > > > > for easy installation of the current stable version of the plugin. At > > the moment this is version 0.2.1 based on Subversion Rev. 51. This is > > supposed to work with Ferret 0.9.3 and Rails >= 1.0. " > > > > (ruby script/plugin install in my case, as I'm on a Windows box) > > > > I did what it says. I also tried changing the svn URL from /tags/ to > > /trunk/, but it does not work, and it doesn't print any messages at the > > command prompt. > > > > Does anyone know how I can install it? > > > > Thanks in advance > > > -- > Posted via http://www.ruby-forum.com/. From toredarell at gmail.com Thu Jun 29 09:23:37 2006 From: toredarell at gmail.com (Tore Darell) Date: Thu, 29 Jun 2006 15:23:37 +0200 Subject: [Ferret-talk] Using QueryParser vs building my own query In-Reply-To: References: Message-ID: <045378fe84a32b12af7e682b9c88c385@ruby-forum.com> David Balmain wrote: > On 6/27/06, Tore Darell wrote: >> >> #Add types query to main query >> Ferret::Search::TermQuery.new(Ferret::Index::Term.new('title', >> The problem is that index.search(query) always gives me 0 results. >> Does anyone have an idea of what's going on here? > You must have used a lowercasing analyzer. You'll need to lowercase > the "Foo" and "Bar". QueryParser will do that for you. An easy fix. > > Cheers, > Dave You're right, thanks. For some reason I had set the type field to be tokenised.. Tore -- Posted via http://www.ruby-forum.com/.
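The mismatch Dave describes can be reproduced without Ferret. This is a toy sketch only, assuming an analyzer that lowercases tokens at index time; ToyIndex and its methods are made up for illustration and are not the Ferret API. A hand-built term lookup uses the term exactly as given, so "Foo" misses the indexed term "foo", while a parsed query string goes through the analyzer first and finds it.

```ruby
# Toy inverted index (hypothetical, not Ferret): term => list of doc ids.
class ToyIndex
  def initialize
    @postings = Hash.new { |hash, term| hash[term] = [] }
  end

  # Stand-in for a lowercasing analyzer: split on whitespace, downcase.
  def analyze(text)
    text.split.map { |token| token.downcase }
  end

  # Index a document: every analyzed term points back at doc_id.
  def add(doc_id, text)
    analyze(text).each { |term| @postings[term] << doc_id }
  end

  # Like a hand-built term query: the term is used exactly as given.
  def term_lookup(term)
    @postings.fetch(term, [])
  end

  # Like a parsed query string: the text is analyzed before lookup.
  def parsed_lookup(text)
    analyze(text).map { |term| term_lookup(term) }.flatten.uniq
  end
end

index = ToyIndex.new
index.add(1, "Foo")
index.add(2, "Bar")

p index.term_lookup("Foo")     # prints []  (wrong case, no match)
p index.term_lookup("foo")     # prints [1]
p index.parsed_lookup("Foo")   # prints [1]
```

In terms of the thread, that leaves two ways out: lowercase the term before building the query by hand, or index the field untokenized so that it is stored as given, which appears to be what Tore's reply points to.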
From far_neil at yahoo.ca Thu Jun 29 09:29:01 2006 From: far_neil at yahoo.ca (Neil Brandt) Date: Thu, 29 Jun 2006 15:29:01 +0200 Subject: [Ferret-talk] Possibly same issue as 'duplicate search results' topic? In-Reply-To: <20060629071818.GP15787@cordoba.webit.de> References: <65035c5b03ad8f5702afd5a59a8c5d79@ruby-forum.com> <20060629071818.GP15787@cordoba.webit.de> Message-ID: Yes, that's it. That was easy! Thank you very much. Neil Jens Kraemer wrote: > Hi, > > On Thu, Jun 29, 2006 at 02:20:47AM +0200, Neil Brandt wrote: >> I assume that I am somehow misusing acts_as_ferret? >> >> >> My call to add the mixin to the model class looks like this: >> >> acts_as_ferret :store_class_name => true, :fields => ['id', 'code', >> 'description'] > > you should not name the 'id' field in the list of fields, it's always > added to the index automatically. Probably this will solve your problem. > > regards, > Jens > > -- > webit! Gesellschaft für neue Medien mbH www.webit.de > Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de > Schnorrstraße 76 Tel +49 351 46766 0 > D-01069 Dresden Fax +49 351 46766 66 -- Posted via http://www.ruby-forum.com/. From charlie.hubbard at gmail.com Thu Jun 29 10:00:14 2006 From: charlie.hubbard at gmail.com (Charlie) Date: Thu, 29 Jun 2006 16:00:14 +0200 Subject: [Ferret-talk] find_by_contents not returning SearchResults? Message-ID: The acts_as_ferret documentation says find_by_content returns an instance of SearchResults, but I see this error when I try to use the results.

undefined method `total_hits' for []:Array

Here is the link to the documentation:

http://projects.jkraemer.net/acts_as_ferret/rdoc/classes/FerretMixin/Acts/ARFerret/ClassMethods.html#M000010

But here is the actual code:

result = []
hits = index_searcher.search(query, options)
hits.each do |hit, score|
  id = index_searcher.reader.get_document(hit)[:id]
  begin
    res = self.find(id)
    result << res if res
    logger.debug "result id: #{id}, result: #{res}"
  rescue
    logger.debug "no data for id #{id}"
  end
end
return result

What's wrong? Do I need another version of the plugin?

--
Posted via http://www.ruby-forum.com/.

From kraemer at webit.de Thu Jun 29 10:11:19 2006
From: kraemer at webit.de (Jens Kraemer)
Date: Thu, 29 Jun 2006 16:11:19 +0200
Subject: [Ferret-talk] find_by_contents not returning SearchResults?
In-Reply-To:
References:
Message-ID: <20060629141119.GU15787@cordoba.webit.de>

On Thu, Jun 29, 2006 at 04:00:14PM +0200, Charlie wrote:
> The acts_as_ferret documentation says find_by_contents returns an
> instance of SearchResults, but I see this error when I try to use the
> results.
>
> undefined method `total_hits' for []:Array
>
> Here is the link to the documentation:
>
> http://projects.jkraemer.net/acts_as_ferret/rdoc/classes/FerretMixin/Acts/ARFerret/ClassMethods.html#M000010
>
> But here is the actual code:
>
> result = []
> hits = index_searcher.search(query, options)
> hits.each do |hit, score|
>   id = index_searcher.reader.get_document(hit)[:id]
>   begin
>     res = self.find(id)
>     result << res if res
>     logger.debug "result id: #{id}, result: #{res}"
>   rescue
>     logger.debug "no data for id #{id}"
>   end
> end
> return result
>
> What's wrong? Do I need another version of the plugin?

yes, this feature is only in svn trunk at this time. The API docs are
generated from trunk automatically, maybe we should fix this and keep
the docs for each version we release.

Jens

--
webit!
Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

From charlie.hubbard at gmail.com Thu Jun 29 10:22:06 2006
From: charlie.hubbard at gmail.com (Charlie)
Date: Thu, 29 Jun 2006 16:22:06 +0200
Subject: [Ferret-talk] find_by_contents not returning SearchResults?
Message-ID: <18b5e6df6953edb1513ed745aee4b17c@ruby-forum.com>

The acts_as_ferret documentation says find_by_contents returns an
instance of SearchResults, but I see this error when I try to use the
results.

undefined method `total_hits' for []:Array

Here is the link to the documentation:

http://projects.jkraemer.net/acts_as_ferret/rdoc/classes/FerretMixin/Acts/ARFerret/ClassMethods.html#M000010

What's wrong?

Charlie

--
Posted via http://www.ruby-forum.com/.

From Pedro.CorteReal at iantt.pt Thu Jun 29 10:44:40 2006
From: Pedro.CorteReal at iantt.pt (Pedro Côrte-Real)
Date: Thu, 29 Jun 2006 15:44:40 +0100
Subject: [Ferret-talk] find_by_contents not returning SearchResults?
In-Reply-To: <20060629141119.GU15787@cordoba.webit.de>
References: <20060629141119.GU15787@cordoba.webit.de>
Message-ID: <1151592281.12185.1.camel@localhost.localdomain>

On Thu, 2006-06-29 at 16:11 +0200, Jens Kraemer wrote:
> yes, this feature is only in svn trunk at this time. The API docs are
> generated from trunk automatically, maybe we should fix this and keep
> the docs for each version we release.

Are you also going to fix this for the next version? There's a patch
attached that I am using, although the code is a little hackish.
Without it sorting results seems impossible.

http://projects.jkraemer.net/acts_as_ferret/ticket/9

Pedro.

From kraemer at webit.de Thu Jun 29 13:04:05 2006
From: kraemer at webit.de (Jens Kraemer)
Date: Thu, 29 Jun 2006 19:04:05 +0200
Subject: [Ferret-talk] find_by_contents not returning SearchResults?
In-Reply-To: <1151592281.12185.1.camel@localhost.localdomain>
References: <20060629141119.GU15787@cordoba.webit.de> <1151592281.12185.1.camel@localhost.localdomain>
Message-ID: <20060629170405.GV15787@cordoba.webit.de>

On Thu, Jun 29, 2006 at 03:44:40PM +0100, Pedro Côrte-Real wrote:
> On Thu, 2006-06-29 at 16:11 +0200, Jens Kraemer wrote:
> > yes, this feature is only in svn trunk at this time. The API docs are
> > generated from trunk automatically, maybe we should fix this and keep
> > the docs for each version we release.
>
> Are you also going to fix this for the next version? There's a patch
> attached that I am using, although the code is a little hackish.
> Without it sorting results seems impossible.
>
> http://projects.jkraemer.net/acts_as_ferret/ticket/9

I think you speak of this one -
http://projects.jkraemer.net/acts_as_ferret/ticket/20

this will be fixed in the soon-to-be-released next version.

Jens

--
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

From Pedro.CorteReal at iantt.pt Thu Jun 29 13:19:48 2006
From: Pedro.CorteReal at iantt.pt (Pedro Côrte-Real)
Date: Thu, 29 Jun 2006 18:19:48 +0100
Subject: [Ferret-talk] find_by_contents not returning SearchResults?
In-Reply-To: <20060629170405.GV15787@cordoba.webit.de>
References: <20060629141119.GU15787@cordoba.webit.de> <1151592281.12185.1.camel@localhost.localdomain> <20060629170405.GV15787@cordoba.webit.de>
Message-ID: <1151601589.12185.6.camel@localhost.localdomain>

On Thu, 2006-06-29 at 19:04 +0200, Jens Kraemer wrote:
> On Thu, Jun 29, 2006 at 03:44:40PM +0100, Pedro Côrte-Real wrote:
> > On Thu, 2006-06-29 at 16:11 +0200, Jens Kraemer wrote:
> > > yes, this feature is only in svn trunk at this time. The API docs are
> > > generated from trunk automatically, maybe we should fix this and keep
> > > the docs for each version we release.
> >
> > Are you also going to fix this for the next version? There's a patch
> > attached that I am using, although the code is a little hackish.
> > Without it sorting results seems impossible.
> >
> > http://projects.jkraemer.net/acts_as_ferret/ticket/9
>
> I think you speak of this one -
> http://projects.jkraemer.net/acts_as_ferret/ticket/20

Yes, although the one I posted was a similar issue. I messed up when
looking for my bug report.

> this will be fixed in the soon-to-be-released next version.

Cool. I hate having non-standard patches to stuff. It would also be cool
to have a cleaner API to do sorting than the Ferret one. One that uses
the field names passed to acts_as_ferret. Ferret is great but its API
seems to be too much like Java and not like most Ruby APIs. I ended up
building a small class to encapsulate searching for my Rails model to
hide all that away.

Pedro.

From kraemer at webit.de Thu Jun 29 17:05:58 2006
From: kraemer at webit.de (Jens Kraemer)
Date: Thu, 29 Jun 2006 23:05:58 +0200
Subject: [Ferret-talk] find_by_contents not returning SearchResults?
In-Reply-To: <1151601589.12185.6.camel@localhost.localdomain>
References: <20060629141119.GU15787@cordoba.webit.de> <1151592281.12185.1.camel@localhost.localdomain> <20060629170405.GV15787@cordoba.webit.de> <1151601589.12185.6.camel@localhost.localdomain>
Message-ID: <20060629210558.GA30043@cordoba.webit.de>

On Thu, Jun 29, 2006 at 06:19:48PM +0100, Pedro Côrte-Real wrote:
[..]
> > this will be fixed in the soon-to-be-released next version.
>
> Cool. I hate having non-standard patches to stuff. It would also be cool
> to have a cleaner API to do sorting than the Ferret one. One that uses
> the field names passed to acts_as_ferret. Ferret is great but its API
> seems to be too much like Java and not like most Ruby APIs. I ended up
> building a small class to encapsulate searching for my Rails model to
> hide all that away.

Good point, but I'd rather wait for ferret's upcoming API changes before
doing such changes in acts_as_ferret.

Jens

--
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer kraemer at webit.de
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

From me at julik.nl Thu Jun 29 22:59:47 2006
From: me at julik.nl (Julik)
Date: Fri, 30 Jun 2006 04:59:47 +0200
Subject: [Ferret-talk] Substantial problems with write locking (and other flux)
Message-ID: <8603da99552206db2a0720a2261abf55@ruby-forum.com>

I am having some great trouble keeping my Ferret indexer for
ActiveRecord working.

First the get_field_names disappears (now back), then I am collecting
some major trouble with locking. Same thing here:

exception 6 not handled: Could not obtain write lock when trying to
write index

A snippet like this just deadlocks retrying endlessly:

begin
  @ferret_index << doc
  @ferret_index.flush()
  @ferret_index.close()
rescue Exception => e # No, he couldn't define a proper class for this
  if e.to_s.include?('Could not obtain write lock')
    reopen_index # opens the index again!
    retry
  else
    raise e
  end
end

How are we supposed to handle concurrency with a file store? I can't
find anything in the wiki and actually I am getting very frustrated.
It's the third gem update of Ferret and my plugin just got broken, I
can't repair it since.

Also the habit of throwing Exceptions is somewhat obnoxious because they
are not standard errors. Might Ferret one day get its own error class tree?

--
Posted via http://www.ruby-forum.com/.
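
One reason the snippet above can spin forever is that the retry has no
upper bound. What follows is a minimal sketch of a bounded retry with a
small, growing delay. It is plain Ruby and makes no claim about Ferret's
behaviour: WriteLockError, with_lock_retry and FlakyIndex are hypothetical
names invented for the example (the post says Ferret raised a plain
Exception at the time), and the fake index simply fails a fixed number of
times before accepting a write.

```ruby
# Hypothetical error class standing in for the lock failure.
class WriteLockError < StandardError; end

# Runs the block, retrying on WriteLockError up to max_attempts times.
# Sleeps a little longer after each failure, then re-raises the last
# error so the caller sees it instead of looping forever.
def with_lock_retry(max_attempts: 5, base_delay: 0.01)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue WriteLockError
    raise if attempts >= max_attempts
    sleep(base_delay * attempts)
    retry
  end
end

# Fake index (not Ferret): refuses the first `failures` writes.
class FlakyIndex
  attr_reader :docs

  def initialize(failures)
    @failures = failures
    @docs = []
  end

  def <<(doc)
    if @failures > 0
      @failures -= 1
      raise WriteLockError, "Could not obtain write lock"
    end
    @docs << doc
    self
  end
end

index = FlakyIndex.new(2)
attempts_used = with_lock_retry do |attempt|
  index << "doc"
  attempt
end
puts "stored #{index.docs.size} doc(s) after #{attempts_used} attempt(s)"
```

Giving up after a fixed number of attempts turns a silent hang into a
visible error, which is usually easier to debug inside a unit test.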
From dbalmain.ml at gmail.com Fri Jun 30 19:02:58 2006
From: dbalmain.ml at gmail.com (David Balmain)
Date: Sat, 1 Jul 2006 08:02:58 +0900
Subject: [Ferret-talk] Substantial problems with write locking (and other flux)
In-Reply-To: <8603da99552206db2a0720a2261abf55@ruby-forum.com>
References: <8603da99552206db2a0720a2261abf55@ruby-forum.com>
Message-ID:

On 6/30/06, Julik wrote:
> I am having some great trouble keeping my Ferret indexer for
> ActiveRecord working.
>
> First the get_field_names disappears (now back), then I am collecting
> some major trouble with locking. Same thing here:

I ported 20,000 lines of Ruby code to C. I apologize if some parts of
the API got left out in the process. People wanted a faster search
library so that's what I'm working on. It's still beta, although it
should still be alpha.

> exception 6 not handled: Could not obtain write lock when trying to
> write index

Do you have more than one process writing to the index? The C version
of Ferret currently doesn't wait long enough for the lock to be
released. This behaviour will improve once I get the current release
finished although I'm guessing that locking problems are always going
to occur.

> A snippet like this just deadlocks retrying endlessly:
>
> begin
>   @ferret_index << doc
>   @ferret_index.flush()
>   @ferret_index.close()
> rescue Exception => e # No, he couldn't define a proper class for this
>   if e.to_s.include?('Could not obtain write lock')
>     reopen_index # opens the index again!
>     retry
>   else
>     raise e
>   end
> end

> How are we supposed to handle concurrency with a file store? I can't
> find anything in the wiki and actually I am getting very frustrated.
> It's the third gem update of Ferret and my plugin just got broken, I
> can't repair it since.
>
> Also the habit of throwing Exceptions is somewhat obnoxious because they
> are not standard errors. Might Ferret one day get its own error class tree?

I'm working on this too.
I've done a significant rewrite of cFerret so that it will gel better
with Ruby. I've basically rewritten 10,000 LOC which is why I haven't
had time to respond to the most recent tickets. All the problems people
are having are going to be addressed. Exception handling will be
significantly improved. Hopefully I'll be able to eliminate segfaults.
A Windows version is also on the way.

So please be patient. Or fix it yourself and send in a patch.

Cheers,
Dave

From ryan at theryanking.com Fri Jun 30 19:21:49 2006
From: ryan at theryanking.com (ryan king)
Date: Sat, 1 Jul 2006 01:21:49 +0200
Subject: [Ferret-talk] acts_as_ferret rdoc
Message-ID: <363e75235d2229acb19f84afb2b1dddf@ruby-forum.com>

The wiki for acts_as_ferret claims that the rdoc is available at
http://projects.jkraemer.net/acts_as_ferret/rdoc, but that page 404s. Is
the rdoc up somewhere?

thanks,
ryan

--
Posted via http://www.ruby-forum.com/.

From me at julik.nl Fri Jun 30 21:49:11 2006
From: me at julik.nl (Julik)
Date: Sat, 1 Jul 2006 03:49:11 +0200
Subject: [Ferret-talk] Substantial problems with write locking (and other flux)
In-Reply-To:
References: <8603da99552206db2a0720a2261abf55@ruby-forum.com>
Message-ID:

David Balmain wrote:
> It's still beta, although it
> should still be alpha.

Got the message :-)

> Do you have more than one process writing to the index? The C version
> of Ferret currently doesn't wait long enough for the lock to be
> released. This behaviour will improve once I get the current release
> finished although I'm guessing that locking problems are always going
> to occur.

No, it throws within a unit test which runs inside a single process. I
ain't even got to any real concurrency yet :-)

> I'm working on this too. I've done a significant rewrite of cFerret so
> that it will gel better with Ruby. I've basically rewritten 10,000 LOC
> which is why I haven't had time to respond to the most recent tickets.

I realise I vented too much, so apologies for that.
Great kudos to you for the work you are doing on Ferret.

Would you please put out a gem with the "get_field_names" back in the
picture? This is the only thing broken after some major patching I did
on my plugin. Would you be so kind as to give me a tip on what
"get_field_names" ought to return when cast to an Array, so that I can
replicate its functionality? (I have the names of fields stored
elsewhere, I just need to pass them when constructing the search.)

TIA.

--
Posted via http://www.ruby-forum.com/.

From me at julik.nl Fri Jun 30 21:50:18 2006
From: me at julik.nl (Julik)
Date: Sat, 1 Jul 2006 03:50:18 +0200
Subject: [Ferret-talk] Substantial problems with write locking (and other flux)
In-Reply-To:
References: <8603da99552206db2a0720a2261abf55@ruby-forum.com>
Message-ID: <2057f6a26dd78ef9355774349d1cf8af@ruby-forum.com>

Julik wrote:
> No, it throws within a unit test which runs inside a single process. I
> ain't even got to any real concurrency yet :-)

And to be fair - the problem seems to be removed altogether by using
auto_flush. So now the get_field_names thing is the only one left, and
then I can hack further and switch an app to Ferret.

--
Posted via http://www.ruby-forum.com/.

From dbalmain.ml at gmail.com Fri Jun 30 22:08:55 2006
From: dbalmain.ml at gmail.com (David Balmain)
Date: Sat, 1 Jul 2006 11:08:55 +0900
Subject: [Ferret-talk] find_by_contents not returning SearchResults?
In-Reply-To: <20060629210558.GA30043@cordoba.webit.de>
References: <20060629141119.GU15787@cordoba.webit.de> <1151592281.12185.1.camel@localhost.localdomain> <20060629170405.GV15787@cordoba.webit.de> <1151601589.12185.6.camel@localhost.localdomain> <20060629210558.GA30043@cordoba.webit.de>
Message-ID:

On 6/30/06, Jens Kraemer wrote:
> On Thu, Jun 29, 2006 at 06:19:48PM +0100, Pedro Côrte-Real wrote:
> [..]
> > > this will be fixed in the soon-to-be-released next version.
> >
> > Cool. I hate having non-standard patches to stuff.
It would also be cool
> > to have a cleaner API to do sorting than the Ferret one. One that uses
> > the field names passed to acts_as_ferret. Ferret is great but its API
> > seems to be too much like Java and not like most Ruby APIs. I ended up
> > building a small class to encapsulate searching for my Rails model to
> > hide all that away.
>
> Good point, but I'd rather wait for ferret's upcoming API changes before
> doing such changes in acts_as_ferret.
>
> Jens

That's a very good plan. While we are on the subject, how do you think
the sort API should look? Once we get to a 1.0 release we are going to
be stuck with that API for a while so I want to get it right before then
and the sooner the better. Also, what other areas of the API do you feel
need work? For starters, I'll be getting rid of the Parameter class.
Instead of Field::Index::TOKENIZED it'll just be :index => :yes or
:index => :untokenized etc. Anyway, I'd love to hear any feedback on any
part of the API. Let's start with the Sort API.

Cheers,
Dave
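
To make the question about a field-name-based sort API concrete, here is
one possible shape, written as a plain-Ruby sketch. It is purely
hypothetical and is not Ferret's or acts_as_ferret's API: sort_by_fields
and the :field / :reverse option names are invented for illustration, in
the same symbol-and-hash style as the :index => :untokenized example
above, and it sorts ordinary hashes rather than search hits.

```ruby
# Hypothetical helper, not part of Ferret: sorts an array of hashes by
# one or more fields. Each spec is either a field name (ascending) or a
# hash such as { :field => :year, :reverse => true }.
def sort_by_fields(records, *specs)
  # Normalize every spec to a [field, reverse] pair.
  normalized = specs.map do |spec|
    if spec.is_a?(Hash)
      [spec.fetch(:field), spec.fetch(:reverse, false)]
    else
      [spec, false]
    end
  end

  records.sort do |a, b|
    result = 0
    # Compare field by field; the first non-equal field decides the order.
    normalized.each do |field, reverse|
      cmp = a[field] <=> b[field]
      cmp = -cmp if reverse
      unless cmp.zero?
        result = cmp
        break
      end
    end
    result
  end
end

records = [
  { :title => "b", :year => 2005 },
  { :title => "a", :year => 2004 },
  { :title => "a", :year => 2006 }
]

sorted = sort_by_fields(records, :title, { :field => :year, :reverse => true })
sorted.each { |record| puts "#{record[:title]} #{record[:year]}" }
```

The sketch assumes every record has a comparable value for each sort
field; a real API would also need to say what happens when a field is
missing or untokenized.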