First, thanks to Jens K. for pointing out a stupid error on my part regarding the use of test_token_stream(). <br><br>My current problem: a custom tokenizer I've written in Ruby does not properly create an index (or at least searches on the index don't work). Using test_token_stream() I have verified that my tokenizer produces the expected token stream; in particular, each Token's attributes are set properly. Nevertheless, simple searches return zero results. <br>
<br>The essence of my tokenizer is to skip over XML tags in a file and to break up the text between them into tokens. I use this approach rather than Hpricot because I need to keep track of where the text sits relative to the XML tags: after a search for a phrase I'll want to extract the nearby XML tags, as they contain important context. My tokenizer (XMLTokenizer) contains the obligatory initialize, next and text= methods (shown below) as well as a lot of parsing methods. These are called at the top level by the method XMLTokenizer.get_next_token, which does the main work inside next. I haven't included the details of get_next_token, as I'm assuming that if each token it produces has the proper attributes then it shouldn't be the cause of the problem. What else should I be looking for? I've also been looking for a custom tokenizer written in Ruby to use as a model; any suggestions? <br>
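To make the idea concrete, here is a minimal plain-Ruby sketch of the skip-the-tags approach described above. It is not the actual XMLTokenizer and does not use Ferret at all: SketchToken and TagSkippingTokenizer are hypothetical names invented for illustration, the position increment is simply fixed at 1, and the tag handling is deliberately naive (anything between '<' and '>' is skipped). The real methods follow after it.

```ruby
# Hypothetical stand-in for a token: the text, its start/end offsets in the
# original string, and a position increment. Not a Ferret class.
SketchToken = Struct.new(:text, :start, :end, :pos_inc)

# Illustrative tokenizer: skips anything between '<' and '>' and returns the
# words found in between, with offsets measured against the original string.
class TagSkippingTokenizer
  def initialize(xml_text)
    self.text = xml_text
  end

  # Point the tokenizer at a new string and rewind to its start.
  def text=(xml_text)
    @text = xml_text
    @pos = 0
  end

  # Returns the next SketchToken, or nil once the input is exhausted.
  def next
    while @pos < @text.length
      ch = @text[@pos]
      if ch == '<'
        close = @text.index('>', @pos)
        return nil if close.nil?   # unterminated tag: give up
        @pos = close + 1           # resume just past the tag
      elsif ch =~ /\s/
        @pos += 1                  # skip whitespace between words
      else
        # A word runs until the next whitespace or the next tag.
        stop = @text.index(/[\s<]/, @pos) || @text.length
        token = SketchToken.new(@text[@pos...stop], @pos, stop, 1)
        @pos = stop
        return token
      end
    end
    nil
  end
end

# Usage: print each token with its offsets.
tokenizer = TagSkippingTokenizer.new('<doc><title>Hello world</title> again</doc>')
while (tkn = tokenizer.next)
  puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
end
```

Because the offsets refer to the untouched input, the tags surrounding a hit can be recovered later by scanning outward from a token's start and end positions.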
<br> def initialize(xmlText)<br> @xmlText = xmlText.gsub(/[;,!]/, ' ')<br> @currPtr = 0<br> @currWordStart = nil<br> @currTextStart = 0<br> @nextTagStart = 0<br> @startOfTextRegion = 0<br>
<br> @currTextStart = \<br> XMLTokenizer.skip_beyond_current_tag(@currPtr, @xmlText)<br> @nextTagStart = \<br> XMLTokenizer.skip_beyond_current_text(@currTextStart, @xmlText) <br> @currPtr = @currTextStart<br>
@startOfTextRegion = 1<br> end<br><br> def next<br> tkn = get_next_token<br> if tkn != nil<br> puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]<br> end<br>
return tkn<br> end<br><br> def text=(text)<br> initialize(text)<br> @xmlText<br> end<br><br>Below is text from a previous, related message that shows that StopFiltering is not working:<br><br><pre>
><i> I've written a tokenizer/analyzer that parses a file extracting tokens and<br></i>><i> operate this analyzer/tokenizer on ASCII data consisting of XML files (the<br></i>><i> tokenizer skips over XML elements but maintains relative positioning). I've<br>
</i>><i> written many unit tests to check the produced token stream and was<br></i>><i> confident that the tokenizer was working properly. Then I noticed two<br></i>><i> problems:<br></i>><i> <br></i>><i> 1. StopFilter (using English stop words) does not properly filter the<br>
</i>><i> token stream output from my tokenizer. If I explicitly pass an array of stop<br></i>><i> words to the stop filter it still doesn't work. If I simply switch my<br></i>><i> tokenizer to a StandardTokenizer the stop words are appropriately filtered<br>
</i>><i> (of course the XML tags are treated differently).<br></i>><br>><i> 2. When I try a simple search no results come up. I can see that my<br></i>><i> tokenizer is adding files to the index but a simple search (using<br>
</i>><i> Ferret::Index::Index.search_each) produces no results.<br></i></pre><br>Any suggestions are appreciated. <br><br>John<br>