My understanding is that each token in a token_stream consists of text along with start/stop positions that give the byte offsets of that text within the corresponding document field. The documentation I've been reading (O'Reilly, Ferret, page 67) suggests that these byte positions are relative to the entire field, but in my testing they appear to be relative to the line that contains the text. I read my fields following Brian McCallister:<br>
<br><pre> index.add_document :file => path, <br> :content => file.readlines<br></pre><br>Hence, if a file contains line breaks, the token positions are reset with each new line. For example, the following file contents (File A) <br>
<pre>this is a sentence</pre>result in a token for the text "sentence" with a start position of 10 (assuming "this" starts at position 0), while a file with a line break <br>
<pre>this is a <br>sentence</pre>results in a token for the text "sentence" with a start position of 0. I get the same results with my custom tokenizer as with StandardTokenizer. This does not seem consistent with the documentation. More importantly, global positions seem more useful than line-based positions (e.g., for highlighting).<br>
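To make the two behaviours concrete, here is a minimal sketch in plain Ruby that does not use Ferret at all. The helper `token_offsets` is hypothetical and only splits on whitespace, so it is a stand-in for a real tokenizer rather than a description of how StandardTokenizer works.<br>

```ruby
# Hypothetical helper, not part of Ferret: returns [text, start, stop]
# byte offsets for each whitespace-delimited token in str.
def token_offsets(str)
  offsets = []
  str.scan(/\S+/) do
    m = Regexp.last_match
    start = m.pre_match.bytesize
    offsets << [m[0], start, start + m[0].bytesize]
  end
  offsets
end

text = "this is a \nsentence"

# Whole field tokenized once: offsets are relative to the entire string.
global = token_offsets(text)

# Each line tokenized separately: offsets restart at 0 on every line.
per_line = text.lines.flat_map { |line| token_offsets(line) }

p global.last    # ["sentence", 11, 19]
p per_line.last  # ["sentence", 0, 8]
```

The second result matches what I am seeing; the first is what I expected from the documentation.<br>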
<br>Digging a little deeper, it seems that the tokenizer's initialize method is called each time the token_stream method of the containing analyzer is called:<br><br><pre>class CustomAnalyzer<br>  def token_stream(field, str)<br>    # a new tokenizer is constructed on every call<br>    StandardTokenizer.new(str)<br>  end<br>end<br></pre><br>Am I missing something here? Are the start/stop byte positions intended to be relative to the line? Is there a way for token_stream to be called only once for an entire string (even if it contains line breaks)? <br>
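One thing I have not ruled out is the shape of the value I pass in. The sketch below uses only plain Ruby to show that readlines yields an Array with one String per line, whereas File.read yields a single String. That Ferret analyzes each Array element separately is my assumption, not something I have confirmed, and the commented-out add_document call is untested.<br>

```ruby
require "tempfile"

# Write the two-line example to a temporary file so the sketch is self-contained.
file = Tempfile.new("example")
file.write("this is a \nsentence")
file.close

lines = File.readlines(file.path)  # Array with one String per line
whole = File.read(file.path)       # a single String holding the entire file

p lines  # ["this is a \n", "sentence"]
p whole  # "this is a \nsentence"

# Assumption: an Array field value is analyzed element by element, so only
# the single-String form would give offsets relative to the whole field.
# index.add_document :file => path, :content => File.read(path)

file.unlink
```

If that assumption holds, passing the single String might be enough to get global positions.<br>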
<br>Thanks,<br>John <br>