By: Leigh Klotz, Jr
RE: Word definition and debug [ reply ] 2007-09-22 23:14
|
Here's the current state of my word_hash_for_words.
It would probably make sense to parameterize these definitions in Classifier, perhaps via a block if not parameters.
My apologies for the indentation; I can't figure out how to preserve it.
I also tried to improve efficiency by deferring the expensive operations (downcasing, the skip-word lookup, and stemming) until after the cheap tests have passed.
This code isn't what I'd recommend; it's just a snapshot of what I'm using, to help with the effort to parameterize word finding.
# def word_hash_for_words(words)
#   d = Hash.new
#   words.each do |word|
#     word.downcase! if word =~ /[\w]+/
#     key = word.stem.intern
#     if word =~ /[^\w]/ || ! CORPUS_SKIP_WORDS.include?(word) && word.length > 3
#       d[key] ||= 0
#       d[key] += 1
#     end
#   end
#   return d
# end
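def word_hash_for_words(words)
  d = Hash.new
  words.each do |word|
    # Cheap tests first: has a word character, is not all digits, and is 2 to 23 characters long.
    if word =~ /[\w]/ && word !~ /^\d+$/ && word.length > 1 && word.length < 24
      word.downcase!
      # Only stem words that are not on the skip list.
      if ! CORPUS_SKIP_WORDS.include?(word)
        key = word.stem.intern
        d[key] ||= 0
        d[key] += 1
      end
    end
  end
  return d
end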
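For the parameterization idea mentioned above, here is a rough sketch of how a block could supply the word rule. This is not part of Classifier: word_hash_with_filter, DEFAULT_WORD_FILTER, SKIP_WORDS, and the normalizer argument are all hypothetical names made up for illustration. SKIP_WORDS stands in for CORPUS_SKIP_WORDS, and the normalizer stands in for word.stem so the sketch runs without a stemmer.

```ruby
# Hypothetical sketch; none of these names exist in the Classifier gem.
SKIP_WORDS = %w[the and for with this that].freeze # stand-in for CORPUS_SKIP_WORDS

# Default rule, mirroring the tests in the current word_hash_for_words:
# has a word character, is not all digits, and is 2 to 23 characters long.
DEFAULT_WORD_FILTER = lambda do |word|
  word =~ /\w/ && word !~ /^\d+$/ && word.length > 1 && word.length < 24
end

# Counts the words accepted by the filter. A caller-supplied block replaces
# the default rule. The normalizer is an assumed stand-in for word.stem.
def word_hash_with_filter(words, normalizer = ->(w) { w }, &filter)
  filter ||= DEFAULT_WORD_FILTER
  counts = Hash.new(0)
  words.each do |word|
    next unless filter.call(word)   # cheap tests before anything expensive
    w = word.downcase               # non-destructive, leaves the caller's string alone
    next if SKIP_WORDS.include?(w)
    counts[normalizer.call(w).intern] += 1
  end
  counts
end

default_counts = word_hash_with_filter(%w[X11 12345 the IO a Parser parser])
long_only      = word_hash_with_filter(%w[X11 ab abcd]) { |w| w.length > 3 }
```

Called without a block it applies the default rule; passing a block such as { |w| w.length > 3 } swaps in a different definition of a word without touching the counting logic.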
|
By: Leigh Klotz, Jr
Word definition and debug [ reply ] 2007-09-22 23:09
|
I found that the word definition needed to be tuned for my application. Specifically, acronyms of 2 and 3 letters with mixed alphanumerics were excluded by the length > 2 rule.
I also wanted to exclude plain numbers and any strings that were too long, so I arbitrarily chose 24 characters as the cutoff.
To test all this, I used this routine:
module Classifier
  # add debug info to Bayes
  class Bayes
    def words()
      w = Array.new
      @categories.each do |category, category_words|
        category_words.keys.each { |word| w << word }
      end
      w
    end
  end
end
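To show what that routine returns, here is a small self-contained sketch. ClassifierSketch::Bayes is a hypothetical stand-in, not the real Classifier::Bayes, and it assumes that @categories maps each category to a hash of word => count. The sample data is made up.

```ruby
# Stand-in for Classifier::Bayes so the example runs without the gem.
# Assumption: @categories maps each category to a hash of word => count.
module ClassifierSketch
  class Bayes
    def initialize(categories)
      @categories = categories
    end

    # Same idea as the words routine above, written with flat_map.
    def words
      @categories.flat_map { |_category, category_words| category_words.keys }
    end
  end
end

# Made-up training state for illustration only.
bayes = ClassifierSketch::Bayes.new({
  spam: { x11: 2, offer: 1 },
  ham:  { parser: 3, x11: 1 }
})

all_words = bayes.words
# A word seen in several categories appears once per category, so uniq helps when eyeballing the list.
puts all_words.uniq.sort.join(' ')
```

Dumping the list like this makes it easy to see whether short acronyms survived and whether plain numbers or overlong strings slipped through.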