Discussion Forums: open-discussion


By: Leigh Klotz, Jr
RE: Word definition and debug
2007-09-22 23:30
Thanks. It should use a set instead of a hash, but it's mostly for debugging as I've done it. Still, if someone cares about correctness, a set would be better.
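
For anyone who does want the set-based version, here is a minimal sketch. It assumes the categories are stored as a hash of category => { word => count }, which is how the debug routine in this thread treats @categories. The name unique_words and the sample data are made up for illustration and are not part of Classifier.

```ruby
require 'set'

# Hypothetical stand-in for the Bayes @categories hash: category => { word => count }.
# Collects each word key once, even if it appears in several categories.
def unique_words(categories)
  words = Set.new
  categories.each_value { |category_words| words.merge(category_words.keys) }
  words
end

# Made-up sample data, not taken from a real classifier.
categories = {
  :spam => { :offer => 3, :free => 2 },
  :ham  => { :free => 1, :meeting => 4 }
}
unique = unique_words(categories)
puts unique.size
```

A word that appears in more than one category (:free here) is reported once, where the array version would list it once per category.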

By: Cameron McBride
RE: Word definition and debug
2007-09-22 23:18
I'll have a closer look at your suggestions later this week. Thanks. (and feel free to keep posting!)

By: Leigh Klotz, Jr
RE: Word definition and debug
2007-09-22 23:14
Here's the current state of my word_hash_for_words.
It would probably make sense to parameterize these definitions in Classifier, perhaps via a block if not parameters.
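
One way the block idea could look, as a rough sketch only: the caller passes the "is this a word?" test as a block. The names count_words and SKIP_WORDS are hypothetical stand-ins rather than Classifier's API, and stemming is left out so the sketch runs without extra libraries.

```ruby
# Hypothetical skip list; the real CORPUS_SKIP_WORDS lives in the Classifier library.
SKIP_WORDS = %w[the and for with].freeze

# Counts words, letting the caller supply the word test as a block.
# With no block, every token that is not a skip word is counted.
def count_words(words)
  counts = Hash.new(0)
  words.each do |word|
    next if block_given? && !yield(word)
    key = word.downcase
    next if SKIP_WORDS.include?(key)
    counts[key.intern] += 1
  end
  counts
end

tokens = %w[The X11 protocol and 2007 X11]

# Caller-defined rule: at least two characters and not a plain number.
filtered = count_words(tokens) { |w| w.length > 1 && w !~ /\A\d+\z/ }

# No block: only the skip list applies, so "2007" is counted too.
unfiltered = count_words(tokens)
```

The block keeps the word definition out of the library, so each application can tune it without editing Classifier itself.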

My apologies for the indentation; I can't figure out how to preserve it.

I also tried to improve efficiency by not performing the expensive operations until after the cheap tests have passed.

This code isn't what I'd recommend; it's just a snapshot of what I'm using, to help with the effort to parameterize word finding.

# def word_hash_for_words(words)
#   d = Hash.new
#   words.each do |word|
#     word.downcase! if word =~ /[\w]+/
#     key = word.stem.intern
#     if word =~ /[^\w]/ || ! CORPUS_SKIP_WORDS.include?(word) && word.length > 3
#       d[key] ||= 0
#       d[key] += 1
#     end
#   end
#   return d
# end

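def word_hash_for_words(words)
  d = Hash.new(0)
  words.each do |word|
    # Cheap tests first: contains a word character, is not all digits,
    # and is between 2 and 23 characters long.
    next unless word =~ /\w/ && word !~ /\A\d+\z/ && word.length > 1 && word.length < 24
    # Non-destructive downcase, so the caller's strings are left alone.
    word = word.downcase
    next if CORPUS_SKIP_WORDS.include?(word)
    # Stemming is the expensive step, so it runs only for words that passed.
    d[word.stem.intern] += 1
  end
  d
end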
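
To make the effect of those tests concrete, here is a small sketch that applies the same conditions to a few made-up tokens. The helper name candidate_word? is hypothetical. It covers only the character and length tests, and leaves out the skip-word lookup and stemming.

```ruby
# Hypothetical helper mirroring the character and length tests used in
# word_hash_for_words: has a word character, is not all digits,
# and is 2 to 23 characters long.
def candidate_word?(word)
  !!(word =~ /\w/) && word !~ /\A\d+\z/ && word.length > 1 && word.length < 24
end

# Made-up sample tokens.
samples = ["X11", "G5", "2007", "a", "---", "x" * 24]
samples.each { |s| puts "#{s.inspect}: #{candidate_word?(s)}" }
```

Short mixed alphanumerics such as "X11" and "G5" pass, while plain numbers, single characters, punctuation-only tokens, and strings of 24 or more characters are rejected.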

By: Leigh Klotz, Jr
Word definition and debug
2007-09-22 23:09
I've found the word definition needed to be tuned for my application. Specifically, acronyms of 2 and 3 letters with mixed alphanumerics were excluded by the length > 2 rule.

I wanted to exclude just plain numbers, and any strings that were too long, so I arbitrarily chose 24 characters as the max.

To test all this, I used this routine:

module Classifier
  # Add debug info to Bayes.
  class Bayes
    # Returns every word key stored in any category.
    # A word that appears in several categories is listed once per category.
    def words
      w = []
      @categories.each do |_category, category_words|
        w.concat(category_words.keys)
      end
      w
    end
  end
end