[Rubygems-developers] Suggestions: categories and querying

Eivind Eklund eivind at FreeBSD.org
Fri Sep 17 06:01:25 EDT 2004

The below is long because I've wanted to write on the topic of
classification in general for a long time.  It is slightly rambling
because I've been in physical pain while writing it, and don't have the
mental capacity to edit it properly.

It is a sort of essay - it winds through a lot of thoughts about
categorization, and ends up with some suggestions for what RubyGems can
do to get good results.

The essay can also be seen as a partial description of the flaw that
makes the entire world of software work as badly as it does.

On Thu, Sep 16, 2004 at 11:48:10AM +0200, Mauricio Fernández wrote:
> Well the code is there, but the ontology is yet to be created...

I think it would be a good idea to start with just stealing CPAN's
classifications.  Those classifications work fairly well for CPAN in
practice, and are maintained by a single smart "librarian" who is
actually doing the necessary work.

Do NOT take inspiration from FreeBSD's port categories - they are fairly
bad, and RubyGems should do much better.

> In the past you ("the RubyGems team") have expressed your concerns about
> classifications being too rigid, etc. I believe that the hierarchical
> classification should be managed carefully (look at RAA...), but you
> can give quite a lot of free room for developers by:
> * allowing multiple categorization
> * providing a keyword-based system in addition to the hierarchical approach.
>   You'd have to find a way to encourage reuse of existent keywords,
>   while making it possible to introduce new ones. 

Keywords and multiple hierarchical categories are, in reality, the same
thing.  Keywords are just a hierarchical categorization with only a
single level.  And most non-trivial categorization problems need
multiple categories.
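
To make the equivalence concrete, here is a minimal Ruby sketch.  It is
an illustration only, not anything RubyGems implements: the Category
struct, the find_packages helper, and the package and category names are
all invented for the example.  A category is a path of names, a keyword
is a path of length one, and a package may list any number of either.

```ruby
# Hypothetical sketch: a category is a path of names; a keyword is simply
# a path with a single level.
Category = Struct.new(:path) do
  # True when this category equals +other+ or sits somewhere below it.
  def within?(other)
    path.first(other.path.length) == other.path
  end

  def to_s
    path.join("/")
  end
end

# Parse "Net/HTTP/Client" into a three-level category and "testing" into
# a one-level category (a keyword).
def category(text)
  Category.new(text.split("/"))
end

# Each package lists several categories.  All names here are invented.
PACKAGES = {
  "example-http" => [category("Net/HTTP/Client"), category("web")],
  "example-mock" => [category("Development/Testing"), category("testing")],
}

# Find every package filed in +query+ or anywhere below it.
def find_packages(query)
  wanted = category(query)
  PACKAGES.select { |_, cats| cats.any? { |c| c.within?(wanted) } }.keys
end
```

With this representation the same lookup serves both styles:
find_packages("Net") walks down the hierarchy, while
find_packages("testing") is a plain keyword match.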

Fortunately, we've got a group of professionals that know a lot about
categorization, and have it as one of their primary work skills:
librarians.

I've got a couple of librarian friends and some in the family, and have
discussed the challenge of categorization with them a number of times.
The following are key insights I feel I've gotten from these
discussions, as well as observations about why (good) libraries work as
well as they do.

But first, a quick description of how a librarian works with categories.

Most libraries use a standard taxonomy (system of classification) for
technical books, a system known as the "Dewey Decimal Classification
System".  This is a hierarchical system of categories (hundreds of
thousands of categories), prepared by a common board of librarians and
updated every few years [FIXME verify against dewey.com].  As
books come into a library, the librarian goes through the category list
(there are several volumes listing categories if you use the complete
system) and picks out what should be the primary category for the book
(the shelves are usually ordered by the category numbers), and what
other categories the book should be in.  Then, depending on how
computerized the library is, the librarian either prints out index cards
for each category and puts them in the index, or just adds the category
data to the computer record in question.

* Obviously, the above process is a lot of work.  Especially the part of
  finding categories at each library sounds extremely wasteful.  So my
  first question on hearing it fully described was "Wouldn't it be a good
  idea to have the publisher or a central board or the Library of
  Congress or something assign categories once per book?"  The answer
  was quite surprising: "No."  It turns out that in order to optimize
  for the specific use a particular library sees, the librarians vary
  how they use the categories.  At least for professional libraries (my
  mother used an institute library she worked for as an example), they
  vary it enough that having pre-defined categories from a central
  repository would be of close to no help.

* Libraries vary which categories they use depending on need.  They have
  a LARGE set of categories available, and each library uses a part of
  it.  For instance, in the biotechnical institute library discussed
  above, there was a single shelf devoted to computer books, just a few
  hundred.  This meant that almost none of the subcategories for
  computers were in use, and that most of the depth there went unused -
  categorization stopped after a few levels.

  On the other hand, the biochemistry categories contained a lot of
  books, and there the depth and richness of the categorization system
  come into their own - to such a degree that there is significant
  judgement involved in the category use.

* Individual librarians do NOT add categories to their taxonomy.
  Instead, they use a centralized category set, even though they vary
  how they use it.

* Individual items (books) are placed with a LOT of side categories.
  Sometimes ten or more.  The books appear in all the relevant indexes
  where people might be looking for something like them.

* Each item is described with a brief but noticeable amount of
  information in the index.  This includes when it was published (AKA
  last updated), who the authors are, what organization has quality
  controlled it (the publisher), and what primary category it is filed
  under.

* There are two basic ways to categorize: By function, and by "what
  is the object".  Categorizing by function (what somebody uses the
  object for) breaks down a lot more often than categorizing by what the
  object is.

  Thus, librarians generally categorize by "what a thing is".  
  However, they fake a little - they also add in occasional references
  under the categories where somebody is likely to be looking when
  trying to perform a particular function.

* The categorization of books and software has one purpose and one
  purpose alone: Making it easy for somebody to find the right thing(s)
  when they are looking.

* Categorization is HARD.  There is a lot of choice involved in doing
  it, and a lot of skill.  How categorization is done varies from context
  to context and person to person.  Thus, in order to make it consistent,
  it is necessary to have a strongly communicating group with a strong
  common base and a lot of communication about categorization.

* Creating sets of categories is even harder than categorization.
  Ad-hoc creation of categories is seen as too difficult to get right
  even for people whose primary skill is working with categories.

* Even for professional categorizers, it is often necessary to discuss
  with another professional categorizer to find what category/ies to
  place something in.
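
Several of the points above (a centralized category set that individual
catalogers cannot extend, one primary category plus many side
categories, and a short descriptive record per item) can be sketched in
Ruby.  This is a rough illustration under my own assumptions, not a
proposal for an actual RubyGems API: the Catalog class, its method
names, and the sample categories and package are all hypothetical.

```ruby
require "set"

# Hypothetical sketch of a catalog backed by a centrally maintained taxonomy.
class Catalog
  # The brief record kept for each item in the index.
  Entry = Struct.new(:name, :authors, :updated, :primary, :side)

  def initialize(taxonomy)
    @taxonomy = Set.new(taxonomy)            # the central category set
    @index = Hash.new { |h, k| h[k] = [] }   # category => entries
  end

  # File an entry under one primary category and any number of side
  # categories.  Categories outside the central taxonomy are rejected
  # rather than silently added.
  def file(name, authors:, updated:, primary:, side: [])
    categories = [primary, *side].uniq
    unknown = categories.reject { |c| @taxonomy.include?(c) }
    raise ArgumentError, "not in taxonomy: #{unknown.join(', ')}" unless unknown.empty?

    entry = Entry.new(name, authors, updated, primary, side)
    categories.each { |c| @index[c] << entry }
    entry
  end

  # Names of everything listed under a category, primary or side.
  def lookup(category)
    @index.fetch(category, []).map(&:name)
  end
end

# Invented sample data: the item shows up in every index it was filed under.
catalog = Catalog.new(["Net/HTTP", "Text/XML", "Web/Scraping"])
catalog.file("example-fetcher", authors: ["A. Author"], updated: "2004-09-17",
             primary: "Net/HTTP", side: ["Web/Scraping"])
catalog.lookup("Web/Scraping")
```

The one deliberate design choice is that file raises on an unknown
category: adding a category is a change to the shared taxonomy, not
something that happens as a side effect of filing one item.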

I believe the primary problem for the software community isn't that we
do not have the software we need - it is that we cannot FIND the
software that is already written that does what we need.  Very often,
the search cost for finding some piece of software exceeds the cost of
writing said software.

Good archive work (like librarians do) requires a lot of thought, it
requires good basic tools (the category set to use), and it requires
doing the work many many times in order to learn how to think properly.
It also requires coordination and common learning between several
people, to get good results.

The best chance RubyGems has of making this work at all well is to
provide a "categorization service" - have authors come on IRC or mail in
descriptions of their packages, and help them find what categories to
list that package in.  Most authors will write one, two, maybe five
packages.  Classifying five packages is nowhere near enough to gain the
experience to know what categories to use.  Also, we'd like the
categories to be good from the first package an author writes - so
somebody that is familiar with the taxonomy needs to assign the
categories, and getting familiar with the taxonomy to that level (which
is a much deeper level than just using it to search with) is too large a
burden on the authors.  Specifically, it is a burden they will not
shoulder - so unless somebody else does it, the work will be done badly
and the categorization will be useless.  (See e.g. Freshmeat for what
happens when you let authors do categorization - the categories there
are very very close to useless, as at least half of what is in each
category is wrong for the category, and about half the packages are
missing from categories they should have been in.)

So: The authors will need help.

Also, there is a need to establish a suitably large taxonomy.  The ones
I've seen in use for computer software are woefully lacking.  The ones
that I'd take inspiration from are CPAN, Freshmeat and RubyForge; all of
them have their problems, but they are at least decent attempts.

However, to make this work out, taking inspiration from good places
probably isn't enough.  One has to have a software librarian use the
taxonomy to categorize stuff, get feedback from users, change the
taxonomy and recategorize, then lather, rinse, repeat - a number of
times.  And one needs several people to help the librarian with this.

There are two places that could do this well at the moment: RAA (if
somebody adopted doing the librarian work for it), and RPA (which has
Mauricio as its librarian already).  I think RubyGems' best bet is to
NOT add categorization at all at this time, but instead cooperate
closely with one of the above, and help them generate really good
categorization, and when good categories are available, start helping
authors find categories for their software.

Anything else is doomed to chaos and a false sense of being helpful.

