Subject: Re: Mechanizing community information resources
From: Rich Morin <rdm@cfcl.com>
Date: Mon, 23 Sep 2002 12:29:42 -0700

Thanks to Forrest for the notes on RocketAware.  Indexing and organization
are certainly difficult issues.  I don't claim to have any special handle
on them, but here are some thoughts:

   *  It isn't possible to organize all knowledge into a single hierarchy.
      The "tree of knowledge fallacy", in fact, is the name for the belief
      that you can do so!

      As a side-issue, I've had some interesting discussions with folks
      about why people like to use hierarchies.  One friend thinks that it
      has to do with our basic brain structure.  My own take is that there
      are two basic reasons.  First, in the "real world", it's impossible
      to put something into two categories (read, containers) at the same
      time.  Second, even in the virtual world, setting up secondary links
      (e.g., aliases) is such a pain that it is reserved for situations
      which _really_ need it.  In short, hierarchies are simple, "natural",
      and work well for many situations, so we use them for everything (:-).

   *  Even in the areas where hierarchies work well, the trees can easily
      get "bushy" (too many entries at one level) or "stringy" (long chains
      with one child per node).  In CS terms, the searching advantage of a
      tree is the O(log N) search time it gives us; if the tree isn't well
      formed, searching degrades toward a linear scan and this advantage
      goes away.

   *  In areas where we use graphs (e.g., the FILES and SEE ALSO references
      for man pages, some sets of web pages), the bi-directional nature of
      links is seldom well exploited.  If I'm sitting on the "foo(1)" page,
      the only links I can use are the ones on this page; a link from the
      "bar(1)" page to the current page doesn't help me at all.  Also, any
      chains of links have to be traversed manually.

   *  Relationships between items may be fuzzy, unbalanced (A is important
      to B, but B is not all that relevant to A) or, worse, situationally
      dependent.  In one context, A and B may be strongly related; in
      others, they may be totally irrelevant to each other.

   *  A great many relationships can be "harvested" from operating systems,
      including man page cross-references, hard and symbolic links, time-
      based activity correlations, process-based file genealogies, etc.  By
      coupling this information with human annotations, it's possible to
      build up a picture of which files were accessed by which programs, to
      what purpose.
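
To make the point about one-way links and manual chain-following concrete,
here is a minimal Python sketch (not code from the FreeBSD Browser).  The
page names and the SEE_ALSO table are made-up sample data, and invert() and
reachable() are hypothetical helper names.  It inverts a set of SEE ALSO
references so that incoming links become usable, then walks chains of links
automatically:

```python
from collections import defaultdict, deque

# Hypothetical SEE ALSO data: page -> pages it references.
SEE_ALSO = {
    "foo(1)": ["baz(3)"],
    "bar(1)": ["foo(1)"],
    "baz(3)": ["qux(5)"],
    "qux(5)": [],
}

def invert(links):
    """Return a map from each page to the pages that reference it."""
    incoming = defaultdict(list)
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)
    return dict(incoming)

def reachable(links, start):
    """Breadth-first walk: pages reachable from start, with hop counts."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in hops:
                hops[target] = hops[page] + 1
                queue.append(target)
    del hops[start]  # the starting page is not its own neighbor
    return hops
```

With this sample data, invert(SEE_ALSO) lets a reader sitting on "foo(1)"
discover that "bar(1)" points at it, and reachable(SEE_ALSO, "foo(1)")
follows the foo -> baz -> qux chain without any manual clicking.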

My own approach, in any case, is to record graph-based relationships with
bi-directional, attributed links (read, a semantic network), then use rules
to determine how much importance to give a given link in a given context.
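
As a rough illustration of that approach (a sketch only, not the actual
design), the Python below stores each link once with a kind and a separate
strength in each direction, and treats a "context" as a table of per-kind
multipliers.  The Link and Network names, the link kinds, the strengths, and
the sample entries are all assumptions made up for this example; a real
rule set would presumably be much richer than a multiplier table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    source: str
    target: str
    kind: str        # hypothetical attribute, e.g. "see_also" or "reads"
    forward: float   # how important target is to source
    backward: float  # how important source is to target

class Network:
    def __init__(self):
        self.links = []

    def add(self, source, target, kind, forward, backward):
        self.links.append(Link(source, target, kind, forward, backward))

    def neighbors(self, node, rules):
        """Rank nodes linked to `node` in either direction.

        `rules` maps a link kind to a multiplier for the current context;
        kinds that are absent count as irrelevant (multiplier 0).
        """
        scores = {}
        for link in self.links:
            if link.source == node:
                other, base = link.target, link.forward
            elif link.target == node:
                other, base = link.source, link.backward
            else:
                continue
            weight = base * rules.get(link.kind, 0.0)
            if weight > 0:
                scores[other] = max(scores.get(other, 0.0), weight)
        # Strongest first; ties broken by name for stable output.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up sample data and two made-up contexts.
net = Network()
net.add("cron(8)", "/etc/crontab", "reads", 0.9, 0.6)
net.add("cron(8)", "crontab(5)", "see_also", 0.8, 0.3)

ADMIN = {"reads": 1.0, "see_also": 0.2}
DOCS = {"see_also": 1.0}
```

Here net.neighbors("cron(8)", ADMIN) ranks the file ahead of the man page,
while net.neighbors("cron(8)", DOCS) drops the file entirely.  Asking from
the other end, net.neighbors("crontab(5)", DOCS), still finds "cron(8)",
but with the weaker backward strength.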

I have written an exploratory system (the FreeBSD Browser) in Perl.  It
performed well, but it was both brittle and limited in capabilities.  I am
working toward a production implementation which will couple Perl with an
inference engine (e.g., OpenCyc).  The starting point for this system will
be a set of file management tools.

-r
-- 
email: rdm@cfcl.com; phone: +1 650-873-7841
http://www.cfcl.com/rdm    - my home page, resume, etc.
http://www.cfcl.com/Meta   - The FreeBSD Browser, Meta Project, etc.
http://www.ptf.com/dossier - Prime Time Freeware's DOSSIER series
http://www.ptf.com/tdc     - Prime Time Freeware's Darwin Collection