Subject: Re: mechanised documentation and my business model solution
From: Rich Morin <rdm@cfcl.com>
Date: Sun, 26 Mar 2006 03:04:41 -0800

At 10:33 PM -0800 3/25/06, Ben Tilly wrote:
> On 3/25/06, Rich Morin <rdm@cfcl.com> wrote:
>> At 6:55 PM -0800 3/25/06, Ben Tilly wrote:
> I didn't use wikis as an example.  I used source control
> systems, of which there are many, and which generally
> support versioning pretty well.

Versioning is central to the task of source code control.
So, these systems spend quite a bit of effort on the issue.
Most wikis do not, but they may borrow some functionality.

For example, TWiki (a file-based wiki) keeps its files in
RCS.  This solves part of the versioning problem, but not
all of it.  CVS or Subversion would certainly solve other
issues, if either could be shoe-horned in.  In any case,
we've gone beyond both my interest and my expertise.


> * Tree-like data structures don't play well with a RDBMS.

Trees (and, more to the point, graphs) are hard to map onto
RDBMSs and SQL.  FWIW, Joe Celko has a book on the topic:

  "Joe Celko's Trees and Hierarchies in SQL for Smarties"
  Morgan Kaufmann, 2004, ISBN 1-55860-920-2
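
To make both the difficulty and one of Celko's workarounds
concrete, here's a minimal sketch of the "nested sets"
encoding from that book (Python + SQLite; the table and the
data are made up):

  import sqlite3

  # Nested-set encoding: each node gets an (lft, rgt) interval;
  # a descendant's interval nests inside its ancestor's.
  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE node (name TEXT, lft INT, rgt INT)")
  db.executemany("INSERT INTO node VALUES (?, ?, ?)",
                 [("root", 1, 8),
                  ("usr",  2, 7),
                  ("bin",  3, 4),
                  ("lib",  5, 6)])

  # All descendants of "usr" in one query -- no recursion
  # needed, which is the whole point of the encoding.
  for (name,) in db.execute(
          """SELECT child.name FROM node AS child, node AS parent
             WHERE  parent.name = 'usr'
             AND    child.lft > parent.lft
             AND    child.rgt < parent.rgt"""):
      print(name)   # bin, lib

The catch is that inserting a node means renumbering every
interval to its right; storing trees in SQL is possible, just
not painless.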


However, the strengths of RDBMSs are sufficient that many
graph-based systems (eg, Protege) are built on top of them,
though not always in a way that makes DBMS experts happy.


> * RDBMS are usually not set up to efficiently store many
>   versions of the same document.

Given the cost of current and upcoming storage media, great
storage inefficiencies may be quite acceptable.  Also, note
that systems such as RCS spend CPU cycles (and the user's
time) to gain storage efficiency.  This may not be the best
trade-off, particularly when searching is involved.
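
As a strawman for the cheap-storage end of that trade-off,
here's a sketch (Python + SQLite; the schema is hypothetical)
that stores every version in full, so that old revisions can
be searched with a single query:

  import sqlite3

  db = sqlite3.connect(":memory:")
  # Each version is stored whole: no deltas, so no
  # reconstruction cost at read or search time.
  db.execute("""CREATE TABLE page_version
                (page TEXT, version INT, body TEXT,
                 PRIMARY KEY (page, version))""")
  db.execute("INSERT INTO page_version VALUES"
             " ('Intro', 1, 'first draft')")
  db.execute("INSERT INTO page_version VALUES"
             " ('Intro', 2, 'second draft')")

  # Searching old revisions needs no RCS-style patch replay.
  hits = db.execute("""SELECT page, version FROM page_version
                       WHERE body LIKE '%draft%'""").fetchall()
  print(hits)   # [('Intro', 1), ('Intro', 2)]

An RCS-style delta store would have to reconstruct each
revision before searching it; this burns disk to avoid
exactly that.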


> Several of the wins that you list also seem to me to be
> wins for different kinds of applications than wikis.  For
> instance, wikis are pretty much all text, so consistent
> keys, data types, etc aren't a big factor.

Current wikis make little use of connectivity information,
link typing, etc.  Semantic Wikis could change this story
substantially.  In any case, it's nice that RDBMSs are able
to handle these things reliably and efficiently.


>> ... my aspirations were never that lofty.
>
> They seemed pretty close to me.  The fundamental ...

I'm quite a fan of "documentation generators" (eg, Doxygen,
JavaDoc, POD).  Knuth's Literate Programming is nifty, in
principle, but appears to be too laborious for most coders.

But if the only benefit of these systems were to cause the
programmers to write comments "in the code", there would be
no reason to have an automated extraction system.  Just put
in the comments and be done with it!

The real win of these systems, IMNSHO, is the integration of
human-generated text with machine-harvested data.  I can go
to a Doxygen web page and see the text, supplemented by all
sorts of links, tooltips, and relationship diagrams.  This
lets me "browse" the code at a high level, dropping down to
the source code when desired.
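
As a toy version of that harvesting step, here's a sketch
(Python, using its standard "ast" module) that extracts each
function's name, docstring, and outgoing calls, ie, the raw
material for those links and diagrams:

  import ast, sys

  # Usage: python harvest.py some_module.py
  tree = ast.parse(open(sys.argv[1]).read())

  for node in ast.walk(tree):
      if isinstance(node, ast.FunctionDef):
          # Human-generated text ...
          doc = ast.get_docstring(node) or "(undocumented)"
          # ... plus machine-harvested relationships.
          calls = sorted({c.func.id for c in ast.walk(node)
                          if isinstance(c, ast.Call)
                          and isinstance(c.func, ast.Name)})
          print("%s: %s" % (node.name, doc))
          print("  calls: %s" % ", ".join(calls))

Each harvested "calls" entry could become a link on the
generated page; that's where the browsing win comes from.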


> However such systems fall into two general categories.
> The first are clear improvements that you take for granted
> and use as a base for your expectations and desires.
> (Consider the lowly FAQ.)

You say that like it's a bad thing (:-).


> The second are ones that solve problems that someone else
> thought was important, but you don't.

I don't understand this; please explain.


> I'd find this vision far more plausible if you had some
> clearly laid out use cases ...

Good point.  Let's assume that our user is a developer
who is diving into an unfamiliar body of code, possibly to
debug it, make changes, or perform system integration.

As noted above, documentation generators integrate text and
data, presenting a combination of overview material, detailed
information, relationship diagrams, etc.  This can be quite
useful in the situations described above.

However, existing documentation generators are quite limited.
They only deal with the kinds of entities and relationships
that can be found in the source code.  So, for example, they
ignore topics such as access control, bug reports, data flow,
test results, etc.  They also ignore existing documentation,
such as man pages.  As a result, they cannot support browsing
over the full range of topics that a developer might desire.


Another use case might involve a system administrator who is
trying to resolve a configuration problem.  S/he has little or
no interest in the source code, but access control and data
flow may be quite relevant.  So, it should be easy to find out
which files are part of a subsystem, which files are typically
accessed (in what manner) by which programs and users, etc.
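
Concretely, suppose an input filter has already loaded access
records (harvested from, say, lsof output or audit logs) into
a database; the schema and data here are made up:

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE access (program TEXT, file TEXT,"
             " mode TEXT)")
  db.executemany("INSERT INTO access VALUES (?, ?, ?)",
                 [("httpd", "/etc/httpd/httpd.conf", "read"),
                  ("httpd", "/var/log/access_log", "write"),
                  ("logrotate", "/var/log/access_log", "write")])

  # "Which programs touch this file, and how?"
  for row in db.execute("""SELECT program, mode FROM access
                           WHERE file = '/var/log/access_log'"""):
      print(row)

Wrapped in a generated wiki page, a query like that answers
the administrator's question directly.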

Although the documentation system can't discern the reason why
a file gets read, it can easily detect the absence of any text
on the subject.  And, if it makes this apparent to developers
(and convenient to remedy), something might get done about it.

Even if the developers fail to provide this information, all
is not lost.  The user can post a question on the page (it's a
wiki, after all), causing a notification to be sent out to any
"interested" parties.  Finally, the user can add comments to
the documentation, easily and with no administrative time lag.


In a final use case, an operator might want a "dashboard" that
displays summary information on the OS (eg, how much CPU time
and memory each program has used over the course of the day).
This would be something like the "metrics" application
described in the rant, but with a much simpler set of
constraints.

A simple daemon can collect the relevant information and put
it in a database.  A mechanically-augmented wiki can then
let the operator embed requests for particular "reports" in
a given page.  The format of these requests might be
something like the "getlets" Thomas Gries described in his
Wikimania 2005 slide set:

  Getlets: extending the interwiki concept beyond any limits
  http://upload.wikimedia.org/wikibooks/en/a/a9/Wikimania05_Workshop_TG3.pdf

Of course, the requests would have to be fielded by some sort
of server, but a combination of SQL queries and output filters
can wash quite a bit of this sort of laundry:

  Polyglot Programming
  http://www.cfcl.com/~rdm/weblog/archives/000998.html
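
Here's a minimal sketch of that daemon-plus-report pipeline
(Python + SQLite; the "usage" table and the faked sample are
hypothetical).  A getlet on the dashboard page would, in
effect, expand into the query at the end:

  import sqlite3, time

  db = sqlite3.connect("metrics.db")
  db.execute("""CREATE TABLE IF NOT EXISTS usage
                (stamp REAL, program TEXT,
                 cpu_secs REAL, mem_kb INT)""")

  def sample():
      # A real daemon would poll ps(1) or /proc on a timer;
      # we fake a single observation here.
      db.execute("INSERT INTO usage VALUES (?, ?, ?, ?)",
                 (time.time(), "httpd", 1.5, 20480))
      db.commit()

  def daily_report():
      # The SQL that a "report" request might expand into.
      day_ago = time.time() - 86400
      return db.execute("""SELECT program, SUM(cpu_secs),
                                  MAX(mem_kb)
                           FROM usage WHERE stamp > ?
                           GROUP BY program""",
                        (day_ago,)).fetchall()

  sample()
  print(daily_report())

An output filter would then render those rows as wiki markup
or an HTML table.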


> A few messages ago you summarized what you want as a system
> where the developer adds a few comments for the documentation
> system, everything gets tied together into a model, and then
> great documentation is produced.

Well, "few" and "great" are your terms.  What I would say is that
a documentation system can track instances of the entities and
relationships it "knows about".  It can combine this data with
any available annotations, generating a page for each instance.
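
In sketch form (Python; both dictionaries are stand-ins for
real input filters and wiki edits), "combine harvested data
with annotations, one page per instance" might look like:

  # Machine-harvested instances (from input filters).
  harvested = {
      "/etc/passwd": {"class": "file",
                      "read_by": ["login", "ls"]},
      "login":       {"class": "program",
                      "reads": ["/etc/passwd"]},
  }

  # Human-generated annotations (from wiki edits).
  annotations = {
      "/etc/passwd": "One entry per account; see passwd(5).",
  }

  # One generated page per instance: harvested data, plus
  # whatever annotation (if any) a human has supplied.
  for name, data in sorted(harvested.items()):
      print("== %s (%s) ==" % (name, data["class"]))
      for key, value in sorted(data.items()):
          if key != "class":
              print("  %s: %s" % (key, ", ".join(value)))
      print("  notes: %s" % annotations.get(name, "(none yet)"))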


> I look at that vision and I think that programmers are going
> to have to learn a new (and in the end probably fairly large)
> meta language to figure out how the meta language works, ...

The Doxygen developers need to understand the vagaries of data
structures and functions, but a site administrator does not.
Because the "model" is embodied in the Doxygen program, only
the input (eg, source code directories) needs to be specified.
This can take some time, but it isn't all that difficult.

In my proposed system, the administrator might have more things
to configure (depending on how smart my installation code is :),
but s/he shouldn't need a deep understanding of the model.

So, although model development uses a combination of knowledge
engineering and domain expertise (and a meta-language, as you
suspect :-), simply installing and using the system shouldn't
require any of this.

I would also hope to keep the meta-language simple and obvious
enough, and the system modular enough, that additions could be
made by any competent programmer.


> ... the gap between the different kinds of documentation that
> different audiences need does not seem to me to be possible
> to bridge in any automated kind of way.

Well, there is certainly some truth to that.  A novice may need
a simplified tutorial, while an expert will want great detail.
Writing for diverse audiences is challenging for a human writer;
asking a program to generate this sort of text is AI-complete.

However, this isn't what I'm proposing to do.  Rather, my goal
is to let the user inspect some entity (eg, document, file,
program) and find out what other entities are related to it, in
what ways.  By letting the user follow "interesting" links (and
perhaps adjust which links are shown), the documentation system
can present a "view" of the target system that has the desired
amount of depth, detail, etc.


> ... your documentation system looks to be so complex that one
> would need a domain expert just to understand what it was
> trying to do.

Just as the typical user of Wikipedia (and indeed, the web) is
blissfully unaware of the underpinnings of the system, I expect
my users to be largely unaware of mine.  The only exception is
that, as the user learns about the types of entities that the
documentation system presents, s/he would be learning about the
nature of the target system.  This, indeed, is a key notion of
Model-based Documentation.


> And it looks like it needs a lot of information to be input
> into the system so it has something to work with - enough
> that you'd cripple any agile team that tried to use that
> as a documentation mechanism.

I think I covered this above, but let's be clear.  My system
can only document types of entities and relationships that it
"knows about".  However, as long as the team uses traditional
entities (eg, files, processes, programs) whose instances can
be harvested by existing input filters, the documentation
system should be able to handle any resulting combinations.


> The simpler any system is to use, the more likely people are
> to use it.  The more process that you require from people
> adding to the system, the less likely people are to [add to]
> it.

Violent agreement!  Neither the user nor the programmer should
need to know how the documentation system works.  Programmers
just (POD-style) add comments to the code.  Users just browse
pages and (wiki-style) add comments and questions.


> That's the genius of a wiki - it hits a sweet spot where
> it provides just enough to be useful, but simply enough
> that people really use it.

I've been following the literature on Semantic Wikis with a
great deal of interest.  One of the points that gets made a
lot is that contributors to (say) Wikipedia aren't going to
be willing to deal with formal ontologies, topic maps, etc.

So, the typical approach is to add a single "type" attribute
to the link syntax.  Given that we already know the source
and destination pages, this completes an RDF triple (subject,
predicate, object).  It then
becomes the job of the wiki (and follow-on) software to make
sense out of the resulting graph, use it for navigation, etc.
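
For instance, with the [[type::Target]] link syntax used by
Semantic MediaWiki (other Semantic Wiki projects vary),
pulling the triples out of a page is nearly trivial:

  import re

  page_name = "TWiki"
  page_text = ("TWiki is [[written_in::Perl]] and"
               " [[stores_data_in::RCS]].")

  # Each typed link completes a (subject, predicate, object)
  # triple: the page, the link type, and the link target.
  for pred, obj in re.findall(r"\[\[(\w+)::([^\]]+)\]\]",
                              page_text):
      print((page_name, pred, obj))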

Although I think this is a fine plan for Wikipedia, I don't
think it is an optimal solution for mechanized documentation.
Whereas Wikipedia's subject area is broad and chaotic, the
systems I'm talking about documenting are (comparatively :-)
constrained and well-organized.

So, a typical target system may have millions of instances,
but it will only have a few dozen classes.  Using this class
knowledge, we can structure collection and presentation of
the instance information.  A user doesn't have to know about
these classes; they simply get presented implicitly as part
of the generated content.

However, because it IS a wiki, any user can create new pages
that comment on and/or link to the generated pages, etc.
This may well produce a "sweet spot" of its own...


In closing, I should emphasize that the preceding only gives
MY notions about how to do mechanized documentation.  Even if
I'm off track on the specifics, ANY general and useful set of
tools for mechanized documentation would benefit vendors who
want their source code to be readily understood.

-r
-- 
http://www.cfcl.com/rdm            Rich Morin
http://www.cfcl.com/rdm/resume     rdm@cfcl.com
http://www.cfcl.com/rdm/weblog     +1 650-873-7841

Technical editing and writing, programming, and web development