Subject: Re: mechanised documentation and my business model solution
From: "Ben Tilly" <btilly@gmail.com>
Date: Mon, 27 Mar 2006 11:02:31 -0800

On 3/26/06, Rich Morin <rdm@cfcl.com> wrote:
> At 10:33 PM -0800 3/25/06, Ben Tilly wrote:
> > On 3/25/06, Rich Morin <rdm@cfcl.com> wrote:
> >> At 6:55 PM -0800 3/25/06, Ben Tilly wrote:
> > I didn't use wikis as an example.  I used source control
> > systems, of which there are many, and which generally
> > support versioning pretty well.
>
> Versioning is central to the task of source code control.
> So, these systems spend quite a bit of effort on the issue.
> Most wikis do not, but they may borrow some functionality.

Yes, they do.  And the people who've actually written them have
realized that a relational database system does NOT offer much help
to someone who wants to build one of these, so they don't use
relational database systems.

> For example, TWiki (a file-based wiki) keeps its files in
> RCS.  This solves part of the versioning problem, but not
> all of it.  CVS or Subversion would certainly solve other
> issues, if either could be shoe-horned in.  In any case,
> we've exceeded both my interest and expertise.

If you want to succeed in your goal, I'd suggest focusing your
interest and improving your expertise.  Versioning is a hard problem.
And it is critical to how software developers work.  If you think that
you want documentation to be versioned, I can guarantee that you'll
encounter lots of developers who want it to be versioned along with
their software.

Heck, at $work we have a bug-tracking/request ticket system that we
hook into source control in the stupidest possible way.  (If a checkin
refers to rt/xxxx, that checkin resolves that ticket.)  And we get an
immense amount of value out of being able to tie together those two,
even in as simple a way as that.
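
And when I say stupid, I mean it is no smarter than this sketch (not
our actual hook; resolve_ticket() is a hypothetical stand-in for
whatever API the ticket system offers):

  #!/usr/bin/perl
  # Scan a checkin message for "rt/NNNN" and resolve each ticket it
  # mentions.  Meant to run as an SCM post-commit hook with the log
  # message on STDIN.
  use strict;
  use warnings;

  sub resolve_ticket {
      my ($id, $msg) = @_;
      # Hypothetical stand-in: the real version calls the ticket system.
      print "Resolving ticket rt/$id\n";
  }

  my $log_message = do { local $/; <STDIN> };   # slurp the log message
  my %seen;
  for my $id ($log_message =~ m{\brt/(\d+)\b}g) {
      resolve_ticket($id, $log_message) unless $seen{$id}++;
  }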

> > * Tree-like data structures don't play well with a RDBMS.
>
> Trees (and, more to the point, graphs) are hard to map onto
> RDBMSs and SQL.  FWIW, Joe Celko has a book on the topic:
>
>   "Joe Celko's Trees and Hierarchies in SQL for Smarties"
>   Morgan Kaufmann, 2004, ISBN 1-55860-920-2

I know that.  I've read his book.  I already gave you the condensed
version, which was, "Tree-like data structures don't play well with a
RDBMS."  If you need to make them play together, there are a number of
ways to do so, each with different tradeoffs.  In particular, since
people read trees far more often than they write them, he recommends
a strategy that makes reads cheap but writes expensive.
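
(If memory serves, that read-cheap strategy is his "nested set" model,
where each node stores left/right boundary numbers.  A minimal sketch
via DBI, with a made-up table of (name, lft, rgt) columns; the point
is that a subtree is one query, while an insert renumbers a chunk of
the table:)

  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('dbi:SQLite:dbname=tree.db', '', '',
                         { RaiseError => 1 });

  # Read (cheap): every descendant of a node, in a single query.
  my $subtree = $dbh->selectall_arrayref(
      'SELECT c.name FROM nodes c, nodes p
        WHERE p.name = ? AND c.lft BETWEEN p.lft AND p.rgt',
      undef, 'some_node');

  # Write (expensive): adding a rightmost child of 'some_node' shifts
  # every boundary at or beyond the parent's right edge.
  my ($rgt) = $dbh->selectrow_array(
      'SELECT rgt FROM nodes WHERE name = ?', undef, 'some_node');
  $dbh->do('UPDATE nodes SET rgt = rgt + 2 WHERE rgt >= ?', undef, $rgt);
  $dbh->do('UPDATE nodes SET lft = lft + 2 WHERE lft > ?',  undef, $rgt);
  $dbh->do('INSERT INTO nodes (name, lft, rgt) VALUES (?, ?, ?)',
           undef, 'new_child', $rgt, $rgt + 1);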

> However, the strengths of this approach are sufficient that
> many graph-based systems (eg, Protege) are based on RDBMSs,
> though not always in a way that makes DBMS experts happy.

I'm not saying that RDBMSs are a bad tool.  Despite the disadvantages,
they have lots of things to recommend them.  However they aren't
well-suited to addressing versioning.

You may choose to use one despite my criticism, but I'm saying that,
"makes versioning simple" should not be listed as one of the big
benefits.

> > * RDBMS are usually not set up to efficiently store many
> >   versions of the same document.
>
> Given the cost of current and upcoming storage media, great
> storage inefficiencies may be quite acceptable.  Also, note
> that systems such as RCS spend CPU cycles (and the user's
> time) to gain storage efficiency.  This may not be the best
> trade-off, particularly when searching is involved.

Let's think about a revision control system that took this attitude. 
Suppose we have a moderate-sized project that several developers have
been working on for a couple of years.  Say 100,000 lines, broken up
into 100 files of average size 1000 lines.  Each file has been through
200 versions.  The average line is 60 characters long.  That makes
each file about 60K in size, times 200 is 12 MB per file.  Times 100
files is 1.2 GB of data.  Actually it will be larger than that,
because the largest files will also have the most revisions.

This is starting to add up.
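
(If you want to play with the assumptions, the back-of-the-envelope
arithmetic is just:)

  # 100 files x 200 versions x (1000 lines x 60 chars) per version
  my $files    = 100;
  my $versions = 200;
  my $lines    = 1_000;     # average lines per file
  my $chars    = 60;        # average line length

  my $bytes = $files * $versions * $lines * $chars;
  printf "%.1f GB\n", $bytes / 1e9;    # prints "1.2 GB"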

Now consider how much work it will be to do the equivalent of "cvs
annotate" or "svn blame".

Now consider that the kind of stuff that you're talking about is more
useful for large projects than small.

Now consider the needs of a shared hosting facility like SourceForge.

Moore's Law is great and all, but it doesn't absolve developers of the
need for common sense.

> > Several of the wins that you list also seem to me to be
> > wins for different kinds of applications than wikis.  For
> > instance, wikis are pretty much all text, so consistent
> > keys, data types, etc aren't a big factor.
>
> Current wikis make little use of connectivity information,
> link typing, etc.  Semantic Wikis could change this story
> substantially.  In any case, it's nice that RDBMSs are able
> to handle these things reliably and efficiently.

I don't know enough about semantic wikis to have an opinion on that.

> >> ... my aspirations were never that lofty.
> >
> > They seemed pretty close to me.  The fundamental ...
>
> I'm quite a fan of "documentation generators" (eg, Doxygen,
> JavaDoc, POD).  Knuth's Literate Programming is nifty, in
> principle, but appears to be too laborious for most coders.
>
> But if the only benefit of these systems were to cause the
> programmers to write comments "in the code", there would be
> no reason to have an automated extraction system.  Just put
> in the comments and be done with it!

As a developer I want to see different kinds of comments "in the code"
and in an external document.

> The real win of these systems, IMNSHO, is the integration of
> human-generated text with machine-harvested data.  I can go
> to a Doxygen web page and see the text, supplemented by all
> sorts of links, tooltips, and relationship diagrams.  This
> lets me "browse" the code at a high level, dropping down to
> the source code when desired.

That isn't a win for 2 of the 3 examples that you named.  Both JavaDoc
and POD wind up displaying text that humans explicitly decided to
display.

Doxygen is different.  It does extract stuff in an automated way from
the source code.  Which is neat, but it imposes limitations on the
source language.  For instance, consider the following example from
Class::DBI:

  package Music::DBI;
  use base 'Class::DBI';
  Music::DBI->connection('dbi:mysql:dbname', 'username', 'password');

  package Music::Artist;
  use base 'Music::DBI';
  Music::Artist->table('artist');
  Music::Artist->columns(All => qw/artistid name/);
  Music::Artist->has_many(cds => 'Music::CD');

  package Music::CD;
  use base 'Music::DBI';
  Music::CD->table('cd');
  Music::CD->columns(All => qw/cdid artist title year reldate/);
  Music::CD->has_many(tracks => 'Music::Track');
  Music::CD->has_a(artist => 'Music::Artist');
  Music::CD->has_a(reldate => 'Time::Piece',
    inflate => sub { Time::Piece->strptime(shift, "%Y-%m-%d") },
    deflate => 'ymd',
  );

What does this do?  Well, it has created three classes (and presumably
you're going to create more soon) with associated constructors,
accessors, etc, that are going to be backed by some tables in a given
database.  But all that a Doxygen-like tool is going to see is the
existence of the classes and the inheritance relationships between
them.  It will know nothing about what those classes do.  (Unless it
is taught about Class::DBI, which means that it still won't understand
the homegrown ORM that we developed at $work before Class::DBI was
around.  And it won't handle smarter ORMs that autogenerate most of
what you see above by peeking at the database.)

That is the reason why Doxygen doesn't support any dynamic languages.

Given the choice between a whiz-bang documentation system and some
facilities for metaprogramming like this, I'm going to choose
metaprogramming.  YMMV.

> > However such systems fall into two general categories.
> > The first are clear improvements that you take for granted
> > and use as a base for your expectations and desires.
> > (Consider the lowly FAQ.)
>
> You say that like it's a bad thing (:-).

No, it is not a bad thing.  I'm just addressing the fact that other
people have produced better documentation systems in the past.

> > The second are ones that solve problems that someone else
> > thought was important, but you don't.
>
> I don't understand this; please explain.

Here is an example.  TeX solves the problem of how to precisely define
documents (where "precisely" means "down to the wavelength of visible
light") in a format that will remain stable and useable for long
enough to be useful for archival purposes.  (A TeX document written 20
years ago still defines the exact same document today.  How many other
formats can boast that?)  This is a much better documentation system
for those purposes than any that came out before or since.

These goals are not critical for you.  So you don't see TeX as being
massively better than, say, Microsoft Word or PostScript.

> > I'd find this vision far more plausible if you had some
> > clearly laid out use cases ...
>
> Good question.  Let's assume that our user is a developer
> who is diving into an unfamiliar body of code, possibly to
> debug it, make changes, or perform system integration.

Been there.  Done that.

> As noted above, documentation generators integrate text and
> data, presenting a combination of overview material, detailed
> information, relationship diagrams, etc.  This can be quite
> useful in the situations described above.

And I found that what I needed to know when I first started on the
code is very different from what I need to know once I have some
experience with it.

> However, existing documentation generators are quite limited.
> They only deal with the kinds of entities and relationships
> that can be found in the source code.  So, for example, they
> ignore topics such as access control, bug reports, data flow,
> test results, etc.  They also ignore existing documentation,
> such as man pages.  As a result, they cannot support browsing
> over the full range of topics that a developer might desire.

It seems to me that every tool you're talking about adding has to be
aware of the documentation system and vice versa.  A prime
consideration is going to have to be how to make it possible to plug in
additional functionality.

> Another use case might involve a system administrator who is
> trying to resolve a configuration problem.  S/he has little or
> no interest in the source code, but access control and data
> flow may be quite relevant.  So, it should be easy to find out
> which files are part of a subsystem, which files are typically
> accessed (in what manner) by which programs and users, etc.

You're not asking for a small job, are you?  In most batch processing
there are a lot of cases where program A deposits data somewhere and
program B picks it up.  Figuring out the dependency requires figuring
out what files program A will write to, and what files program B will
read from.  In the general case that's the halting problem.  That or
else you're going to have to trace the program execution.

> Although the documentation system can't discern the reason why
> a file gets read, it can easily detect the absence of any text
> on the subject.  And, if it makes this apparent to developers
> (and convenient to remedy), something might get done about it.

If it is too annoying for developers, it is going to get ignored. 
Unless you can offer the developer compelling benefits for keeping it
up to date.

> Even if the developers fail to provide this information, all
> is not lost.  The user can post a question on the page (it's a
> wiki, after all), causing a notification to be sent out to any
> "interested" parties.  Finally, the user can add comments to
> the documentation, easily and with no administrative time lag.

Your vision looks more applicable to a proprietary enterprise than an
open source project.  A proprietary enterprise also has the ability to
tell developers that they absolutely must spend time on documentation,
making those bits more likely to get done.

In an open source project, developers document for two main reasons. 
The first is so that people will use their project.  The second is to
avoid the annoyance of constantly having to answer the same questions
over and over again.  Providing rich documentation in exactly the way
that users want to see it is somewhat lower priority than feeding the
cat - unless it is made trivial, it won't happen.

> In a final use case, an operator might want a "dashboard" that
> displays summary information on the OS (eg, how much CPU time
> and memory each program has used over the course of the day).
> Something like the "metrics" application described in the rant,
> but with a much simpler set of constraints.

Your documentation project is turning into a magic morphing beast
again that sings, plows fields, dances, does fine sewing, and makes
dinner at the same time.

I hope you're not planning to build it all at once.

[...]
> > A few messages ago you summarized what you want as a system
> > where the developer adds a few comments for the documentation
> > system, everything gets tied together into a model, and then
> > great documentation is produced.
>
> Well, "few" and "great" are your terms.  What I would say is that
> a documentation system can track instances of the entities and
> relationships it "knows about".  It can combine this data with
> any available annotations, generating a page for each instance.

They are my summary of how your description came across to me.

> > I look at that vision and I think that programmers are going
> > to have to learn a new (and in the end probably fairly large)
> > meta language to figure out how the meta language works, ...
>
> The Doxygen developers need to understand the vagaries of data
> structures and functions, but a site administrator does not.
> Because the "model" is embodied in the Doxygen program, only
> the input (eg, source code directories) needs to be specified.
> This can take some time, but it isn't all that difficult.

Good luck making it work for Perl or Ruby.

> In my proposed system, the administrator might have more things
> to configure (depending on how smart my installation code is :),
> but s/he shouldn't need a deep understanding of the model.
>
> So, although model development uses a combination of knowledge
> engineering and domain expertise (and a meta-language, as you
> suspect :-), simply installing and using the system shouldn't
> require any of this.

If you think you can do it, I'd suggest ignoring me and going ahead to
implement it.  But my doubts still remain.

[...]
> > ... the gap between the different kinds of documentation that
> > different audiences need does not seem to me to be possible
> > to bridge in any automated kind of way.
>
> Well, there is certainly some truth to that.  A novice may need
> a simplified tutorial, while an expert will want great detail.
> Writing for diverse audiences is challenging for a human writer;
> asking a program to generate this sort of text is AI-complete.

Exactly.  That is why I was suggesting use cases to help me understand
exactly what kind of documentation people would get for what purposes.

BTW if you're going to put down a wishlist, let me add an item.  A
useful document to have and maintain is an index of every email sent
to every user, with an annotation saying what it is, who gets it, and
where it is sent from.  I built one of those at work and got people to
maintain it, and we get a lot of value from that.

> However, this isn't what I'm proposing to do.  Rather, my goal
> is to let the user inspect some entity (eg, document, file,
> program) and find out what other entities are related to it, in
> what ways.  By letting the user follow "interesting" links (and
> perhaps adjust which links are shown), the development system
> can present a "view" of the target system that has the desired
> amount of depth, detail, etc.

Thinking about $work, the only people who could use that are pretty
much the programmers.

[...]
> > The simpler any system is to use, the more likely people are
> > to use it.  The more process that you require from people
> > adding to the system, the less likely people are to [add to]
> > it.
>
> Violent agreement!  Neither the user nor the programmer should
> need to know how the documentation system works.  Programmers
> just (POD-style) add comments to the code.  Users just browse
> pages and (wiki-style) add comments and questions.

> > That's the genius of a wiki - it hits a sweet spot where
> > it provides just enough to be useful, but simply enough
> > that people really use it.
>
> I've been following the literature on Semantic Wikis with a
> great deal of interest.  One of the points that gets made a
> lot is that contributors to (say) Wikipedia aren't going to
> be willing to deal with formal ontologies, topic maps, etc.

I haven't followed that literature, but I'd make the same point.

> So, the typical approach is to add a single "type" attribute
> to the link syntax.  Given that we already know the current
> and destination pages, this completes an RDF triple.  It then
> becomes the job of the wiki (and follow-on) software to make
> sense out of the resulting graph, use it for navigation, etc.
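
Concretely, I take it you mean something like this, where the current
page is the subject, the link type is the predicate, and the link
target is the object (the [[type::target]] syntax below is just my
assumed illustration, not any particular wiki's):

  use strict;
  use warnings;

  my $current_page = 'SomePage';   # the page being rendered (subject)
  my $wiki_text    = 'This tool [[depends_on::libfoo]] at runtime.';

  # Each typed link completes one RDF-ish triple.
  while ($wiki_text =~ m{\[\[(\w+)::([\w ]+)\]\]}g) {
      my ($type, $target) = ($1, $2);
      print "($current_page, $type, $target)\n";
  }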

The catch is that different people will use different types for the same thing.

I remember when I worked in finance.  Any given piece of money went by
several different names, and any given name meant several different
things.  For instance, the coupon on a bond is the same as the interest
on the loans underlying the bond.  The average life of a bond could be
any of a number of different figures, depending on which average you
were talking about.

I suspect that every large system is like that.  (Certainly every one
that I've looked at is.)

> Although I think this is a fine plan for Wikipedia, I don't
> think it is an optimal solution for mechanized documentation.
> Whereas Wikipedia's subject area is broad and chaotic, the
> systems I'm talking about documenting are (comparatively :-)
> constrained and well-organized.

Constrained I'll agree with.  The quality of the organization... often
leaves something to be desired. ;-)

[...]
> However, because it IS a wiki, any user can create new pages
> that comment on and/or link to the generated pages, etc.
> This may well produce a "sweet spot" of its own...
>
> In closing, I should emphasize that the preceding only gives
> MY notions about how to do mechanized documentation.  Even if
> I'm off track on the specifics, ANY general and useful set of
> tools for mechanized documentation would benefit vendors who
> want their source code to be readily understood.

Well, you've ignored my top priority for documentation, which is that
when I use documentation, I want a guarantee that I'm seeing the
documentation for the version of the program that I'm using.  This
doesn't matter within a proprietary enterprise.  But it does matter
for every piece of software that I use.

Cheers,
Ben