Subject: Re: mechanised documentation and my business model solution
From: Rich Morin <rdm@cfcl.com>
Date: Mon, 27 Mar 2006 17:21:46 -0800

There seems to be agreement that RDBMSs aren't a perfect match
for trees and graphs.  Even so, they have some strengths which
make them worth considering.  They can also be combined with
other (eg, version control, knowledge management) systems, but
whether this is a good idea is still an open question.  OK?


I agree that the version control for the documentation should
be coupled to that of the source code.  However, problems can
emerge as more information sources are brought into play.  For
example, an overview document might be only partially correct
for a given version of the system.  Do we omit it, list it with
a caveat, or what?  How do we track the effects of new versions
on this sort of conceptual material?


BT> If a checkin refers to rt/xxxx, that checkin resolves that
BT> ticket.

This depends on your programmers using a recognizable format
for ticket numbers.  Many projects have no such convention,
but if it works for yours, great!  However, you've overstated
the case a little; a checkin might be related to a ticket, but
not resolve it.  Also, typos can cause incorrect conclusions.
Still, this sort of recognition can be extremely useful; I've
employed it in several projects when hard data was unavailable.
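
By way of illustration, here is a rough Ruby sketch of the
sort of recognition I mean.  The "rt/NNNN" format and the
resolving keywords are assumptions; adjust both to suit the
project at hand:

  # Scan a checkin message for RT-style ticket references.
  # A message may merely *mention* a ticket, so only call it
  # resolved when a resolving keyword appears as well.
  TICKET_RE  = %r{\brt/(\d+)\b}i
  RESOLVE_RE = /\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\b/i

  def ticket_refs(message)
    status = message =~ RESOLVE_RE ? :resolves : :refers_to
    message.scan(TICKET_RE).flatten.map { |id| [id, status] }
  end

  ticket_refs("Fixes rt/1234")    #=> [["1234", :resolves]]
  ticket_refs("See also rt/987")  #=> [["987", :refers_to]]

Note that this classifies the whole message, so a checkin
that fixes one ticket and merely mentions another will
mislabel the second; real heuristics need to be more careful.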


BT> Let's think about a revision control system that took this
BT> attitude.

Actually, the "Rich Man's Source Code Control System" (?? -
I can't find a citation; it was a talk given at USENIX several
years ago) took the approach of saving entire files, replacing
duplicates with hard links.  It took up more space, to be sure,
but access to old versions was really fast!
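
The core trick is tiny.  Here's a Ruby sketch (the names are
mine, not those from the talk): store each revision as a
complete file, but key the content by digest, so identical
revisions share a single copy via a hard link:

  require 'digest/sha1'
  require 'fileutils'

  # Save a full copy of 'src' under 'repo', deduplicated by
  # content digest.  An unchanged file costs one directory
  # entry (a hard link), not another copy of the data.
  def save_version(src, repo)
    blob = File.join(repo, Digest::SHA1.file(src).hexdigest)
    FileUtils.cp(src, blob) unless File.exist?(blob)
    File.link(blob, File.join(repo,
      "#{File.basename(src)}.#{Time.now.to_i}"))
  end

Old versions are then plain files; there are no deltas to
apply, so reads go at filesystem speed.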


BT> ... Times 100 files is 1.2 GB of data.

Which would use about a quarter's worth of (ATA) disk space.
Even adding in the cost of backups and power, and multiplying
by the number and size of projects at SourceForge, this may
not be all _that_ unreasonable an idea.  In addition, it may
be possible to combine compression and indexing (a la mg) in
useful ways.  However, these are implementation details.

If fast access (of any given type) is important, there are
generally ways to get it.  Companies such as Sun would have
no economic problem with making their entire code base, in
multiple versions, fast and convenient to access.  They
simply have to decide that it is worthwhile to do so...


BT> Both JavaDoc and POD wind up displaying text that
BT> humans explicitly decided to display.

Some systems rely only on human annotations; others examine
the structure of the code (etc) to find relationships.  The
fact that a given system does not currently do the latter is
irrelevant to the question of whether it could (or should).


BT> But all that a Doxygen-like tool is going to see [is] the
BT> existence of the classes and the inheritance relationship
BT> between them.  It will know nothing about what those
BT> classes do. ... That is the reason why Doxygen doesn't
BT> support any dynamic languages.

Gathering dynamic information is problematic, because there is
no guarantee that your "test jig" will exercise all possible
code paths, etc.  Even if you observe some behavior, folding
its implications back into your model is non-trivial to do in
a mechanized fashion.

However, this is no reason to give up on the idea.  Dynamic
languages often have great support for introspection, which
can extract all sorts of structural information as a program
is being run.  The DTrace facility in Solaris 10 can track
arbitrary function and system calls, limiting the results by
user, process, etc.  This could be used to perform "black box"
characterization of file activity, forks and execs, etc.  In
short, there are lots of ways to collect dynamic information.
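
For instance, here is a minimal Ruby sketch (nothing like a
full tracer) that uses set_trace_func to record which methods
actually get called while a program runs:

  # Count method calls as the program executes; a real tool
  # would fold these observations into a call graph.
  calls = Hash.new(0)
  set_trace_func proc { |event, file, line, id, binding, klass|
    calls[[klass, id]] += 1 if event == 'call'
  }

  # ... exercise the code under test here ...

  set_trace_func(nil)   # stop tracing
  calls.each { |(klass, id), n| puts "#{klass}##{id}: #{n}" }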


BT> Here is an example.  TeX solves the problem of how to ...

TeX is a marvelous piece of work.  I have used it, in fact,
for some mechanized publishing projects.  I also use groff,
Graphviz, etc.  I don't discount the importance of these tools;
indeed, I rely heavily on them as infrastructure.


BT> And I found that what I need to know when I first started
BT> on the code is very different than what I need to know
BT> when I have some experience with it.

Within a single explanatory document, this is an intractable
problem.  A document that works well for beginners won't be
suitable for experts.  So, writers create multiple documents,
each with a limited audience, level, scope, etc.

Generating pages that show different "views" of the data is not
that difficult, however.  A presentation system can let users
control which information is displayed, in what manner, etc.
True, this can run into the problems mentioned in the rant, but
useful subsets of the problem are quite tractable.
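
A toy Ruby sketch of the idea (the metadata fields are
invented for illustration): tag each chunk of documentation
with an audience level, and let the presentation layer filter:

  # One body of documentation, several views: each node
  # carries metadata, and a view is just a filter.
  Node = Struct.new(:title, :level, :body)

  docs = [
    Node.new("Getting Started", :beginner, "..."),
    Node.new("Internals",       :expert,   "..."),
  ]

  def view(docs, level)
    docs.select { |n| n.level == level }
  end

  view(docs, :beginner).each { |n| puts n.title }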


BT> It seems to me that every tool you're talking about adding
BT> has to be aware of the documentation system and vice versa.
BT> A prime consideration is going to have to be how to make it
BT> possible to plug in additional functionality.

Definitely.  Everyone agrees that modularity is critical, but
nobody can agree on how to achieve it!  Trying to deal with
special-purpose APIs and protocols can be an enormous time sink.
So, I've been looking into the idea of using RDBMS tables as an
interchange format:

  Using DBMS tables for inter-application communication
  http://www.cfcl.com/~rdm/weblog/archives/000999.html
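
As a concrete (if simplified) sketch, here is what such an
interchange might look like in Ruby with SQLite; the table
layout is only an example:

  require 'sqlite3'

  db = SQLite3::Database.new("interchange.db")
  db.execute <<-SQL
    CREATE TABLE IF NOT EXISTS facts (
      subject   TEXT,
      predicate TEXT,
      object    TEXT
    )
  SQL

  # A producer (say, a code scanner) asserts a relationship...
  db.execute("INSERT INTO facts VALUES (?, ?, ?)",
             ["foo.rb", "defines", "Foo#bar"])

  # ... and any consumer (say, a page generator) can read it,
  # with no special-purpose API or wire protocol in between.
  db.execute("SELECT * FROM facts " +
             "WHERE predicate = 'defines'") do |row|
    p row
  end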


BT> You're not asking for a small job, are you?

RM> Again, any "interesting" problem sits on the line between
RM> "trivial" and "impossible".


BT> Figuring out the dependency requires figuring out what
BT> files program A will write to, and what files program B
BT> will read from.

As noted above, DTrace can monitor system calls, extracting
this sort of information.  The results may be incomplete,
because the test case may not exercise all of the code paths,
but a partial solution is better than none.  And, because the
system can accept information from users, the results can be
augmented by anyone who has a clue to offer.
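
Once the raw observations are in hand, the inference itself
is simple.  A Ruby sketch, with made-up tuples standing in
for real trace output:

  # Infer inter-program dependencies from observed file
  # access: if B reads a file that A writes, B (probably)
  # depends on A.
  accesses = [
    ["progA", :write, "/tmp/data.out"],
    ["progB", :read,  "/tmp/data.out"],
  ]

  writers = Hash.new { |h, k| h[k] = [] }
  accesses.each do |prog, mode, path|
    writers[path] << prog if mode == :write
  end

  accesses.each do |prog, mode, path|
    next unless mode == :read
    writers[path].each do |w|
      puts "#{prog} depends on #{w} (#{path})"
    end
  end
  #=> progB depends on progA (/tmp/data.out)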


BT> If it is too annoying for developers, it is going to get
BT> ignored.

Or disabled.  But developers should have a way to check for
completeness of documentation, just as they can ask for all
sorts of compiler warnings, etc.
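
Such a check needn't be fancy.  A crude Ruby sketch (a
heuristic, not a parser; real tools do better):

  # Flag method definitions that lack a comment on the line
  # above: a "doc coverage" warning pass, of sorts.
  ARGV.each do |path|
    lines = File.readlines(path)
    lines.each_with_index do |line, i|
      next unless line =~ /^\s*def\s+(\w+)/
      name = $1
      unless i > 0 && lines[i - 1] =~ /^\s*#/
        puts "#{path}:#{i + 1}: undocumented method #{name}"
      end
    end
  end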


BT> Unless you can offer the developer compelling benefits
BT> for keeping it up to date.

There are various ways to motivate developers, ranging from
egoboo and professionalism to good reviews and pay increases.
In an Open Source environment, the first two may suffice.  In
some projects, excellent documentation is produced for these
reasons.  If it were easier, perhaps more would be...


BT> A proprietary enterprise also has the ability to tell
BT> developers that they absolutely must spend time on
BT> documentation, making those bits more likely to get done.

You have apparently been working in different companies than
I have (:-).  My experience is that most companies talk about
wanting documentation, but mostly reward programmers for code.


BT> In an open source project, developers document for two
BT> main reasons.  The first is so that people will use their
BT> project.  The second is to avoid the annoyance of constantly
BT> having to answer the same questions over and over again.

Agreed, modulo the other motives noted above.


BT> Providing rich documentation in exactly the way that users
BT> want to see it is somewhat lower priority than feeding the
BT> cat - unless it is made trivial, it won't happen.

Well, there is no guarantee that documentation can be provided
"in exactly the way that users want to see it".  That's your
straw man, not mine.  However, it may be possible to improve on
the current situation, without great burden on the programmers.


BT> Your ... project is turning into a magic morphing ...

There are many ways to collect and display information.  These
may involve static or dynamic information.  A general solution
should be able to add new inputs and outputs without great pain.
This is not to say that I know how to fulfill the expectations
described in the rant; I don't.  But a general and extensible
documentation system is not an unrealistic goal.


BT> I hope you're not planning to build it all at once.

I've been working on these issues, off and on, for two decades.
When I first started looking at the problem, everything would
have needed to be invented, coded from scratch, etc.  Now, it's
mostly a matter of finding suitable technologies and plugging
them together.  I've already implemented some special-purpose
variations on these themes.  Given time, I'll get there...


BT> Good luck making it work for Perl or Ruby.

As noted above, dynamic languages present some real problems,
but have introspective abilities that may offer solutions.  In
any case, many systems are composed of multiple programs and
files.  Even a model at that level of granularity is useful.


BT> If you think you can do it, I'd suggest ignoring me and
BT> going ahead to implement it.  But my doubts still remain.

Doubts are fine, as is constructive criticism.  Although I'm
pretty confident that I can do most of what I claim, I have
less confidence that I'm solving the right problem in the
optimal fashion.  But then, that's why they call it research.


BT> BTW, if you're going to put down a wishlist, let me add an
BT> item.  A useful document to have and maintain is an index
BT> of every email sent to every user, with an annotation
BT> saying what it is, who gets it, and where it is sent from.
BT> I built one of those at work and got people to maintain it,
BT> and we get a lot of value from that.

Privacy and legal considerations aside, I like the idea.  I've
got years worth of saved email online; it's incredibly useful
for finding email addresses, looking up topics, etc.  BTW, I'm
also a big fan of HistoryHound, which indexes my visited URLs
and bookmarks.


If all of the email is online (either as files or on the web),
a program such as SWISH-E will provide indexing and full-text
search.  It can even be tweaked to subset by sender, date, etc.

However, it won't handle threads and other relationships; it's
just a search tool, after all.  If I were going to add email to
a documentation system, I would look for entities, attributes,
and relationships that are worth tracking and presenting.
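
Threading, for one, falls out of the Message-ID and
In-Reply-To headers.  A rough Ruby sketch (References
handling and malformed mail are ignored):

  # Build a parent -> children map from RFC 2822 headers,
  # given one message per file on the command line.
  children = Hash.new { |h, k| h[k] = [] }

  ARGV.each do |path|
    head   = File.read(path).split(/\n\n/, 2).first
    id     = head[/^Message-ID:\s*(<[^>]+>)/i, 1]
    parent = head[/^In-Reply-To:\s*(<[^>]+>)/i, 1]
    children[parent] << id if id && parent
  end

  children.each do |parent, kids|
    puts "#{parent} -> #{kids.join(', ')}"
  end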


RM> The catch is that different people will use different types
RM> for the same thing.

That's one of the points that Clay Shirky makes in debunking the
Semantic Web.  And, from what I see in assorted literature and
email lists (eg, Common Logic, Conceptual Graphs, Protege, Topic
Maps), I'd agree that defining ontologies is quite challenging.

Although I wouldn't expect the typical Wikipedia contributor to
be able to define good categories, they should be able to pick
the right ones to use, much of the time.  And, if they get it
wrong, wikis are easy to update (:-).

BT> Well you've ignored my top priority for documentation.  ...
BT> a guarantee that I'm seeing documentation for the version
BT> of the program that I'm using.

I wasn't really ignoring it, but you can rest assured that I
will give it higher priority from now on!

-r
-- 
http://www.cfcl.com/rdm            Rich Morin
http://www.cfcl.com/rdm/resume     rdm@cfcl.com
http://www.cfcl.com/rdm/weblog     +1 650-873-7841

Technical editing and writing, programming, and web development