There seems to be agreement that RDBMSs aren't a perfect match for trees and graphs. Even so, they have some strengths which make them worth considering. They can also be combined with other (e.g., version control, knowledge management) systems, but whether this is a good idea is still an open question. OK?

I agree that the version control for the documentation should be coupled to that of the source code. However, problems can emerge as more information sources are brought into play. For example, an overview document might be only partially correct for a given version of the system. Do we omit it, list it with a caveat, or what? How do we track the effects of new versions on this sort of conceptual material?

BT> If a checkin refers to rt/xxxx, that checkin resolves that
BT> ticket.

This depends on your programmers using a recognizable format for ticket numbers. That may not be true for many projects, but if it works for yours, great! However, you've overstated the case a little; a checkin might be related to a ticket, but not resolve it. Also, typos can cause incorrect conclusions. Still, this sort of recognition can be extremely useful; I've employed it in several projects when hard data was unavailable.

BT> Let's think about a revision control system that took this
BT> attitude.

Actually, the "Rich Man's Source Code Control System" (I can't find a citation; it was a talk given at USENIX several years ago) took the approach of saving entire files, replacing duplicates with hard links. It took up more space, to be sure, but access to old versions was really fast!

BT> ... Times 100 files is 1.2 GB of data.

Which would use about a quarter's worth of (ATA) disk space. Even adding in the cost of backups and power, and multiplying by the number and size of projects at SourceForge, this may not be all _that_ unreasonable an idea. In addition, it may be possible to combine compression and indexing (a la mg) in useful ways.

However, these are implementation details. If fast access (of any given type) is important, there are generally ways to get it. Companies such as Sun would have no economic problem with making their entire code base, in multiple versions, fast and convenient to access. They simply have to decide that it is worthwhile to do so...

BT> Both JavaDoc and POD wind up displaying text that
BT> humans explicitly decided to display.

Some systems rely only on human annotations; others examine the structure of the code (etc.) to find relationships. The fact that a given system does not currently do the latter is irrelevant to the question of whether it could (or should).

BT> But all that a Doxygen-like tool is going to see [is] the
BT> existence of the classes and the inheritance relationship
BT> between them. It will know nothing about what those
BT> classes do. ... That is the reason why Doxygen doesn't
BT> support any dynamic languages.

Gathering dynamic information is problematic, because there is no guarantee that your "test jig" will exercise all possible code paths, etc. Even if you observe some behavior, folding its implications back into your model is non-trivial to do in a mechanized fashion.

However, this is no reason to give up on the idea. Dynamic languages often have great support for introspection, which can extract all sorts of structural information as a program is being run. The DTrace facility in Solaris 10 can track arbitrary function and system calls, limiting the results by user, process, etc. This could be used to perform "black box" characterization of file activity, forks and execs, etc. In short, there are lots of ways to collect dynamic information.
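As a toy illustration of what I mean by run-time introspection, here is a minimal sketch, in Python rather than the Perl or Ruby under discussion (all three offer this sort of hook); the helper() and main() functions are invented for the demonstration:

    import sys
    from collections import defaultdict

    calls = defaultdict(set)    # caller -> callees, observed at run time

    def tracer(frame, event, arg):
        # Record an edge each time one function calls another.
        if event == "call" and frame.f_back is not None:
            calls[frame.f_back.f_code.co_name].add(frame.f_code.co_name)
        return tracer

    def helper():
        return 42

    def main():
        return helper()

    sys.settrace(tracer)
    main()
    sys.settrace(None)

    for caller in sorted(calls):
        print(caller, "->", ", ".join(sorted(calls[caller])))

The caller/callee edges recorded this way are exactly the structural facts that a static scan of a dynamic language cannot see.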
BT> Here is an example. TeX solves the problem of how to ...

TeX is a marvelous piece of work. I have used it, in fact, for some mechanized publishing projects. I also use groff, Graphviz, etc. I don't discount the importance of these tools; indeed, I rely heavily on them as infrastructure.

BT> And I found that what I need to know when I first started
BT> on the code is very different than what I need to know
BT> when I have some experience with it.

In an explanatory document, this is an intractable problem. A document that works well for beginners won't be suitable for experts. So, writers create multiple documents, with each one having a limited audience, level, scope, etc.

Generating pages that show different "views" of the data is not that difficult, however. A presentation system can let users control which information is displayed, in what manner, etc. True, this can run into the problems mentioned in the rant, but useful subsets of the problem are quite tractable.

BT> It seems to me that every tool you're talking about adding
BT> has to be aware of the documentation system and vice versa.
BT> A prime consideration is going to have to be how to make it
BT> possible to plug in additional functionality.

Definitely. Everyone agrees that modularity is critical, but nobody can agree on how to achieve it! Trying to deal with special-purpose APIs and protocols can be an enormous time sink. So, I've been looking into the idea of using RDBMS tables as an interchange format:

  Using DBMS tables for inter-application communication
  http://www.cfcl.com/~rdm/weblog/archives/000999.html

BT> You're not asking for a small job, are you?

RM> Again, any "interesting" problem sits on the line between
RM> "trivial" and "impossible".

BT> Figuring out the dependency requires figuring out what
BT> files program A will write to, and what files program B
BT> will read from.

As noted above, DTrace can monitor system calls, extracting this sort of information. The results may be incomplete, because the test case may not exercise all of the code paths, but a partial solution is better than none. And, because the system can accept information from users, the results can be augmented by anyone who has a clue to offer.
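To make that concrete, here is a minimal sketch (in Python, with invented program names and file paths) of how depends-on edges might be distilled from trace records, however they were collected; in the interchange scheme mentioned above, the same records could just as well live in a shared DBMS table:

    from collections import defaultdict

    # Invented records, of the sort a syscall trace could yield:
    # (program, file, mode), where mode is "r" or "w".
    trace = [
        ("progA", "/tmp/data.out", "w"),
        ("progB", "/tmp/data.out", "r"),
        ("progB", "/tmp/report",   "w"),
    ]

    writers = defaultdict(set)
    readers = defaultdict(set)
    for prog, path, mode in trace:
        (writers if mode == "w" else readers)[path].add(prog)

    # B depends on A whenever B reads a file that A writes.
    for path, progs in writers.items():
        for a in progs:
            for b in readers.get(path, ()):
                print("%s depends on %s (via %s)" % (b, a, path))

Records contributed by humans would simply be appended to the same list (or table).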
BT> If it is too annoying for developers, it is going to get
BT> ignored.

Or disabled. But developers should have a way to check for completeness of documentation, just as they can ask for all sorts of compiler warnings, etc.

BT> Unless you can offer the developer compelling benefits
BT> for keeping it up to date.

There are various ways to motivate developers, ranging from egoboo and professionalism to good reviews and pay increases. In an Open Source environment, the first two may suffice. In some projects, excellent documentation is produced for these reasons. If it were easier, perhaps more would be...

BT> A proprietary enterprise also has the ability to tell
BT> developers that they absolutely must spend time on
BT> documentation, making those bits more likely to get done.

You have apparently been working in different companies than I have (:-). My experience is that most companies talk about wanting documentation, but mostly reward programmers for code.

BT> In an open source project, developers document for two
BT> main reasons. The first is so that people will use their
BT> project. The second is to avoid the annoyance of constantly
BT> having to answer the same questions over and over again.

Agreed, modulo the other motives noted above.

BT> Providing rich documentation in exactly the way that users
BT> want to see it is somewhat lower priority than feeding the
BT> cat - unless it is made trivial, it won't happen.

Well, there is no guarantee that documentation can be provided "in exactly the way that users want to see it". That's your straw man, not mine. However, it may be possible to improve on the current situation, without great burden on the programmers.

BT> Your ... project is turning into a magic morphing ...

There are many ways to collect and display information. These may involve static or dynamic information. A general solution should be able to add new inputs and outputs without great pain. This is not to say that I know how to fulfill the expectations described in the rant; I don't. But a general and extensible documentation system is not an unrealistic goal.

BT> I hope you're not planning to build it all at once.

I've been working on these issues, off and on, for two decades. When I first started looking at the problem, everything would have needed to be invented, coded from scratch, etc. Now, it's mostly a matter of finding suitable technologies and plugging them together. I've already implemented some special-purpose variations on these themes. Given time, I'll get there...

BT> Good luck making it work for Perl or Ruby.

As noted above, dynamic languages present some real problems, but have introspective abilities that may offer solutions. In any case, many systems are composed of multiple programs and files. Even a model at that level of granularity is useful.

BT> If you think you can do it, I'd suggest ignoring me and
BT> going ahead to implement it. But my doubts still remain.

Doubts are fine, as is constructive criticism. Although I'm pretty confident that I can do most of what I claim, I have less confidence that I'm solving the right problem in the optimal fashion. But then, that's why they call it research.

BT> BTW, if you're going to put down a wishlist, let me add an
BT> item. A useful document to have and maintain is an index
BT> of every email sent to every user, with an annotation
BT> saying what it is, who gets it, and where it is sent from.
BT> I built one of those at work and got people to maintain it,
BT> and we get a lot of value from that.

Privacy and legal considerations aside, I like the idea. I've got years' worth of saved email online; it's incredibly useful for finding email addresses, looking up topics, etc. BTW, I'm also a big fan of HistoryHound, which indexes my visited URLs and bookmarks.

If all of the email is online (either as files or on the web), a program such as SWISH-E will provide indexing and full-text search. It can even be tweaked to subset by sender, date, etc. However, it won't handle threads and other relationships; it's just a search tool, after all. If I were going to add email to a documentation system, I would look for entities, attributes, and relationships that are worth tracking and presenting.
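For instance, thread relationships are already latent in the mail headers. Here is a minimal sketch in Python (archive.mbox is a hypothetical path) that harvests Message-ID / In-Reply-To pairs and rebuilds a reply chain from them:

    import mailbox

    replies = {}    # Message-ID -> In-Reply-To (None for thread roots)
    for msg in mailbox.mbox("archive.mbox"):    # hypothetical path
        if msg["Message-ID"]:
            replies[msg["Message-ID"]] = msg["In-Reply-To"]

    def thread_of(msg_id):
        # Walk ancestors to rebuild one reply chain; a robust version
        # would guard against malformed (cyclic) header chains.
        chain = [msg_id]
        while replies.get(chain[-1]):
            chain.append(replies[chain[-1]])
        return chain

Sender, date, and recipient attributes could be harvested the same way, giving the documentation system relationships to track and present, not just text to search.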
RM> The catch is that different people will use different types
RM> for the same thing.

That's one of the points that Clay Shirky makes in debunking the Semantic Web. And, from what I see in assorted literature and email lists (e.g., Common Logic, Conceptual Graphs, Protege, Topic Maps), I'd agree that defining ontologies is quite challenging. Although I wouldn't expect typical Wikipedia contributors to be able to define good categories, they should be able to pick the right ones to use, much of the time. And, if they get it wrong, wikis are easy to update (:-).

BT> Well you've ignored my top priority for documentation. ...
BT> a guarantee that I'm seeing documentation for the version
BT> of the program that I'm using.

I wasn't really ignoring it, but you can rest assured that I will give it higher priority from now on!

-r
-- 
http://www.cfcl.com/rdm            Rich Morin
http://www.cfcl.com/rdm/resume     rdm@cfcl.com
http://www.cfcl.com/rdm/weblog     +1 650-873-7841

Technical editing and writing, programming, and web development