Keeping stuff: How to preserve course papers despite technological change



HTML

For nine years, I wrote almost all of my course handouts, exercises, and exams in TeX. Then, in 1995, the Department of Mathematics and Computer Science set up the first World Wide Web server at the College, making it much more convenient for us to publish and distribute documents electronically over the Internet. Like RUNOFF source files and TeX source files, a Web document contains markup directives that tell browsers how to structure and format it, here for on-screen display. The markup notation is HTML, ``hypertext markup language.''

Since Web documents are exchanged and displayed by all kinds of computers, with different operating systems and different browser software, it was necessary from the earliest days of the Web to develop a public, neutral standard for HTML. This standard has now gone through several revisions and reached a stable form, called XHTML. It is promulgated by a group called the World Wide Web Consortium and is, of course, available on the World Wide Web. To ensure that his document is correctly displayed in any environment, an author need only learn and follow this standard in marking up his documents.

The World Wide Web Consortium even makes available an on-line validator, which reads in any HTML source file and checks to make sure that it is syntactically correct. The validator can be run either before or after the document is made publicly accessible on the Web, and if you happen to be curious about how knowledgeable and careful other authors are, you can validate their Web documents as well as your own.
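
To give a rough idea of what a mechanical syntax check looks like, here is a minimal sketch in Python. It is not the Consortium's validator and does much less: it only tests whether an XHTML source parses as well-formed XML, without checking it against the standard. The function name and the two sample documents are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xhtml_source):
    """Return None if the text parses as XML, else the parser's error message.

    This is a well-formedness check only. It does not validate the markup
    against any version of the HTML or XHTML standard.
    """
    try:
        ET.fromstring(xhtml_source)
    except ET.ParseError as err:
        return str(err)
    return None


# Hypothetical sample documents: the second never closes its <p> element.
good = "<html><head><title>t</title></head><body><p>Hello</p></body></html>"
bad = "<html><head><title>t</title></head><body><p>Hello</body></html>"

print(well_formedness_error(good))
print(well_formedness_error(bad))
```

A real validator goes further and confirms that every element and attribute is one that the declared version of the standard permits.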

The idea of publishing electronically strongly appealed to me, and I began writing some of my course materials in HTML in 1996. Here is the earliest one that I have kept in its original form; it's dated January 9, 1996, and has not been changed since then. It has been available continuously on the Web and has been downloaded frequently, since it happens to deal with a topic that comes up in many undergraduate computer-science classes and can be somewhat confusing.

Although most browsers display this document correctly, I am sorry to report that it is not quite valid, even under the HTML standard that was current in 1996. Let's look at the source file that contains the HTML markup. The document is defective in three ways. First, an HTML document is supposed to begin with a ``document type declaration,'' which basically points to a World Wide Web Consortium document that lists the markup commands that the document uses. The document type declaration also specifies which version of the HTML standard is intended. (Although most of the differences are minor, HTML did go through some substantive changes before reaching its current plateau of stability.)

Secondly, an HTML document should have, somewhere in its header, something to indicate which of the various possible character encodings the file uses. I didn't specify any encoding.

Thirdly, I took advantage of a feature that some browsers provide, allowing me to specify that a particular shade of grayish-white be used as the background against which the text is displayed. This feature is not recognized in the HTML standard.
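
The three defects just described can be looked for mechanically. The following Python sketch uses the standard library's HTML parser to report them. It rests on assumptions that go beyond the text above: that the encoding would be declared in a `meta` element, and that the nonstandard background feature is a `bgcolor` attribute on `body`. The class name, function name, and sample document are hypothetical, and the sample is not the 1996 handout itself.

```python
from html.parser import HTMLParser


class DefectFinder(HTMLParser):
    """Record whether a document has the three features discussed above."""

    def __init__(self):
        super().__init__()
        self.has_doctype = False
        self.has_charset = False
        self.has_bgcolor = False

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE ...>.
        if decl.lower().startswith("doctype"):
            self.has_doctype = True

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before this call.
        attributes = dict(attrs)
        if tag == "meta":
            content = (attributes.get("content") or "").lower()
            if "charset" in attributes or "charset=" in content:
                self.has_charset = True
        # Assumption: the background shade was set with body's bgcolor.
        if tag == "body" and "bgcolor" in attributes:
            self.has_bgcolor = True


def find_defects(source):
    """Return a list of plain-language descriptions of the defects found."""
    finder = DefectFinder()
    finder.feed(source)
    finder.close()
    defects = []
    if not finder.has_doctype:
        defects.append("no document type declaration")
    if not finder.has_charset:
        defects.append("no character encoding specified")
    if finder.has_bgcolor:
        defects.append("background color set with a bgcolor attribute")
    return defects


# Hypothetical sample with all three defects.
sample = """<html>
<head><title>Sample handout</title></head>
<body bgcolor="#F8F8F8">
<p>Text of the handout.</p>
</body>
</html>"""

print(find_defects(sample))
```

A check of this kind is far narrower than validation; it looks only for the three problems named here and says nothing about the rest of the markup.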

To correct these defects, I would have to change the document markup in at least two places. Moreover, there were some significant changes in the HTML standard between the date of this document and the adoption of the current version of the standard in 2007, so again I am afraid that technology may defeat my attempt to preserve course materials unchanged through time. It is quite possible that in a few more years browsers may distort Web documents that are invalid or severely obsolete or may even refuse to display them.

However, I wouldn't make any of these mistakes today, and I am hopeful that HTML source files that I have written in the last eleven years, since I began using document type declarations uniformly, will continue to be publishable without change, at least until I retire from teaching. I now use HTML for some course papers, particularly those not containing mathematical formulas, and TeX for the rest. It is possible to distribute documents formatted by TeX through the World Wide Web; however, the recipient must have either an appropriate printer or special viewing software that draws the document on a computer display.

The HTML standard is quite straightforward. Though not short, it can be read and mastered in a few hours. There are many textbooks, tutorials, and reference guides for the assistance of learners. HTML is not extensible, so the number of commands to be learned is finite. These things being so, it is astonishing and disappointing to find that most of the documents on the World Wide Web contain dozens, even hundreds, of HTML errors. Even documents prepared by professionals -- even those at educational institutions -- typically contain numerous easily correctable blunders.

There are several reasons for this unfortunate situation, but the one that is most relevant to the issues that I'm discussing today is that most Web documents are prepared with software tools, such as Microsoft FrontPage, that purposely generate incorrect HTML. Other Microsoft products, such as Internet Explorer, are written to accept and interpret the incorrect constructions, and can therefore display and print Web documents produced by FrontPage. Non-Microsoft browsers, on the other hand, may either distort such documents or fail to render them completely; alternatively, they may just give in and try to copy Internet Explorer's rendering, which is arguably worse.

Since Microsoft frequently adjusts its repertoire of proprietary HTML violations, requiring users of Internet Explorer to upgrade to the latest version in order to be able to read Web documents prepared by Microsoft's HTML-generating applications, I do not expect such documents to have any archival value. It appears that they are designed to break at Microsoft's pleasure and convenience. I remain puzzled by the widespread willingness to use software tools that so obviously work against the preservation of knowledge.



This document is available on the World Wide Web as

http://www.cs.grinnell.edu/~stone/essays/keeping-stuff/html.xhtml


created March 19, 2001
last revised February 10, 2009

John David Stone (stone@cs.grinnell.edu)