First steps in HTML

for the HTML Workshop

April 19, 1995 * Grinnell College


Part 1: Word processors, text editors, and markup

From the author's point of view, a document in the Hypertext Markup Language (HTML) is a text file containing ordinary prose with exotic-looking decorations. Most of the file consists of ordinary words, lines, and paragraphs of text, but they are interspersed with markup: special symbols that direct the activity of some computer program that will eventually display or print the document, beautifully formatted.

If you use a word-processing program like Word Perfect or Microsoft Word, you may know that such programs insert their own markup into documents more or less automatically, so that they too can display and print the documents beautifully formatted. Normally, word-processing markup is invisible, although if you're an experienced user you may know a command that makes it possible to see and edit the markup as well as the text. However, the markup is present even when you don't see it, and for a good reason: The word-processing program is not clever enough to deduce, from the text itself, how to lay out the words on a page appropriately. The markup is needed because it controls the formatting.

HTML documents must also include markup. There are some special-purpose editors that insert HTML-style markup automatically; one of them is HoTMetaL. (A Windows implementation of HoTMetaL, copyrighted but free to academic institutions, can be obtained from the National Center for Supercomputing Applications [NCSA] at the University of Illinois at Urbana-Champaign.) But most of you will initially be writing your HTML documents with text editors rather than word-processing programs, and so you will have to insert the markup by hand, typing it in along with the rest of the text. On MathLAN, the GNU Emacs and vuepad editors belong to this category.

In a text editor, the markup is always visible, and you have to learn to read around and through it if you're trying to look at the text. Fortunately, the markup is not visible when you're displaying the page with an HTML display program. World Wide Web client programs such as Mosaic or Netscape can be used for this purpose, since one of their principal functions is to display WWW pages, which are usually written as HTML documents. When you're editing an HTML document on a computer with a large enough monitor and high enough resolution, it's useful to keep two windows open side by side--one containing the editor, the other the HTML display program. By saving the document in the editor and reloading it into the display program, you can see the effects of your changes as you work, almost as if you were using a word processor instead of a text editor.

1.1. Tags

HTML-style markup consists mostly of tags. Most tags come in pairs, a start tag and an end tag, to be placed at either end of a region of text that is to receive some kind of special formatting. For instance, when a word in the middle of a stretch of text is to be emphasized, the start tag <em> is placed at the beginning of the word and the corresponding end tag </em> at the end, thus:

<em>emphasis</em>

Every tag is enclosed by the characters < and >. The fact that these characters have special meanings implies that you can't use them in their ordinary roles. In most documents, they won't be missed; if you absolutely have to be able to display them, type the four-character sequence &lt; into your HTML file to get <, and the four-character sequence &gt; to get >. Since this convention makes the ampersand character, &, special as well, you have to type the five-character sequence &amp; to get it. Similar character sequences can be used for accented letters; for instance, type &ntilde; to get ñ.
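As a small illustration of these character sequences, here is how a line containing all four of them might be typed. The sentence itself is invented for this example.

```html
Since 3 &lt; 5, it follows that 5 &gt; 3; ask Brown &amp; Nu&ntilde;ez.
```

A display program should show this line as: Since 3 < 5, it follows that 5 > 3; ask Brown & Nuñez.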

End tags always look like the corresponding start tags except that in an end tag there is a slash character immediately after the <. Here are some other pairs of tags that result in special type styles for the text they enclose:

<strong>strong emphasis</strong>
<i>italics</i>
<b>boldface</b>
<tt>typewriter font</tt>

Other pairs of tags indicate that some stretch of text is to be configured on the page in some special way. For instance, each paragraph of your document should be placed between the tags <p> and </p>, and a long, indented quotation in the middle of a document should be set off by the tags <blockquote> and </blockquote>. Here's how a passage containing a block quotation might look in a text editor after you've typed it in:

<p>
In its 1994-95 catalog, Grinnell College describes its
mission thus:
</p>

<blockquote>
<p>
Grinnell College is committed to liberal education in the
arts and sciences.  Seeing knowledge as an end to be
pursued for its own sake and acknowledging the sense of
achievement and the pleasure that comes with learning,
Grinnell wants students to experience the confidence that
proceeds from thinking clearly, logically, and imaginatively.
</p>
</blockquote>

<p>
When knowledge is pursued for its own sake, what is the
effect on the pursuer?  Does one become wiser or happier from
the pursuit itself, or only from the achievement of its goal?
The founders of the College must have recognized that ...
</p>

Here's how the preceding passage looks when formatted:

In its 1994-95 catalog, Grinnell College describes its mission thus:

Grinnell College is committed to liberal education in the arts and sciences. Seeing knowledge as an end to be pursued for its own sake and acknowledging the sense of achievement and the pleasure that comes with learning, Grinnell wants students to experience the confidence that proceeds from thinking clearly, logically, and imaginatively.

When knowledge is pursued for its own sake, what is the effect on the pursuer? Does one become wiser or happier from the pursuit itself, or only from the achievement of its goal? The founders of the College must have recognized that ...

Normally, text is displayed left-justified; however, some HTML display programs will perform center justification on text placed between the tags <center> and </center>. (Others will ignore these tags.)

Part 2: The structure of an HTML document

A computer program that can process HTML formatting commands sees the entire document as consisting of regions of text delimited by tags. Let's look now at the overall structure of an HTML document.

The first and last lines of an HTML document are the tags <html> and </html>, which mark the entire text as something that an HTML processor can expect to be able to interpret.

Between these tags, there are two other pairs: <head> and </head>, and <body> and </body>. The regions that these pairs of tags mark off must not overlap, and usually they will together take in the entire document, giving it this structure:

<html>
<head>

... (header information) ...

</head>
<body>

... (all the interesting text) ...

</body>
</html>

These tags will appear, in this order, in every HTML document you write.

2.1. The header

The header information (anything placed between <head> and </head>) is not displayed as part of the document, but can be used by other software tools (such as programs for automatic indexing or compilation of bibliographies) to obtain information about the document. For example, regardless of whether you want to display a title at the beginning of your document, you can (and should) include a title in the header information, so that both human readers and computer programs will have a concise way to refer to the document. Place the title in the header information region, enclosing it between the tags <title> and </title>. For instance, the line

<title>First steps in HTML</title>

appears in the header information for this handout.

I recommend that you also include a by-line in the header information. Mine looks like this:

<link rev="made" href="mailto:stone@cs.grinnell.edu">

This is one long tag (you can tell because it's all enclosed in one set of angle brackets). Unlike the tags we've seen before, it does not start or end a region of text; it stands alone. Also, unlike the tags discussed above, a <link> tag can incorporate "attributes" that qualify or supplement the tag's basic meaning.

The <link> tag that I use for my by-line has two attributes, rev and href. To incorporate an attribute into a tag, you write an equation inside the tag that has the name of the attribute on the left and a word or phrase enclosed in quotation marks (conventionally, double quotation marks) on the right-hand side.

Href stands for "hypertext reference," and the phrase that I've placed between the quotation marks on the right side of the href equation is my electronic-mail address; it is preceded by mailto: to indicate that someone who follows the reference that I've provided--who "moves through" that hypertext link--will send an e-mail message to me. (From the point of view of a reader, activating that link causes a window to appear, into which the reader can type the message that she wishes to send.)

The rev attribute describes the relationship between the person or object to which the hypertext reference refers (me) and the current document. The phrase rev="made" indicates that I am the author of the document and corresponds to the word "by" at the top of a magazine article.
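Putting the two pieces together, the complete header region of this handout might look like the sketch below. The title and the by-line are the ones quoted above; the order and arrangement of the lines are my assumption.

```html
<head>
<title>First steps in HTML</title>
<link rev="made" href="mailto:stone@cs.grinnell.edu">
</head>
```

Nothing in this region appears in the display window itself.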

2.2. The body

Now for the body of the document. The internal structure of each document is determined by its author, but HTML provides some support for the classical structure (chapters, sections, subsections) by formatting the titles and headings of these components. Actually, HTML recognizes six levels of structure, and provides a different kind of format for the title and heading of each one. The levels are numbered from 1 to 6, with 1 being the document as a whole, 2 being its major divisions (chapters, perhaps), 3 being the subdivisions (sections), and so on down.

The title of the entire document, if it is to be displayed or printed at the top, should be enclosed by the tags <h1> and </h1> ("level-one heading"). A level-one heading is printed in enormous type and set off with white space above and below. It need not appear at the beginning of the body; the same font and heading style will be used wherever it appears. Moreover, you can have as many level-one headings as you like; HTML does not force any particular internal structure on you.

The titles of smaller components are enclosed in analogous pairs of tags (<h2> and </h2>, <h3> and </h3>, and so on down to <h6> and </h6>). At each level, the font gets smaller and the heading less impressive. If you find the level-one heading font excessive, you can start off at one of the lower levels; for instance, the title at the top of this document is actually written as a level-two heading. However, it is regarded as bad style to descend more than one level at a time (a subheading under a level-two heading should be at level three, not level four).
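Here is a sketch of how headings at three levels might be laid out, descending one level at a time. The document and its headings are invented for illustration.

```html
<h1>A field guide to frogs</h1>

<h2>Chapter 1: Tree frogs</h2>

<h3>1.1. Habitat</h3>

<p>
Tree frogs spend much of their time above the ground.
</p>

<h3>1.2. Calls</h3>

<h2>Chapter 2: True frogs</h2>
```

Notice that it is fine to climb back up from level three to level two; the style rule concerns only skipping levels on the way down.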

HTML display programs like Mosaic and Netscape perform their own word wrapping and justification ("filling") of lines. They typically try to put as many words onto a line as possible before going on to the next line. If you want to prevent this, you can insert the stand-alone tag <br> at any point in the text; an HTML display program will then terminate the line right there and resume on the next line. Alternatively, if you have a long stretch of carefully formatted lines, you can surround that region with the tags <pre> and </pre>, and the HTML display will respect all of your line breaks. (It will also switch to some fixed-spacing font, such as a typewriter font, so that characters that are vertically aligned in your carefully formatted input are also vertically aligned on screen.)
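A short sketch of both techniques follows. The address lines and the little table are invented for this example.

```html
<p>
Grinnell College<br>
Grinnell, Iowa
</p>

<pre>
Level   Tags
  1     h1
  2     h2
  3     h3
</pre>
```

In the first region the <br> tag forces "Grinnell, Iowa" onto a line of its own. In the second, the columns stay aligned because the display program keeps every space and line break exactly as typed.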

Except between <pre> and </pre>, HTML ignores white space, such as multiple spaces, tab characters, and blank lines. You must therefore insert markup every time you want to begin a new paragraph. (It is not necessary, however, to insert <p> before or after a heading, and some authorities consider it poor style.)

One other way of signalling a break in the flow of text is available: the horizontal rule. The stand-alone tag <hr> causes the HTML display program to draw a line across the display page where it occurs. (There is a horizontal rule near the beginning of this document, just above the "Part 1" header; there are three more near the end.)
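To draw the pieces of Part 2 together, here is a sketch of a small but complete document. Its title and text are invented; only the tags come from the discussion above.

```html
<html>
<head>
<title>Workshop notes</title>
</head>
<body>

<h1>Workshop notes</h1>

<p>
This paragraph is typed on
several short lines, but a display program
will fill it as a single paragraph.
</p>

<hr>

<h2>Questions</h2>

<p>
Blank lines and     extra spaces in the file are ignored.
</p>

</body>
</html>
```

Saving this text in a file and loading it into a display program is a reasonable first experiment: the title should not appear in the text of the page, and the horizontal rule should separate the two sections.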

2.3. Lists

Word-wrapping and justification are also suspended in one other situation. You can indicate that you want a list of items, with each item starting on a new line, by placing the start tag <ul> at the beginning of the list, the item tag <li> before each item, and the end tag </ul> after the last one. The result looks like this in the editor:

<ul>
<li>First item
<li>Second item
<li>Third item
<li>Last item
</ul>

and like this when displayed:

    • First item
    • Second item
    • Third item
    • Last item

Notice that each item is preceded by a bullet and indented. Numerals will appear instead of bullets if you use the tags <ol> and </ol> at the beginning and end of the list (ol = ordered list; ul = unordered list).
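For comparison, here is a sketch of an ordered list. The items are invented, loosely following the edit-and-reload routine described in Part 1.

```html
<ol>
<li>Type the document in a text editor.
<li>Save the file.
<li>Reload it in the display program.
</ol>
```

A display program numbers the items itself, so you can insert or delete an item without renumbering the rest by hand.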

Part 3: Anchors and links

Any word or phrase in an HTML document can be made into a hypertext link to an Internet resource, such as a WWW page (probably another HTML document), a Gopher server, an ftp server, a telnet session, or an electronic-mail address. (Don't worry if you've never heard of Gopher, ftp, and telnet. You can learn about them if and when you need them.)

The word or phrase through which the link can be activated (the anchor for the link) is enclosed by the tags <a> and </a>. Conventionally, both tags are placed right up next to the anchor, with no intervening blanks, thus:

<a href="http://www.cs.grinnell.edu/home.html">This is the anchor.</a>

The start tag for the anchor always includes an equation for the href attribute. The right-hand side of this equation is a Uniform Resource Locator (URL) and specifies how to get to whatever is at the other end of the link. The http: part specifies the kind of connection that will be made when the link is activated (the "Hypertext Transfer Protocol" will be used); in its place, one might alternatively write

mailto: for an electronic-mail connection,
gopher: to connect to a Gopher server,
telnet: to open a telnet session, or
ftp: to connect to an ftp server.
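Anchors using these other kinds of connection might look like the sketch below. Apart from the mailto: address, which appears earlier in this handout, the machine names are placeholders that I have made up; they do not refer to real servers. (The line beginning <!-- is a comment, which display programs skip.)

```html
<!-- The example.org host names are placeholders, not real servers. -->
<a href="mailto:stone@cs.grinnell.edu">Send mail to the author.</a>
<a href="gopher://gopher.example.org/">Browse the Gopher menu.</a>
<a href="telnet://library.example.org/">Log in to the library catalog.</a>
<a href="ftp://ftp.example.org/pub/">Look through the ftp archive.</a>
```

Note that a mailto: URL has no double slash; what follows the colon is simply the e-mail address.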

The //www.cs.grinnell.edu/home.html part of the tag says that the document at the other end of the link is stored in a file named home.html on a machine known on the Internet as www.cs.grinnell.edu. When the link is activated (perhaps by a reader who moves the mouse pointer onto the anchor and presses a mouse button), the reader's computer will send a request for the document to www.cs.grinnell.edu, which will respond by sending it back to be displayed.

In order to write the start tag for an anchor, you'll need the URL of the document that you want to link to. Often it's a document that you've found by searching on the World Wide Web; if so, your WWW client software probably displays the document's URL when you are looking at the page, and you can just copy it into the tag.

Sometimes, however, you want to link to another HTML document that you have written and stored in a file somewhere. The details of the URL that you'll need vary from machine to machine. For example, on www.cs.grinnell.edu, an author with the username spelvin would refer to a file named frogs.html in his public_html subdirectory by means of the URL

http://www.cs.grinnell.edu/~spelvin/frogs.html

Here ~spelvin/ is a concise way to refer to the directory (the folder, that is) in which the user named spelvin keeps his publicly accessible HTML files. On www.cs.grinnell.edu, this is conventionally the public_html directory. Various machines name and manage files in various ways, so you'll have to learn the conventions of your particular World Wide Web server--the machine through which your HTML documents will be made available.

3.1. Links within a document

If a document is long, you may want to include, in one passage, a hypertext reference to another passage that is much earlier or much later in the same document. For instance, this is one way of including footnotes in a document; textually, they can all be placed at the end, but the main body of the document can include forward references to them, and the curious reader who wants to jump ahead to the footnote can activate the link by clicking on the anchor. In such a case, the href attribute can be much simpler. Here's what a passage containing such an anchor looks like in your text editor:

As Richard Mitchell says in <i>The gift of fire</i>, "Problem-solving is a wonderful device, and fun, but it ought to be kept in its place" <a href="#mitchell">(note 1)</a>.

And here is what it looks like in the display program:

As Richard Mitchell says in The gift of fire, "Problem-solving is a wonderful device, and fun, but it ought to be kept in its place" (note 1).

Here mitchell is a name for the point in the text that you want to have at the other end of the hypertext reference. You can choose the name of an internal hypertext reference to suit yourself. In a hypertext reference, the mesh character, #, is placed before the name.

In this case, naturally, you're also responsible for the other end of the connection. At the point in the text that is being referred to in the anchor shown above, you insert a second anchor, this one with a name attribute instead of an href attribute. The one in this document looks like this:

<a name="mitchell">1</a>

This makes the numeral 1 into an anchor, and gives that anchor the name mitchell so that it can be used as a target for hypertext references elsewhere in the document. (Clicking on an <a name=...> anchor has no effect, since there's no link in the anchor for the hypertext system to follow.)
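Seen together, the two ends of such a connection might be typed as follows. The return link and its name, mitchell-ref, are assumptions added for illustration; they are not copied from this document's own markup. (The lines enclosed in <!-- and --> are comments, which display programs skip.)

```html
<!-- In the main text: a link forward to the note.  The anchor is also
     given a name, so that the note can link back to this spot.
     "mitchell-ref" is a made-up name. -->
<p>
"Problem-solving is a wonderful device, and fun, but it ought to be
kept in its place"
<a name="mitchell-ref" href="#mitchell">(note 1)</a>.
</p>

<!-- At the end of the document: the note, with a link back. -->
<p>
<a name="mitchell">1</a>.  Richard Mitchell, <i>The gift of fire</i>,
p. 84.  <a href="#mitchell-ref">Return to the text.</a>
</p>
```

A single anchor may carry both a name and an href, as the first one does here; it then serves as a target and as a link at once.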

Part 4: Graphics

A graphic can be associated with an HTML document in either of two ways. An in-line graphic becomes part of the display, right along with the text (that is, in the same display window). You can see the effect on my "front-door" page (http://www.cs.grinnell.edu/~stone/), which includes, as an in-line graphic, a photograph taken with a digital camera by Margaret Rauber of the Grinnell College Office of Public Relations. This is the best way of integrating the graphic with the text.

The <img> tag is used to incorporate an in-line graphic into an HTML document. The graphic should be stored in a separate file, and you write the URL for that file on the right-hand side of the equation for the src ("source") attribute of the <img> tag. For example, the URL

http://www.cs.grinnell.edu/traditions/Noyce-photo.gif

contains a GIF-format black-and-white photograph of Robert N. Noyce '49. To include that photograph in an HTML document, one would write

<img src="http://www.cs.grinnell.edu/traditions/Noyce-photo.gif">

at the appropriate point of insertion.
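In context, the tag might sit inside a paragraph, as in the sketch below. The alt attribute is an addition of mine that the discussion above does not cover: it supplies a short piece of text for display programs that cannot show graphics. The wording of the caption is also invented.

```html
<p>
<img src="http://www.cs.grinnell.edu/traditions/Noyce-photo.gif"
     alt="[Photograph of Robert N. Noyce]">
Robert N. Noyce, Grinnell College class of 1949.
</p>
```

Like <link>, the <img> tag stands alone; there is no </img> end tag.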

Unfortunately, in-line graphics must be in one of two formats: GIF ("Graphics Interchange Format") or XBM (X Windows bitmap format). If the image you want to attach to your HTML document is in some other format, and you can't or don't want to convert it, you can still put a link to it into your document, making it an external graphic. When a reader's HTML display program encounters a link to an external graphic, it looks around for an auxiliary program called a viewer that can be used to display such graphics. If the HTML display program finds an appropriate viewer, it starts the viewer running, recovers the graphic from whatever machine is providing it, and hands it over to the viewer, which displays it in another window. If the reader's HTML display program can't find a viewer, she's out of luck; you've made the graphic available, but she doesn't have the resources to look at it.

This means that in addition to the HTML display program, a computer that is equipped to see all that there is to see on the World Wide Web must have many viewers, and the HTML display program has to be able to find them and start them up automatically. How to arrange this depends on what kind of a machine you have. The good news is that for educational institutions viewers are often freeware. See the NCSA Mosaic Frequently Asked Questions page for more details.

For an external graphic, you don't use <img>; instead, you provide an anchor in the text with a hypertext reference to the file containing the graphic. When the reader activates the link, the appropriate viewer is started if it is available. For instance, if you're reading this through an HTML display program on a machine that also has a viewer for TIFF-format files, you can see the Noyce photograph at full size by clicking on the anchor in this sentence. Here is what the anchor and the associated markup look like:

<a href="http://www.cs.grinnell.edu/traditions/Noyce-photo.tiff">Noyce photograph</a>

Files containing audio recordings, movies, documents prepared by other formatting systems, and so on are treated as external graphics: If the reader's computer has the appropriate software to display, emit, or play the contents of the file, the hypertext reference will run it.
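A link to an audio recording is therefore written just like the link to the TIFF file. In this sketch the URL and the file name are placeholders of my own; no such recording is known to exist. (The first line is a comment, which display programs skip.)

```html
<!-- Placeholder URL: the host and the file are hypothetical. -->
<a href="http://www.example.org/sounds/frog-chorus.au">frog chorus (audio recording)</a>
```

Mentioning the kind of file in the anchor text, as here, warns readers who lack the necessary software before they activate the link.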

Part 5: Adapting existing documents

You may already have a stock of worthwhile documents, prepared before you learned about HTML, that you would like to make available through the World Wide Web. You would prefer not to rewrite them from scratch; ideally, you'd rather not even retype them. Is there any way to avoid inserting all the HTML markup?

The answer is yes. Programs like Mosaic and Netscape expect HTML markup only in files that contain the tags <html> and </html>. Text files lacking these tags will still be displayed, but in a no-frills format: a fixed-width font, no fancy headings, no graphics, no links.

However, if you created your files with Word Perfect or Microsoft Word, you will have to remove all the alien markup associated with those programs from your documents in order for Mosaic or Netscape to display them. Check the documentation for your word processing software; it may well provide a way to do this automatically. You want the files to be stored in "ASCII text" format.

Eventually, new versions of these well-known word-processing programs will include an "HTML mode," so that you can use them to prepare HTML documents. There will also be conversion programs that translate, say, Word Perfect markup into HTML markup. One potentially useful conversion program that is already available is rtftohtml. If you save your Word Perfect or Microsoft Word document as a "Rich Text Format" (RTF) file, rtftohtml will strip out the RTF markup and put HTML markup in its place. (That's the claim, anyway; I haven't used this program myself and cannot confirm that it performs as advertised.) You can find out more about rtftohtml at the WWW page

ftp://ftp.cray.com/src/WWWstuff/RTF/rtftohtml_overview.html

Part 6: Beyond the first steps

There are many World Wide Web pages dealing with HTML. Here are a few that I found useful while I was preparing for this workshop:

Each of these pages contains links to still more documentation.

I also consulted


1. Richard Mitchell, The gift of fire (New York: Simon & Schuster, Inc., 1987), p. 84. If you're reading this document electronically, you may have reached this footnote by clicking on the anchor "(note 1)" earlier in this document. You came here because the numeral at the beginning of this note is enclosed in the tags <a name="mitchell"> and </a>. Click here to return to that earlier point in the document.


This document is available on the World Wide Web as

http://www.cs.grinnell.edu/~stone/events/html-workshop/first-steps.html


created April 11, 1995
last revised December 15, 2003

John David Stone (stone@cs.grinnell.edu)