From the author's point of view, a document in the Hypertext Markup Language (HTML) is a text file containing ordinary prose with exotic-looking decorations. Most of the file consists of ordinary words, lines, and paragraphs of text, but they are interspersed with markup: special symbols that direct the activity of some computer program that will eventually display or print the document, beautifully formatted.
If you use a word-processing program like Word Perfect or Microsoft Word, you may know that such programs insert their own markup into documents more or less automatically, so that they too can display and print the documents beautifully formatted. Normally, word-processing markup is invisible, although if you're an experienced user you may know a command that makes it possible to see and edit the markup as well as the text. However, the markup is present even when you don't see it, and for a good reason: The word-processing program is not clever enough to deduce, from the text itself, how to lay out the words on a page appropriately. The markup is needed because it controls the formatting.
HTML documents must also include markup. There are some special-purpose editors that insert HTML-style markup automatically; one of them is HoTMetaL. (A Windows implementation of HoTMetaL, copyrighted but free to academic institutions, can be obtained from the National Center for Supercomputing Applications [NCSA] at the University of Illinois at Urbana-Champaign.) But most of you will initially be writing your HTML documents with text editors rather than word-processing programs, and so you will have to insert the markup by hand, typing it in along with the rest of the text. On MathLAN, the GNU Emacs and vuepad editors belong to this category.
In a text editor, the markup is always visible, and you have to learn to read around and through it if you're trying to look at the text. Fortunately, the markup is not visible when you're displaying the page with an HTML display program. World Wide Web client programs such as Mosaic or Netscape can be used for this purpose, since one of their principal functions is to display WWW pages, which are usually written as HTML documents. When you're editing an HTML document on a computer with a large enough monitor and high enough resolution, it's useful to keep two windows open side by side--one containing the editor, the other the HTML display program. By saving the document in the editor and reloading it into the display program, you can see the effects of your changes as you work, almost as if you were using a word processor instead of a text editor.
HTML-style markup consists mostly of tags. Most tags come in
pairs, a start tag and an end tag, to be placed at either end of a region
of text that is to receive some kind of special formatting. For instance,
when a word in the middle of a stretch of text is to be
emphasized, the start tag <em> is placed at the
beginning of the word and the corresponding end tag
</em> at the end, thus:
<em>emphasis</em>
Every tag is enclosed by the characters < and
>. The fact that these characters have special meanings
implies that you can't use them in their ordinary roles. In most
documents, they won't be missed; if you absolutely have to be able to
display them, type the four-character sequence &lt; into
your HTML file to get <, and the four-character sequence
&gt; to get >. Since this convention
makes the ampersand character, &, special as well, you
have to type the five-character sequence &amp; to get it.
Similar character sequences can be used for accented letters; for
instance, type &ntilde; to get ñ.
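As a small illustration of these escape sequences, here is a sketch of one paragraph that needs all four of them. The sentence is invented for this example; only the character sequences themselves come from the discussion above.

```html
<!-- Invented sentence, for illustration only -->
<p>
In arithmetic, 3 &lt; 5 and 5 &gt; 3; the firm of Pe&ntilde;a &amp; Sons
agrees.
</p>
```

An HTML display program should show this as: In arithmetic, 3 < 5 and 5 > 3; the firm of Peña & Sons agrees.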
End tags always look like the corresponding start tags except that in an
end tag there is a slash character immediately after the <.
Here are some other pairs of tags that result in special type styles for the
text they enclose:
<strong>strong emphasis</strong>
<i>italics</i>
<b>boldface</b>
<tt>typewriter font</tt>
Other pairs of tags indicate that some stretch of text is to be configured
on the page in some special way. For instance, each paragraph of your
document should be placed between the tags <p> and
</p>, and a long, indented quotation in the middle of a
document should be set off by the tags <blockquote> and
</blockquote>. Here's how a passage containing a block
quotation might look in a text editor after you've typed it in:
<p>
In its 1994-95 catalog, Grinnell College describes its mission thus:
</p>
<blockquote>
<p>
Grinnell College is committed to liberal education in the arts and
sciences. Seeing knowledge as an end to be pursued for its own sake and
acknowledging the sense of achievement and the pleasure that comes with
learning, Grinnell wants students to experience the confidence that
proceeds from thinking clearly, logically, and imaginatively.
</p>
</blockquote>
<p>
When knowledge is pursued for its own sake, what is the effect on the
pursuer? Does one become wiser or happier from the pursuit itself, or only
from the achievement of its goal? The founders of the College must have
recognized that ...
</p>
Here's how the preceding passage looks when formatted:
In its 1994-95 catalog, Grinnell College describes its mission thus:
Grinnell College is committed to liberal education in the arts and sciences. Seeing knowledge as an end to be pursued for its own sake and acknowledging the sense of achievement and the pleasure that comes with learning, Grinnell wants students to experience the confidence that proceeds from thinking clearly, logically, and imaginatively.
When knowledge is pursued for its own sake, what is the effect on the pursuer? Does one become wiser or happier from the pursuit itself, or only from the achievement of its goal? The founders of the College must have recognized that ...
Normally, text is displayed left-justified; however, some HTML display
programs will perform center justification on text placed between the tags
<center> and </center>. (Others will
ignore these tags.)
A computer program that can process HTML formatting commands sees the entire document as consisting of regions of text delimited by tags. Let's look now at the overall structure of an HTML document.
The first and last lines of an HTML document are the tags
<html> and </html>, which mark the
entire text as something that an HTML processor can expect to be able to
interpret.
Between these tags, there are two other pairs: <head> and
</head>, and <body> and
</body>. The regions that these pairs of tags mark off
must not overlap, and usually they will together take in the entire
document, giving it this structure:
<html>
<head>
... (header information) ...
</head>
<body>
... (all the interesting text) ...
</body>
</html>
These tags will appear, in this order, in every HTML document you write.
The header information (anything placed between <head> and
</head>) is not displayed as part of the document, but
can be used by other software tools (such as programs for automatic
indexing or compilation of bibliographies) to obtain information about the
document. For example, regardless of whether you want to display a title
at the beginning of your document, you can (and should) include a title in
the header information, so that both human readers and computer programs
will have a concise way to refer to the document. Place the title in the
header information region, enclosing it between the tags
<title> and </title>. For instance,
the line
<title>First steps in HTML</title>
appears in the header information for this handout.
I recommend that you also include a by-line in the header information. Mine looks like this:
<link rev="made" href="mailto:stone@cs.grinnell.edu">
This is one long tag (you can tell because it's all enclosed in one set of
angle brackets). Unlike the tags we've seen before, it does not start or
end a region of text; it stands alone. Also, unlike the tags discussed
above, a <link> tag can incorporate "attributes" that
qualify or supplement the tag's basic meaning.
The <link> tag that I use for my by-line has two
attributes, rev and href. To incorporate an
attribute into a tag, you write an equation inside the tag that has the
name of the attribute on the left and a word or phrase enclosed in
quotation marks (conventionally, double quotation marks) on the right-hand
side.
Href stands for "hypertext reference," and the phrase that
I've placed between the quotation marks on the right side of the
href equation is my electronic-mail address; it is preceded by
mailto: to indicate that someone who follows the reference
that I've provided--who "moves through" that hypertext link--will send an
e-mail message to me. (From the point of view of a reader, activating that
link causes a window to appear, into which the reader can type the message
that she wishes to send.)
The rev attribute describes the relationship between the
person or object to which the hypertext reference refers (me)
and the current document. The phrase rev="made" indicates
that I am the author of the document and corresponds to the word "by" at
the top of a magazine article.
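Putting the pieces of the header together, a complete skeleton might look like the following sketch. The title and by-line are the ones quoted above; the body is only a placeholder.

```html
<html>
<head>
<!-- Title and by-line, as described above -->
<title>First steps in HTML</title>
<link rev="made" href="mailto:stone@cs.grinnell.edu">
</head>
<body>
<!-- Placeholder body, invented for this sketch -->
<p>
... (all the interesting text) ...
</p>
</body>
</html>
```

In your own documents, replace the title and the address after mailto: with your own.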
Now for the body of the document. The internal structure of each document is determined by its author, but HTML provides some support for the classical structure (chapters, sections, subsections) by formatting the titles and headings of these components. Actually, HTML recognizes six levels of structure, and provides a different kind of format for the title and heading of each one. The levels are numbered from 1 to 6, with 1 being the document as a whole, 2 being its major divisions (chapters, perhaps), 3 being the subdivisions (sections), and so on down.
The title of the entire document, if it is to be displayed or printed at
the top, should be enclosed by the tags <h1> and
</h1> ("level-one heading"). A level-one heading is
printed in enormous type and set off with white space above and below. It
need not appear at the beginning of the body; the same font and heading
style will be used wherever it appears. Moreover, you can have as many
level-one headings as you like; HTML does not force any particular internal
structure on you.
The titles of smaller components are enclosed in analogous pairs of tags
(<h2> and </h2>,
<h3> and </h3>, and so on down to
<h6> and </h6>). At each level, the
font gets smaller and the heading less impressive. If you find the
level-one heading font excessive, you can start off at one of the lower
levels; for instance, the title at the top of this document is actually
written as a level-two heading. However, it is regarded as bad style to
descend more than one level at a time (a subheading under a level-two
heading should be at level three, not level four).
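Here is a rough sketch of how the heading levels might nest in the body of a document. The heading texts are made up for illustration; what matters is that each step down the hierarchy descends exactly one level.

```html
<!-- Hypothetical outline; all heading texts are invented -->
<h1>A field guide to campus frogs</h1>
<h2>Chapter 1: Habitats</h2>
<h3>Ponds</h3>
<p>
Text of the section on ponds.
</p>
<h3>Streams</h3>
<p>
Text of the section on streams.
</p>
<h2>Chapter 2: Calls</h2>
<p>
Text of the second chapter.
</p>
```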
HTML display programs like Mosaic and Netscape perform their own word
wrapping and justification ("filling") of lines. They typically try to put
as many words onto a line as possible before going on to the next line. If
you want to prevent this, you can insert the stand-alone tag
<br> at any point in the text; an HTML display program
will then terminate the line right there and resume on the next line.
Alternatively, if you have a long stretch of carefully formatted lines, you
can surround that region with the tags <pre> and
</pre>, and the HTML display program will respect all of your
line breaks. (It will also switch to some fixed-spacing font, such as a
typewriter font, so that characters that are vertically aligned in your
carefully formatted input are also vertically aligned on screen.)
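The two techniques can be compared in a short sketch. The address and the little table are invented; the first uses <br> to force individual line breaks, and the second uses <pre> to keep a whole block exactly as typed.

```html
<!-- Invented address: each <br> ends the line at that point -->
<p>
G. Spelvin<br>
123 Main Street<br>
Anytown
</p>

<!-- Invented table: line breaks and column spacing are kept as typed -->
<pre>
Item        Quantity
Pencils           12
Notebooks          3
</pre>
```

Because the display program uses a fixed-spacing font inside <pre>, the numerals in the second column should line up on screen as they do in the editor.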
Except between <pre> and </pre>, HTML
ignores white space, such as multiple spaces, tab characters, and blank
lines. You must therefore insert markup every time you want to begin a new
paragraph. (It is not necessary, however, to insert <p>
before or after a heading, and some authorities consider it poor style.)
One other way of signalling a break in the flow of text is available: the
horizontal rule. The stand-alone tag <hr> causes the
HTML display program to draw a line across the display page where it
occurs. (There is a horizontal rule near the beginning of this document,
just above the "Part 1" header; there are three more near the end.)
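A brief sketch may make the treatment of white space concrete. The text is invented; the point is that the blank lines in the file do nothing, while the paragraph tags and the horizontal rule do the separating.

```html
<!-- Invented text, for illustration only -->
<p>
First paragraph.
</p>


<p>
Second paragraph. The two blank lines above it in the file have no effect
on the display; the paragraph tags alone separate it from the first.
</p>
<hr>
<p>
A new topic begins below the rule.
</p>
```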
Word-wrapping and justification are also suspended in one other situation.
You can indicate that you want a list of items, with each item starting on
a new line, by placing the start tag <ul> at the
beginning of the list, the item tag <li> before each
item, and the end tag </ul> after the last one. The
result looks like this in the editor:
<ul>
<li>First item
<li>Second item
<li>Third item
<li>Last item
</ul>
and like this when displayed:
- First item
- Second item
- Third item
- Last item
Notice that each item is preceded by a bullet and indented. Numerals will
appear instead of bullets if you use the tags <ol> and
</ol> at the beginning and end of the list
(ol = ordered list; ul = unordered list).
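For comparison, here is a sketch of an ordered list. The item texts are invented, loosely following the edit-and-reload routine described earlier.

```html
<!-- Same structure as the unordered list, with ol in place of ul -->
<ol>
<li>Save the document in the editor
<li>Switch to the display program
<li>Reload the document
</ol>
```

A display program should number these items 1, 2, and 3 instead of marking them with bullets.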
Any word or phrase in an HTML document can be made into a hypertext link to an Internet resource, such as a WWW page (probably another HTML document), a Gopher server, an ftp server, a telnet session, or an electronic-mail address. (Don't worry if you've never heard of Gopher, ftp, and telnet. You can learn about them if and when you need them.)
The word or phrase through which the link can be activated (the
anchor for the link) is enclosed by the tags
<a> and </a>. Conventionally, both
tags are placed right up next to the anchor, with no intervening blanks,
thus:
<a href="http://www.cs.grinnell.edu/home.html">This is the anchor.</a>
The start tag for the anchor always includes an equation for the
href attribute. The right-hand side of this equation is a
Uniform Resource Locator (URL) and specifies how to get to whatever is at
the other end of the link. The http: part specifies the kind
of connection that will be made when the link is activated (the "Hypertext
Transfer Protocol" will be used); in its place, one might alternatively
write
mailto: for an electronic-mail connection,
gopher: to connect to a Gopher server,
telnet: to open a telnet session, or
ftp: to connect to an ftp server.
The //www.cs.grinnell.edu/home.html part of the tag says that
the document at the other end of the link is stored in a file named
home.html on a machine known on the Internet as
www.cs.grinnell.edu. When the link is activated (perhaps by a reader who
moves the mouse pointer onto the anchor and presses a mouse button), the
reader's computer will send a request for the document to
www.cs.grinnell.edu, which will respond by sending it back to be displayed.
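To show how the other kinds of connection look in practice, here is a sketch of a paragraph containing several anchors. The mailto: address is the one from the by-line in this handout; the ftp, gopher, and telnet host names are hypothetical placeholders, not real servers.

```html
<!-- Only the mailto address is real; the example.edu hosts are placeholders -->
<p>
Send comments to <a href="mailto:stone@cs.grinnell.edu">the author</a>.
Files may be fetched from
<a href="ftp://ftp.example.edu/pub/">an ftp archive</a>,
browsed through
<a href="gopher://gopher.example.edu/">a Gopher server</a>,
or examined in
<a href="telnet://library.example.edu/">a telnet session</a>.
</p>
```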
In order to write the start tag for an anchor, you'll need the URL of the document that you want to link to. Often it's a document that you've found by searching on the World Wide Web; if so, your WWW client software probably displays the document's URL when you are looking at the page, and you can just copy it into the tag.
Sometimes, however, you want to link to another HTML document that you have
written and stored in a file somewhere. The details of the URL that you'll
need vary from machine to machine. For example, on www.cs.grinnell.edu, an
author with the username spelvin would refer to a file named
frogs.html in his public_html subdirectory by means of the URL
http://www.cs.grinnell.edu/~spelvin/frogs.html
Here ~spelvin/ is a concise way to refer to the
directory (the folder, that is) in which the user named
spelvin keeps his publicly accessible HTML files. On
www.cs.grinnell.edu, this is conventionally the public_html directory.
Various machines name and manage files in various ways, so you'll have to
learn the conventions of your particular World Wide Web server--the machine
through which your HTML documents will be made available.
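Assuming the conventions just described for www.cs.grinnell.edu, an anchor that links to spelvin's file might be sketched like this. The anchor text is invented.

```html
<!-- Refers to frogs.html in spelvin's public_html directory -->
<p>
See also
<a href="http://www.cs.grinnell.edu/~spelvin/frogs.html">spelvin's page about frogs</a>.
</p>
```

On a different server, the part of the URL after the machine name would probably have to change.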
If a document is long, you may want to include, in one passage, a hypertext
reference to another passage that is much earlier or much later in the same
document. For instance, this is one way of including footnotes in a
document; textually, they can all be placed at the end, but the main body
of the document can include forward references to them, and the curious
reader who wants to jump ahead to the footnote can activate the link by
clicking on the anchor. In such a case, the href attribute
can be much simpler. Here's what a passage containing such an anchor looks
like in your text editor:
As Richard Mitchell says in <i>The gift of fire</i>, "Problem-solving is a wonderful device, and fun, but it ought to be kept in its place" <a href="#mitchell">(note 1)</a>.
And here is what it looks like in the display program:
As Richard Mitchell says in The gift of fire, "Problem-solving is a wonderful device, and fun, but it ought to be kept in its place" (note 1).
Here mitchell is a name for the point in the text that you
want to have at the other end of the hypertext reference. You can choose
the name of an internal hypertext reference to suit yourself. In a
hypertext reference, the mesh character, #, is placed before
the name.
In this case, naturally, you're also responsible for the other end of the
connection. At the point in the text that is being referred to in the
anchor shown above, you insert a second anchor, this one with a
name attribute instead of an href attribute. The
one in this document looks like this:
<a name="mitchell">1</a>
This makes the numeral 1 into an anchor, and gives
that anchor the name mitchell so that it can be used as a
target for hypertext references elsewhere in the document. (Clicking on
an <a name=...> anchor has no effect, since there's no
link in the anchor for the hypertext system to follow.)
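The two ends of an internal reference can be sketched together as follows. The name mitchell is the one used above; the name mitchell-ref, which lets the footnote link back to the passage, is made up for this sketch, and the text between the two passages is elided.

```html
<!-- The referring passage: one anchor, with both a name and an href -->
<p>
... "Problem-solving is a wonderful device, and fun, but it ought to be
kept in its place"
<a name="mitchell-ref" href="#mitchell">(note 1)</a>.
</p>

<!-- ... the rest of the document ... -->

<!-- The footnote: a named target, plus a link back to the passage -->
<hr>
<p>
<a name="mitchell">1</a>. Richard Mitchell, <i>The gift of fire</i> ...
<a href="#mitchell-ref">Return to the text</a>.
</p>
```

An anchor may carry a name attribute and an href attribute at the same time, which is what makes the round trip possible.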
A graphic can be associated with an HTML document in either of two ways.
An in-line graphic becomes part of the display, right along with
the text (that is, in the same display window). You can see the effect on
my "front-door" page
(http://www.cs.grinnell.edu/~stone/), which includes, as an
in-line graphic, a photograph taken with a digital camera by Margaret
Rauber of the Grinnell College Office of Public Relations. This is the
best way of integrating the graphic with the text.
The <img> tag is used to incorporate an in-line graphic
into an HTML document. The graphic should be stored in a separate file,
and you write the URL for that file on the right-hand side of the equation
for the src ("source") attribute of the
<img> tag. For example, the file at the URL
http://www.cs.grinnell.edu/traditions/Noyce-photo.gif
contains a GIF-format black-and-white photograph of Robert N. Noyce '49. To include that photograph in an HTML document, one would write
<img src="http://www.cs.grinnell.edu/traditions/Noyce-photo.gif">
at the appropriate point of insertion.
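In context, the tag might sit inside a paragraph, as in this sketch. The surrounding sentence is invented, and the alt attribute is an addition not discussed above: it supplies text for display programs that cannot show the graphic.

```html
<!-- alt is an optional extra, added here as a suggestion -->
<p>
<img src="http://www.cs.grinnell.edu/traditions/Noyce-photo.gif"
     alt="Photograph of Robert N. Noyce">
Robert N. Noyce graduated from Grinnell College in 1949.
</p>
```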
Unfortunately, in-line graphics must be in one of two formats: GIF ("Graphics Interchange Format") or XBM (X Windows bitmap format). If the image you want to attach to your HTML document is in some other format, and you can't or don't want to convert it, you can still put a link to it into your document, making it an external graphic. When a reader's HTML display program encounters a link to an external graphic, it looks around for an auxiliary program called a viewer that can be used to display such graphics. If the HTML display program finds an appropriate viewer, it starts the viewer running, recovers the graphic from whatever machine is providing it, and hands it over to the viewer, which displays it in another window. If the reader's HTML display program can't find a viewer, she's out of luck; you've made the graphic available, but she doesn't have the resources to look at it.
This means that in addition to the HTML display program, a computer that is equipped to see all that there is to see on the World Wide Web must have many viewers, and the HTML display program has to be able to find them and start them up automatically. How to arrange this depends on what kind of a machine you have. The good news is that for educational institutions viewers are often freeware. See the NCSA Mosaic Frequently Asked Questions page for more details.
For an external graphic, you don't use <img>; instead,
you provide an anchor in the text with a hypertext reference to the file
containing the graphic. When the reader activates the link, the
appropriate viewer is started if it is available. For instance, if you're
reading this through an HTML display program on a machine that also has a
viewer for TIFF-format files, you can see the Noyce
photograph at full size by clicking on the anchor in this sentence.
Here is what the anchor and the associated markup look like:
<a href="http://www.cs.grinnell.edu/traditions/Noyce-photo.tiff">Noyce photograph</a>
Files containing audio recordings, movies, documents prepared by other formatting systems, and so on are treated as external graphics: If the reader's computer has the appropriate software to display, emit, or play the contents of the file, the hypertext reference will run it.
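A link to an audio file is written in exactly the same way as a link to an external graphic. In this sketch the host name and file name are hypothetical placeholders.

```html
<!-- Hypothetical host and file name -->
<p>
You can listen to
<a href="http://www.example.edu/sounds/frog-call.au">a recording of a frog call</a>
if your machine has a program that plays audio files.
</p>
```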
You may already have a stock of worthwhile documents, prepared before you learned about HTML, that you would like to make available through the World Wide Web. You would prefer not to rewrite them from scratch; ideally, you'd rather not even retype them. Is there any way to avoid inserting all the HTML markup?
The answer is yes. Programs like Mosaic and Netscape expect HTML markup
only in files that contain the tags <html> and
</html>. Text files lacking these tags will still be
displayed, but in a no-frills format: a fixed-width font, no fancy
headings, no graphics, no links.
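If you later want a title or a link or two without reformatting the whole text, one possible middle path, offered only as a sketch, is to wrap the old text unchanged in <pre> tags inside a minimal HTML document. The title and the text here are invented.

```html
<html>
<head>
<title>Minutes of the March meeting</title>
</head>
<body>
<pre>
The old plain-text document goes here, exactly as it was typed,
with its line breaks
    and its indentation
left alone.
</pre>
</body>
</html>
```

Even inside <pre>, any <, >, or & characters in the old text would still have to be replaced by the escape sequences described earlier.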
However, if you created your files with Word Perfect or Microsoft Word, you will have to remove all the alien markup associated with those programs from your documents in order for Mosaic or Netscape to display them. Check the documentation for your word processing software; it may well provide a way to do this automatically. You want the files to be stored in "ASCII text" format.
Eventually, new versions of these well-known word-processing programs will include an "HTML mode," so that you can use them to prepare HTML documents. There will also be conversion programs that translate, say, Word Perfect markup into HTML markup. One potentially useful conversion program that is already available is rtftohtml. If you save your Word Perfect or Microsoft Word document as a "Rich Text Format" (RTF) file, rtftohtml will strip out the RTF markup and put HTML markup in its place. (That's the claim, anyway; I haven't used this program myself and cannot confirm that it performs as advertised.) You can find out more about rtftohtml at the WWW page
There are many World Wide Web pages dealing with HTML. Here are a few that I found useful while I was preparing for this workshop:
http://www.ora.com/gnn/bus/ora/features/html/index.html
http://kuhttp.cc.ukans.edu/lynx_help/HTML_quick.html
http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html
Each of these pages contains links to still more documentation.
I also consulted
1. Richard Mitchell, The gift of fire (New
York: Simon & Schuster, Inc., 1987), p. 84. If you're reading this
document electronically, you may have reached this footnote by clicking on
the anchor "(note 1)" earlier in this document. You came here
because the numeral at the beginning of this note is enclosed in the tags
<a name="mitchell"> and </a>. Click
here to return to that earlier point in the
document.
This document is available on the World Wide Web as
http://www.cs.grinnell.edu/~stone/events/html-workshop/first-steps.html
created April 11, 1995
last revised December 15, 2003