What are abstract data types?
An abstract data type is a collection of values and operations on those values, considered without reference to how such values might be represented or how such operations might be implemented. The implementation is what is being taken away or eliminated by abstraction.
When dealing with abstract data types, do we or do we not need to consider implementation? And what isn't an abstract data type? It seems like they all fit this mold...
The term `abstract' is not used to classify data types -- it's not that
some of them are abstract and others are concrete. An abstract data type
is one that is being considered in a particular way -- without reference to
its representation, without considering how it would be implemented. In
the handout on characters, it looks at first glance as if I'm going over
the same ground twice -- what's the difference between the operation
upcase (as specified in the first half of the handout) and the
Pascal function Upcase (as defined in the second half)? The
answer is that in the first half of the handout, I'm considering
characters as an abstract data type -- just the values and the possible
operations on them -- and in the second half I'm showing how to implement
this type fully in Pascal (at which point it's no longer an abstract data
type, because I'm no longer considering it without reference to its
representation).
Considering the data type first as an abstraction is actually a helpful programming technique, because it enables you to think about and design a selection of operations that reflects the nature of the type itself. Unless instructed to adopt this approach, most programmers ``design'' whatever they think will be easiest to implement, which is a good short-term strategy for building small programs but fails badly in the longer run -- irrelevant and accidental characteristics of a particular programming language, operating system, or networking environment show through too much.
I understand how the computer tells which integer, character, or real number it is reading. What I don't understand is how the computer knows if it is reading a character, an integer, or a real. Is there some part of a word that identifies what data type it is?
In some programming languages, such as Scheme, it is necessary for each value that is stored in memory to be tagged in the way you suggested with an indication of the type to which it belongs. Such tags are not needed in Pascal, since all the necessary type information is already available when the program is compiled.
When translating a call to the Read or ReadLn
procedure, the Pascal compiler examines each variable for which a value
must be read in and infers the type of that variable from its declaration.
(Since every Pascal variable must be declared before it is used, the
compiler will already have seen the declaration and stored the variable's
type in a ``symbol table'' that it builds and maintains during
compilation.) The compiler then generates different machine instructions,
depending on whether the program calls for the reading of a
Char, Integer, or Real value.
In other words, although it appears to the Pascal programmer that there is
only one Read procedure, there are actually three different
ones, and the compiler decides which one to use for each value that is read
by inspecting the type of the variable in which the value will be
placed.
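A minimal sketch of this dispatch, assuming nothing beyond standard Pascal; the program and variable names are hypothetical. Each of the three calls below looks the same to the programmer, but the compiler translates each one differently according to the declared type of its argument.

```pascal
program ReadDemo (Input, Output);
{ Illustrative only: Initial, Count, and Weight are made-up names. }
var
  Initial: Char;
  Count: Integer;
  Weight: Real;
begin
  { One procedure name, three translations: the compiler looks up each
    variable's declared type and generates the matching input code. }
  Read (Initial);   { takes exactly one character from the input }
  Read (Count);     { skips blanks, then converts an integer numeral }
  Read (Weight);    { skips blanks, then converts a real numeral }
  WriteLn (Initial, ' ', Count : 1, ' ', Weight : 1 : 2)
end.
```

Given the input line x 12 3.5, this program should print x 12 3.50.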
Where does the computer get the ability to distinguish different data types?
The computer has no such ability. If the programmer stores a value of type
Real in a particular location in memory, and then lies to the
computer by telling it that there is a value of type Integer
stored there, the computer will happily interpret the pattern of zeroes and
ones that occupies that storage location as if it were an integer.
Pascal makes it rather difficult for the programmer to tell such a lie (and specifies that it is an error for him to do so). In some programming languages, it is completely impossible to express the lie; in others, it is extremely easy and indeed routine -- the programmer is simply made responsible for the consequences of lying to the computer.
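One way the lie can be expressed in Pascal is through a variant record that has no tag field. The following is a sketch only: the type and field names are hypothetical, and what it prints depends entirely on how the particular implementation represents Real and Integer values.

```pascal
program Lie (Output);
{ A sketch only: View, Cell, and the field names are hypothetical. }
type
  View = (AsReal, AsInteger);
  Cell = record
    case View of                     { no tag field is stored }
      AsReal: (RealValue: Real);
      AsInteger: (IntegerValue: Integer)
  end;
var
  Storage: Cell;
begin
  Storage.RealValue := 1.0;
  { Reading the other variant is an error in standard Pascal, but it is
    one that many processors do not detect. The bits stored for 1.0 are
    simply interpreted as if they were an integer. }
  WriteLn (Storage.IntegerValue)
end.
```

No expected output is shown, because the result differs from one machine and compiler to another.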
Why is Pascal so strictly typed? Having learned Scheme two years ago, I had grown accustomed to being able to put nearly whatever I wanted into a variable, etc. I miss that flexibility, and though I recognize that most languages cannot be so dynamic, Pascal seems to be taking typing to an extreme.
There are three main reasons:
For global variables, the compiler simply starts at 0 and allocates
successive locations for successive variables. Often, as on the HPs, a few
locations may be skipped in order to allow advantageous alignments; for
instance, if the address of the next available location is 390, a variable
of type Integer might be given the address 392, leaving bytes
390 and 391 unused, because an integer can be transferred more rapidly from
the memory to a register in the central processing unit if its address is
divisible by 4.
When the compiled program is executed, the global ``addresses'' generated by the compiler are actually interpreted as offsets from a base address established by the operating system.
The addresses of parameters and local variables are handled similarly, except that their ``base addresses'' are established during program execution by the Pascal run-time system rather than by the operating system.
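The alignment arithmetic in the example above can be sketched as follows. The function name AlignUp and its parameters are hypothetical, and an actual compiler may of course compute addresses differently.

```pascal
program Alignment (Output);

{ Returns the smallest multiple of Boundary that is not less than Address.
  Both names are hypothetical; Address >= 0 and Boundary > 0 are assumed. }
function AlignUp (Address, Boundary: Integer): Integer;
begin
  AlignUp := (Address + Boundary - 1) div Boundary * Boundary
end;

begin
  WriteLn (AlignUp (390, 4) : 1);   { 392: bytes 390 and 391 are skipped }
  WriteLn (AlignUp (392, 4) : 1)    { 392: already aligned, nothing skipped }
end.
```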
Since the character digit-zero is represented by the bit-pattern 00110000 in ASCII, and the integer 48 is also 00110000, and the forty-ninth component of an enumerated type is also 00110000, how can the computer distinguish one from another?
Actually, an integer value requires thirty-two bits under HP Pascal, so the representation of the integer 48 is actually 00000000000000000000000000110000. But in general your question is a good one: If values of different data types have the same bit-pattern and occupy the same amount of storage, how can the computer tell them apart?
The answer is that it cannot. If the programmer stores the forty-ninth
value of an enumerated type in a particular location in memory, and then
lies to the computer by telling it that there is a value of type
Char stored there, the computer will happily interpret the
pattern of zeroes and ones that occupies that storage location as
digit-zero.
Pascal makes it rather difficult for the programmer to tell such a lie (and specifies that it is an error for him to do so). In some programming languages, it is completely impossible to express the lie; in others, it is extremely easy and indeed routine -- the programmer is simply made responsible for the consequences of lying to the computer.
Pascal, you have said in class, makes it difficult for the programmer to examine data as types other than Pascal thinks they are. An integer cannot be read as a real, or a char. Why is this worth mentioning? Why would somebody want to look at data as if it were something it's not?
Because sometimes it's more efficient to bypass the interface that the
designer of the data type set up. For instance, suppose that you want to
know whether the value of the Integer variable
Position is evenly divisible by 4. The designer of Pascal's
Integer data type wants you to write
Position mod 4 = 0
to find this out. This involves doing a division, which is the most time-consuming of the arithmetic operations. If you happen to know that on HPs an integer is divisible by 4 if, and only if, bits 1 and 0 of its internal representation are ``off'' -- zeroes -- and if your programming language allows you to test these bits using machine instructions that are faster than the division operation, you will be able to speed up your program by looking at the integer as a sequence of bits instead.
Another example: Suppose that you have determined that the value of the
variable Ch, of type Char, is a lower-case
letter, and you want to replace it with the corresponding capital letter.
In Pascal, you write something like
Distance := Ord ('a') - Ord ('A');
Ch := Chr (Ord (Ch) - Distance)
for this purpose. The C programming language encourages programmers to
perform arithmetic directly on characters, as if they were integers. If
you could do this in Pascal, you would write
Distance := 'a' - 'A';
Ch := Ch - Distance
Which is better? A good compiler would generate the same machine instructions in either case. The Pascal version is perhaps clearer but more cumbersome.
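For reference, here is the Pascal version wrapped in a complete function. This is a sketch rather than the definition from the handout: the name ToUpperCase is hypothetical, and the code assumes a character set such as ASCII, in which the lower-case letters are contiguous and lie at a fixed distance from the corresponding capitals.

```pascal
program UpperDemo (Output);

{ Returns the capital letter corresponding to Ch if Ch is a lower-case
  letter, and Ch itself otherwise. Assumes an ASCII-like character set. }
function ToUpperCase (Ch: Char): Char;
var
  Distance: Integer;
begin
  Distance := Ord ('a') - Ord ('A');
  if ('a' <= Ch) and (Ch <= 'z') then
    ToUpperCase := Chr (Ord (Ch) - Distance)
  else
    ToUpperCase := Ch
end;

begin
  WriteLn (ToUpperCase ('q'), ToUpperCase ('Q'), ToUpperCase ('7'))
end.
```

Under the stated assumption, the program prints QQ7.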
I was talking with my friend Greg from Madison, and the issue of Pascal's strong typing came up. I remembered you talking about how hard it is to get a sorting procedure to accept more than one type of input, and Greg had an idea of how to solve it. He thought of a function that could be added to help with the process. Before the sorting procedure is called, a function stores the data, be it chars or reals, into a binary file of that type. The function would also need to take in a symbol indicating what type the binary file contained. Then the function would have a case statement that would call the sorting procedure with the correct type, adding a code to tell it how to unravel it. I would maybe try to write this, and want your opinion on whether it is possible, or even worth the time it would take as a constant before sorting.
Let's see. You could write data of any type into a binary file, then have the operating system attach the same file to a different Pascal variable that would treat the data in the file as being of some neutral type, perhaps an array of bytes, distinguished only by its length. You could read the file into an array of objects of the neutral type, sort it, and put it back in the file, then have the operating system reattach the file to the original Pascal file and read the sorted data back in from it.
This could work. In the middle of the process, you're lying to Pascal about the nature of the data being sorted, but Pascal will never know the difference. A similar approach along the same lines, which would be more efficient because it would avoid the need for file operations, would be to define a variant record type with two variants, one the original data type, the other the byte array of the appropriate length. Store the data in the first place using the first variant; call a sorting procedure in which the data is treated as being of the type of the second variant; output the data using the first variant again.
In HP Pascal, still another possibility is available whenever the data
values are accessed through pointers. (As you may have inferred from the
modules we've recently studied, this is often the case in real Pascal
programs.) By declaring the parameter of a sorting method to be an array
of values of HP Pascal's LocalAnyPtr type, one can use the
sort with any array of pointers, regardless of their real base type.
Should I worry about remembering the difference between the types of parentheses and brackets in BNF grammar if I am comfortable with syntax diagrams?
If you can find an accurate syntax diagram for every syntactic construction that you're interested in, and if you think that such diagrams are more readable than BNF productions, then you don't need to be too concerned about the mechanics of BNF. Besides, the particular conventions that authors use to write BNF vary, so that every time you run across a BNF grammar you may have to study the context in order to figure out exactly what the parentheses, brackets, and braces mean.
However, the conventions that Cooper and Walker use aren't all that hard to learn: Parentheses are solely for grouping alternatives. Brackets enclose optional constituents. Braces enclose constituents that can be repeated 0 or more times.
To put it in terms of syntax diagrams: Parentheses disappear in syntax diagrams. Brackets correspond to temporary forks in the track, one line containing the optional construction, the other one empty; the two lines merge again afterwards. Braces correspond to loops in the track; one can go past the loop on the main track, or around the loop any number of times before proceeding.
How important is it to understand both syntax diagrams and BNF grammars? Is BNF more widely used?
BNF is far more widely used than syntax diagrams, mainly because it's easier to store BNF grammars in text files and to write programs that will read, parse, and do useful things with them.
Why does Pascal require you to compile a program before you run it? I learned to program in BASIC, and it translated as it ran. Is it that much more efficient to compile it beforehand? Or is it done to catch errors before the program is run?
Efficiency is the main reason. There used to be several Pascal processors that used interpretation rather than compilation; they would translate from Pascal to an intermediate ``P-code,'' which the user would then execute on a ``Pascal virtual machine.'' The Pascal virtual machine was a piece of software that read and interpreted P-code. The whole process was similar to the process of saving and later running a BASIC program on, say, an Apple II; BASIC programs on the Apple II were stored in a reduced form that was interpreted by the RUN command.
The translation part of the Pascal interpreter ran faster than a full compilation; the difficulty was that the Pascal virtual machine was usually very slow, perhaps five to ten times slower than compiled code. The interpreter was therefore used primarily during program development. The finished version would be run through the compiler before being released.
Pascal interpreters have now almost disappeared, because Pascal compilers are so much faster on modern computers. When translation of a thousand-line program took ninety seconds under pi (a P-code generator) and five minutes under pc, it was worth while for a student pressed for time to use pi. But when pi takes a second and a half and pc takes five seconds, what's the point?
Is it better in general to write a long program that uses lots of very simple procedures from a library or to do the same job with a short program containing very efficient but very specialized procedures?
Usually it is better to write the first version of the program with general procedures that are already available. If this first version is too inefficient or clumsy, you can rewrite it later to use more specific procedures, but in most cases rewriting is superfluous -- the first version works well enough.
When describing a procedure should I mention procedures it calls?
Only if it simplifies or clarifies the description of the current procedure. It's a question of style, not correctness; but in general the description will better reflect the modularity of the procedure if you avoid gratuitous references to other procedures.
If procedure A calls procedure B, and the
definition of procedure B is completely nested within the
definition of procedure A, then there's a somewhat sounder
argument for mentioning procedure B in the opening comment for
A; but even in this case it is more usual for the relevant
facts to be placed in the comment at the beginning of the definition of
procedure B instead.
Suppose in my program I have two layers of procedures. The first layer is
Procedure1 and Procedure2. The second layer is
Procedure11 and Procedure12,
both of which belong to Procedure1 and have nothing to do with
Procedure2. There is a data type that I want to use in
Procedure1 and as the type of a parameter of
Procedure11. This data type has nothing to do with
Procedure2 or Procedure12. Do you think I can
just declare this data type in Procedure1?
That's the ideal place for it. In general, if you have any identifier that is used only inside a procedure, it should be defined or declared inside that procedure.
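A skeleton of the arrangement you describe might look like this. The type Reading and all of the procedure bodies are hypothetical, invented only to make the example complete.

```pascal
program Layers (Output);
{ A skeleton only: Reading and the procedure bodies are made up. }

procedure Procedure1;
type
  Reading = record                 { visible only inside Procedure1 }
    Count: Integer;
    Total: Real
  end;
var
  Sample: Reading;

  procedure Procedure11 (var Item: Reading);
  begin
    Item.Count := 0;
    Item.Total := 0.0
  end;

  procedure Procedure12;
  begin
    WriteLn ('Procedure12 makes no use of Reading.')
  end;

begin
  Procedure11 (Sample);
  Procedure12
end;

procedure Procedure2;
begin
  { A reference to Reading here would be rejected by the compiler. }
  WriteLn ('Procedure2 cannot see Reading.')
end;

begin
  Procedure1;
  Procedure2
end.
```

Procedure12 can also see Reading, since it is nested inside Procedure1, but it is under no obligation to use it.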
Variable names give me trouble, especially as these programs grow longer
and more complex. Is there a set of guidelines that we might want to
follow? I miss the ability to give variables names like
player_r for player record -- somehow playerr
doesn't do it, and PlayerRecord is too many characters.
HP Pascal allows you to use the underscore as a break character, as you prefer. And it's fairly easy to write a utility that will strip all the underscores out of a program when you're ready to port it to some less tolerant implementation of Pascal.
In general, global identifiers should be very explicit, even at the expense of more typing; an abbreviation is more confusing if the point at which it is used is far away from the point at which it is defined, as often happens with globals. If you're going to use a short, cryptic identifier, be sure to attach an explanatory comment at the point where it is defined.
How important are variable names? I have noticed that some programmers use very elaborate variable names, making the program easier to decipher yet a lot messier and harder to read, while other programs I have seen in books are written with very basic variable names, often just letters. Do you feel it is far better to use elaborate variable names? If so, do you tend to take points off for programs that don't use elaborate names?
Choosing meaningful identifiers is a fundamental part of good programming style. Often I have found that an author's poor programming style undermines an otherwise excellent textbook -- and the most common error in such cases is the use of cryptic identifiers, often arbitrarily selected single letters. A case in point is Robert Sedgewick's Algorithms, which we've tried to use twice in different courses in the department, because the prose explanations are so good. Both times, the students found the accompanying source code unintelligible -- useless at worst, unhelpful at best.
The original cause of the problem is that the first programming language of many programmers was either FORTRAN or some rudimentary form of BASIC. In FORTRAN, it's a rule of the language that no identifier can be more than six characters long. Also, many programs directly reflect the mathematical notation in which the problems they solve are specified; since mathematicians use single letters as variables, FORTRAN programmers tend to do so as well. In some dialects of BASIC, including the one I first learned, the interpreter was capable of recognizing an identifier only if it was either a single letter or a single letter followed by a digit. Such restrictive programming languages produced a generation of programmers with the bad habit of using single-letter identifiers for everything.
The rule that I go by is that every identifier should be self-explanatory unless it is used only within a few lines of its point of definition. This means that I tend to use long identifiers in global definitions and declarations but sometimes use short ones locally. I also avoid abbreviations and acronyms; too often they mystify the reader.
However, I generally take points off for this error of style only if it makes it more difficult for me to understand a student's paper. Usually this happens only in combination with other stylistic errors (insufficient documentation, over-long procedures, irregular indentation, etc.).
After every occurrence of end, should I indicate in braces
what is ending, thus:
end {for loop}
It's up to you. A lot of programmers seem to find comments of this kind helpful, especially when control structures are deeply nested. I don't object to them, but I don't find them particularly helpful either, because I'm very careful and regular about indentation and can almost always match up the beginning and end of a compound statement by observing which lines are indented to the same column.
Do you have any general guidelines for comments in source code we turn in? (My high school CS teacher had a very standard format that she demanded we follow, so I was wondering if you had a similar policy.) My main concern is that I might be overdocumenting my source code.
It is practically impossible to overdocument source code, though it is possible to document it too mechanically, without giving the comments enough thought. (The usual symptom of this is comments that are redundant, repeating information that is immediately obvious from the code itself rather than explaining the purpose of a variable or a procedure or the rationale for a programming choice.) Quality is more important than volume.
I have two habits that I recommend to students: (1) I write a long opening comment at the beginning of each program, in which I describe the problem that the program is supposed to solve and the general approach to a solution that is used in the program. (2) I also write a one- or two-line comment on every definition and declaration in the program, explaining what the identifier that is being defined is for. As a result, I seldom have to include comments in the executable part of a program, procedure, or function.
For instance, I've finished my solution to exercise 1. It begins with a comment of forty lines, and after this opening comment about 42% of the subsequent lines are inside comments.
When does it become appropriate to use HP Pascal's Assert
function to enforce preconditions?
Please use it as soon as you have learned its syntax and the meanings of its parameters, which are discussed in chapter 9 of the HP Pascal / HP-UX reference manual.
My question is about the Assert function and the exception
codes declared in the sequence module. (Or any of the modules in your
handouts, for that matter.) You declare an ExceptionException
constant, but the only place this enters into the code is to check to see
if the exception code generated is valid. If you only call the
Assert function with valid codes, why is it necessary to check
the incoming exception code? Is it just a design principle? Would you
ever expect to crash out of the program with this exception code? What
could cause the program to crash with this error?
In principle, it's not necessary to check the incoming exception code.
Since the SequenceExceptionHandler procedure is not exported,
one can see all the places at which it can possibly be invoked just by
reading through the implement section of the module in which
it is defined. In each of these places, it is invoked with a valid
exception code. This means that the program cannot possibly generate the
ExceptionException error and that, in principle, it is
pointless to provide for such an error.
The problem with this principle is that it is too fragile; it wouldn't take
much of a change in the module to break it, and it depends on the
``voluntary cooperation'' of a lot of separate procedures and functions
that may in the future be rewritten, by me, by you, or by someone who
borrows the code from the WWW site. Providing the
ExceptionException is a kind of defensive programming --
overdesigning the software to compensate for the fact that very frequently
during the execution of real-world programs things that ``can't happen''
happen.
I'm having trouble writing Assert calls in my own programs.
From a defensive-programming standpoint, would you prefer to see an
Assert call in the following procedure block, or not?
procedure InsertIntoTree(var Tree : TreePtr;
                         NewInfo : DataType);
  procedure InitEntry(var Ptr : PtrType;
                      Data : DataType);
  begin
    Ptr^.Info:=Data;
    Ptr^.Left:=NIL;
    Ptr^.Right:=NIL;
  end;
begin
  if Empty(Tree) then begin
    new(Tree);
    InitEntry(Tree, NewInfo);
  end
  else
    if LessThan(Tree^.Info,NewInfo) then
      InsertIntoTree(Tree^.Left, NewInfo)
    else InsertIntoTree(Tree^.Right, NewInfo);
end;
Since InitEntry is a local procedure to
InsertIntoTree, it can only be called from
InsertIntoTree, which properly checks the implicit
precondition that Ptr is not NIL. Is there a
need for an Assert statement? On one hand, dereferencing a
pointer should always be checked by an Assert procedure, but
if InitEntry is only called when its preconditions have
already been checked and cannot be called from another procedure, is there
really a purpose in providing an Assert call, other than
perhaps somebody who doesn't understand the library will come and revise
it, and not properly check the precondition before invoking
InitEntry? Would you recommend checking preconditions anyway,
just in case I were to go and add to the library later on and forget to be
careful?
No, in this case I don't see the need for a call to Assert. I
might feel slightly differently if the call to InitEntry were
located farther away from the definition of that procedure, but in this case
it's easy to detect the relationship between the two procedures.
However, your question illustrates a curious phenomenon in programming:
When you're not sure whether an assertion is needed, or (in other cases)
when you try to write an assertion and find that it's very cumbersome, it's
often a sign that you're trying to do the wrong thing -- usually, that
you're trying to modularize the program incorrectly. In the particular
case you cite, you can avoid the whole problem by moving the call to
New into the InitEntry procedure where it
belongs:
procedure InsertIntoTree (var Tree: TreePtr; NewInfo: DataType);
  procedure InitEntry (var Ptr: PtrType; Data: DataType);
  begin
    New (Ptr);
    Ptr^.Info := Data;
    Ptr^.Left := nil;
    Ptr^.Right := nil
  end;
begin
  if Empty (Tree) then
    InitEntry (Tree, NewInfo)
  else if LessThan (Tree^.Info, NewInfo) then
    InsertIntoTree (Tree^.Left, NewInfo)
  else
    InsertIntoTree (Tree^.Right, NewInfo)
end;
In the code for Walker's procedure to print the information for a
baseball player, he declares Ind as a variable parameter.
Since the procedure is printing (and thus not changing) information, what
is the rationale for declaring Ind as a variable
parameter?
It makes the mechanism for the procedure call more efficient. When a large data structure is passed by value, it must be copied into the storage allocated for the parameter; this copying process takes an amount of time proportional to the size of the structure. When the same data structure is passed by reference, only its address is actually copied into the storage set aside for the procedure. Since addresses are small and have a fixed size, this mechanism is faster.
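The two mechanisms can be compared side by side in a small sketch. PlayerRecord and its fields here are hypothetical stand-ins, not Walker's actual declarations.

```pascal
program PassingDemo (Output);
{ PlayerRecord and its fields are hypothetical stand-ins. }
const
  MaxSeasons = 25;
type
  PlayerRecord = record
    Number: Integer;
    HitsBySeason: array [1 .. MaxSeasons] of Integer
  end;
var
  Player: PlayerRecord;
  Season: Integer;

{ Value parameter: the whole record is copied on every call. }
procedure PrintByValue (Ind: PlayerRecord);
begin
  WriteLn ('Player ', Ind.Number : 1, ': ', Ind.HitsBySeason[1] : 1, ' hits')
end;

{ Variable parameter: only the record's address is passed. Nothing in the
  heading prevents this procedure from changing Ind, so the caller has to
  rely on the body's leaving it alone. }
procedure PrintByReference (var Ind: PlayerRecord);
begin
  WriteLn ('Player ', Ind.Number : 1, ': ', Ind.HitsBySeason[1] : 1, ' hits')
end;

begin
  Player.Number := 7;
  for Season := 1 to MaxSeasons do
    Player.HitsBySeason[Season] := 100 + Season;
  PrintByValue (Player);
  PrintByReference (Player)
end.
```

Both calls print the same line; the difference lies only in how much data is moved at the point of call.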
I am not too clear on the distinction between an error and a violation. Page 1 of Standard Pascal makes them sound like one and the same, and page 100 draws a distinction between them.
There are two kinds of violations of the rules of standard Pascal: those that can be detected by inspecting the text of the alleged program, without trying to execute any of it, and those that can only be detected by executing part or all of the alleged program. Violations of the latter kind are errors; violations of the first kind don't have a separate name, but when, on page 100, Cooper contrasts errors with ``violations,'' he means violations of the first kind.
A standard Pascal processor is required to detect and report violations of the first kind. It is not required to detect all errors if its documentation lists the classes of errors that it does not detect.
Could you go over the difference between implementation-defined and implementation-dependent?
Sure. First, here's what the standard says:
3.3. Implementation-Defined. Possibly differing between processors, but defined for any particular processor.
3.4. Implementation-Dependent. Possibly differing between processors and not necessarily defined for any particular processor.
For example, HP Pascal provides both MaxInt and
MinInt; these are pre-defined constants, the greatest and the
least values of the Integer data type. MaxInt is
implementation-defined: Every standard Pascal processor must recognize
this identifier as a constant, but it may denote different values in
different Pascal systems. MinInt is implementation-dependent:
Some standard Pascal processors will not pre-define it, and those that do
may give it different values.
Are we allowed to use the "non-ANSI-Pascal" features of pc for our programs? In other words, can we invoke pc without the -A option which comes defined with pgo in the standard MathLAN account?
Yes. For example, the Pascal standard says that it's the job of the
operating system to attach files for input and output to the relevant
Pascal file variables. Consequently, the Reset and
Rewrite procedures in standard Pascal take only one argument.
But the designers of HP Pascal instead left it up to the programmer to
attach such files, and hence provided two-argument versions of these
procedures:
Reset (Source, '/users/spelvin/frogs.dat');
Rewrite (Target, 'frogs.out');
In this case, I encourage you to use the two-argument forms even though they are non-standard. It is practically impossible to ensure the portability of calls to
Reset and Rewrite anyway,
since there is so much variety in the mechanisms that are used to attach
files to Pascal file variables.
Is it possible, using the HP compiler, to nest comments?
No. Nesting of comments is contrary to the standard (see Cooper, p. 7). Some implementations of Pascal allow the nesting of comments, but this is a mistake.
The following test program can be used to determine whether a given implementation of Pascal allows nesting of comments:
program Comments (Output);
const
{ { }
Nestable = False;
First = '} Nestable = True; {';
Second = '} Third = '''{' { };
begin
WriteLn ('It is ', Nestable, ' that comments are nestable.')
end.
Under HP Pascal, this program produces the output
It is FALSE that comments are nestable.
I'm a little confused about program parameters. I know that
Input and Output are necessary to read from the
keyboard and write to the screen. It seems that any other parameters are
somewhat superfluous, since you have to redefine them in the VAR section of
the program block anyway.
In the original implementation of Pascal, the parameters in the program header were supposed to correspond to command-line arguments. This requirement was abandoned before the language was standardized, since Pascal was implemented under some operating systems that either had no notion of a command-line argument or did not provide easy access to them, but many implementations of Pascal continue to make some special use of the program parameters, so simply deleting them from the language would invalidate a lot of Pascal code.
Would it be possible to invoke a program within the program, like a procedure within the procedure, thus allowing us to make programs themselves recursive?
The only way to do this, within Pascal's syntax, is to move the body of
your main program into a Control procedure, replacing it with
a single-statement program body that is a call to this procedure, and then
to invoke Control recursively when you want to ``invoke the
program.'' There is no way to re-invoke the main program body from within
a procedure or function definition in Pascal.
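A minimal sketch of this arrangement follows. The Depth parameter and the constant Limit are hypothetical; they are there only so that the recursion in the example terminates.

```pascal
program Recursive (Output);
{ A sketch: Depth and Limit are made up so that the recursion stops. }
const
  Limit = 3;

{ Control holds what used to be the body of the main program. }
procedure Control (Depth: Integer);
begin
  WriteLn ('Run number ', Depth : 1, ' of the former main program');
  if Depth < Limit then
    Control (Depth + 1)            { ``invoke the program'' again }
end;

begin
  Control (1)                      { the entire program body }
end.
```

In a real program the condition for re-invoking Control would come from the logic of the program itself rather than from a counter.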
Is there a random number generator in HP Pascal? I couldn't find any mention of it in the LaserROM manual.
No, there isn't. Here's a canned one that I can recommend:
var
  RandomSeed: Integer;
    { the current value in the sequence of integers produced by the
      generator; this variable must be initialized so as to provide a
      starting point from which to develop values }
{ The Randomize procedure sets the value of RandomSeed to an initial
  value that depends on its parameter. }
procedure Randomize (Offering: Integer);
begin
  RandomSeed := 1 + Offering mod MaxInt
end;
{ The Random function uses the linear-congruential method to generate a
pseudo-random value in the range (0.0, 1.0]. The particular generator
mentioned here is proposed as a standard by Stephen K. Park and Keith
W. Miller, in ``Random number generators: Good ones are hard to find,''
COMMUNICATIONS OF THE ACM 31, 1192--1201. To avoid integer overflow, the
modulus is separated into two parts, Quotient and Remainder, such that
Modulus = Quotient * Multiplier + Remainder, and similarly RandomSeed is
separated into a high segment and a low segment such that RandomSeed =
Quotient * HighSegment + LowSegment. The new value to be computed is
then
Multiplier * RandomSeed mod Modulus
= (Multiplier * Quotient * HighSegment + Multiplier * LowSegment)
mod (Multiplier * Quotient + Remainder)
= (Multiplier * Quotient * HighSegment + Remainder * HighSegment
+ Multiplier * LowSegment - Remainder * HighSegment)
mod (Multiplier * Quotient + Remainder)
= (Multiplier * LowSegment - Remainder * HighSegment)
mod (Multiplier * Quotient + Remainder)
which can be computed without overflow, since the highest possible value
of Multiplier * LowSegment - Remainder * HighSegment, even assuming a
HighSegment value of 0, is less than Multiplier * Quotient, which is less
than Modulus, which is equal to MaxInt, and the lowest possible value of
the same expression, even assuming a LowSegment value of zero, is
-Remainder * HighSegment, which is greater than -Multiplier * Quotient,
etc.
The article cited above recommends that 16807 be used as the value of
Multiplier, and consequently 127773 as the value of Quotient and 2836 as
the value of Remainder. In a subsequent note (in ``Technical
Correspondence,'' COMMUNICATIONS OF THE ACM 36, number 7, 108--110), Park
et al. suggest the multiplier 48271 instead. This change has been made
in the code below. }
function Random: Real;
const
  Multiplier = 48271;
  Modulus = 2147483647; { = 2^31 - 1 }
  Quotient = 44488; { = Modulus div Multiplier }
  Remainder = 3399; { = Modulus mod Multiplier }
var
  HighSegment: Integer;
    { the number of times Quotient goes into RandomSeed }
  LowSegment: Integer;
    { the remainder when RandomSeed is divided by Quotient }
  Test: Integer;
    { RandomSeed * Multiplier mod Modulus, possibly ``wrapped around'' to a
      negative number that is off by Modulus }
begin
  HighSegment := RandomSeed div Quotient;
  LowSegment := RandomSeed mod Quotient;
  Test := Multiplier * LowSegment - Remainder * HighSegment;
  if 0 < Test then
    RandomSeed := Test
  else
    RandomSeed := Test + Modulus;
  Random := RandomSeed / Modulus
end;
Is there a relationship between the possible size of Random
and the size of the seed number?
Standard Pascal doesn't provide a Random function, or indeed
any kind of a random-number generator, so I'll take this as a question
about the random-number generator that Walker develops on pages 106 through
109 of the text.
The range of possible values returned by a Random function is
independent of the range of values of the seed, but the particular value
returned on any one call to Random is proportional to the
current value of the seed.
Could you explain big-O notation?
Take any two functions, f and g, that take positive integers as arguments and produce positive real numbers as values. The statement that f is of order g -- in symbols, f(n) = O(g(n)) -- means that, once the arguments are sufficiently large, the ratio between the values produced by f and those produced by g is bounded by a constant, so that in effect f grows no more rapidly than some fixed multiple of g.
Formally, f(n) = O(g(n)) is defined to mean that there is a positive integer m and a positive real c such that, for every integer n greater than or equal to m, f(n) <= cg(n).
In the context of the classification of algorithms, the function g that describes the order is some simple function like n^2 or lg n, and the function f is intended to characterize the running time of the algorithm as a function of the size of its input.
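To make the formal definition concrete, here is a small sketch that checks it for one hypothetical pair of functions, f(n) = 3n^2 + 10n and g(n) = n^2, with the witnesses c = 4 and m = 10. The functions, the witnesses, and the upper limit of the test are all my own choices for illustration. A finite test is of course not a proof; the proof is the observation that 3n^2 + 10n <= 4n^2 exactly when 10n <= n^2, that is, when n is at least 10.

```pascal
program BigOCheck (Output);

const
  C = 4;      { the constant multiple }
  M = 10;     { the point beyond which the bound must hold }
  Limit = 50; { how far to test; kept small so that 16-bit integers suffice }

var
  N: Integer;
  Holds: Boolean;

function F (N: Integer): Integer;
begin
  F := 3 * N * N + 10 * N
end;

function G (N: Integer): Integer;
begin
  G := N * N
end;

begin
  Holds := True;
  for N := M to Limit do
    if C * G (N) < F (N) then
      Holds := False;
  if Holds then
    WriteLn ('f(n) <= c * g(n) for every tested n from m upwards')
  else
    WriteLn ('The bound fails somewhere in the tested range')
end.
```

Choosing a smaller m would not work with this c: at n = 9, f(n) is 333 and c * g(n) is only 324.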
Is there a good way to evaluate average-case efficiencies of algorithms?
Sometimes. There is no universally applicable method for analyzing algorithms, any more than there is a single method for proving mathematical theorems. Often it is easier to analyze the worst-case efficiency of an algorithm than to analyze its average-case efficiency; it may even be difficult to decide how to take an average for, say, a sorting algorithm: Should one assume that every initial permutation of the elements of an array is equally likely, or should one try to weight the average in favor of cases that might arise more frequently in practice?
In the bignum handout, you say that the division algorithm is proved correct. Is there a formalized method of proving the correctness of algorithms? It seems like something which mathematical inductive reasoning could apply to quite well. And if there is such a way, is it just too complicated to apply to windowing systems?
Yes, a lot of work has been done on methods for constructing formal proofs of correctness. Two good introductory books are A method of programming, by Edsger W. Dijkstra and W. H. J. Feijen (Reading, Massachusetts: Addison-Wesley Publishing Company, 1988), and The science of programming, by David Gries (New York: Springer-Verlag, 1981).
There are five reasons why correctness proofs are not routinely constructed for windowing systems:
No, but you can redirect the error messages to a file, and then use an
editor or a pager to inspect the file. To tell pc to store the
error messages in a file called errors, overwriting the previous
contents of that file (if any), add the redirection clause >&!
errors to the end of the command, thus:
pc -o myprogram myprogram.p >&! errors
Be sure to leave a space after the exclamation point.
What is a good way to track down run time errors on the HP's? My program compiles but won't run.
There's an interactive debugger named xdb. Here's how it works:
You use xdb when you have a program that is syntactically correct (the compiler can succeed in translating it) but semantically incorrect (it gives the wrong answers, at least part of the time). To prepare your program for use with xdb, you must compile it with the -g option:
pc -o frogs -g +N frogs.p
The -g option directs the compiler to provide the ``hooks'' that xdb requires.
Subsequently, you can start up xdb to debug an executable called frogs by typing
xdb frogs
in an hpterm window.
When you activate xdb, it takes over your hpterm window and divides it into two subwindows, separated by a status line. The subwindow below the status line records your interactions with xdb; the one above the line displays part of the source code for the program that you're working on. The status line itself tells what file contains the part of the source code that is being displayed, what line of that file contains the statement that will be executed next, and what function that statement belongs to.
The text in the display subwindow cannot be edited; it passively exhibits the source code from which the executable was compiled. If you want to debug interactively, making changes in your code, recompiling, and reloading the debugger, you'll need at least two windows: Emacs to do the editing and recompiling, and xdb-in-hpterm to do the debugging. (To compile from within Emacs, click on the meshing-gears icon, third from the right on the toolbar, and type in the command that performs the compilation.)
You issue commands to xdb at the prompt, a greater-than sign, in the interaction subwindow. By default, output from the program appears in the same subwindow, which is sometimes confusing. To arrange for the program's output to be done in a different window, proceed as follows: Start up a different hpterm window and type tty at the hpterm prompt; you'll see a file name, something like /dev/tty/ttyp8. This is the operating system's way of identifying the window as a source of input and a receiver for output. Start xdb with the -o and -e options, specifying this file name after each one:
xdb -o /dev/tty/ttyp8 -e /dev/tty/ttyp8 frogs
This will direct xdb to connect frogs's Output and standard error facilities to the specified window. (You could even connect them to different windows if you prefer.)
There is also an -i option that can be used to connect stdin to a different window, but it's less satisfactory because it doesn't arrange for the input to be echoed -- you won't be able to see what you're typing.
Starting the program. The r command starts executing the program at the beginning. Execution continues until the program crashes or a breakpoint (see below) or the end of the program is reached. Unless you're trying to determine the point at which your program is crashing, you generally want to set some breakpoints before you type r. You can pass command-line arguments to your program by typing them after the r; subsequent uses of r provide the same command-line arguments automatically, unless you override them with new ones.
Executing the program one statement at a time. The s command executes exactly one statement in the program, the one on the line that is marked (with a greater-than sign) in the display subwindow. The S command does the same thing, except that it treats a function call as a single step, while s breaks the function down into its component statements. If you supply an integer after either s or S, the specified number of statements is executed as a group.
Viewing a different section of the source code. The v command adjusts the display so that a specified line of code is visible. An unsigned integer after v requests the line at that position in the currently displayed file. You can ask for a different file by typing the file name after the v, and for a particular line of that other file by attaching a colon and the line number after the file name. To move the display up or down by a specified number of lines, use the + command (down) or the - command (up). In either case, write the number of lines after the sign. The V command returns the display to the line that will be executed next.
Setting a breakpoint. To stop program execution at a specific point in the program, use the b command: b followed by a line number (or a file name, a colon, and a line number) marks that line as a breakpoint, so that execution stops every time that line is reached. The command bb sets a breakpoint at the beginning of the currently executing function. The command bp sets a breakpoint at the beginning of every function. The command lb displays a list of the breakpoints that have been set. The command db removes all those breakpoints; if followed by an integer, it removes only the breakpoint with that serial number (as shown by lb).
Continuing from a breakpoint. To resume program execution after a breakpoint, use the c command. (Alternatively, you could start over again with r, or advance cautiously a step at a time with s.)
Displaying the value of a variable. The p command prints out the value of a variable; type the variable after the p. The variable can be a simple identifier, an array reference, a structure or a field of a structure, or a dereferenced pointer expression. After you have displayed the value of an array element, the command p+ displays the next element of the array, and p- displays the previous element.
Changing the value of a variable. To modify the value of a variable, use the p command, but type an equal sign and the new value after the variable.
Exhibiting the run-time stack. The t command shows the names and parameter values of functions that have been invoked but have not yet exited. The T command shows the same information, along with the values of all local variables in each of those functions.
Exiting from xdb. The q command shuts down xdb in an orderly way.
Could you please explain what exactly a bus error is?
A bus is a communication pathway connecting two or more devices, such as the central processor and the memory of a computer. A bus error occurs when a program tries to use the bus without meeting the preconditions for its correct use. For example, the central processor recovers a value from a storage location in memory by sending its address to the memory unit by means of a bus. If it generates a bogus address -- one that does not denote any location in the actual memory -- it's possible for a bus error to occur. One common cause of such an error is dereferencing a pointer variable to which no value has ever been assigned; the random bit pattern in the pointer variable is then treated as an address, and such an address is very likely to be bogus.
Why is it that so often changing the order of seemingly unrelated statements will completely change the workings of a program?
Usually, the explanation is that each of the statements has side effects, and the effects of whichever statement comes first change the conditions under which the subsequent ones are executed.
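A small, made-up illustration: the two routines below look unrelated, but both change the global variable Counter, so the order in which they are called determines its final value. All of the names are hypothetical.

```pascal
program OrderMatters (Output);

var
  Counter: Integer; { global state that both routines touch }
  First, Second: Integer;

{ NextTicket has a side effect: besides returning a value, it changes
  Counter. }
function NextTicket: Integer;
begin
  Counter := Counter + 1;
  NextTicket := Counter
end;

procedure DoubleCounter;
begin
  Counter := Counter * 2
end;

begin
  Counter := 1;
  First := NextTicket; { Counter becomes 2 }
  DoubleCounter;       { Counter becomes 4 }
  WriteLn ('Ticket ', First: 1, ', then double: Counter = ', Counter: 1);

  Counter := 1;
  DoubleCounter;        { Counter becomes 2 }
  Second := NextTicket; { Counter becomes 3 }
  WriteLn ('Double, then ticket ', Second: 1, ': Counter = ', Counter: 1)
end.
```

The first ordering leaves Counter at 4 and the second leaves it at 3, even though neither statement mentions the other.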
I didn't understand the 'read' statement that was listed with Boolean functions. Can you, for instance, say 'IF READ (X) THEN ...'? If so, when would READ (X) return a false value? Could you please give me an example of this usage of 'read'?
In Pascal, you would implement the read operation for Booleans as a procedure with the header
procedure ReadBoolean (var Source: Text; var Legend: Boolean; var Success: Boolean);
and you would invoke it in some such context as this:
Reset (Source, RosterFileName);
VoterNumber := 0;
while not EOF (Source) do begin
  VoterNumber := VoterNumber + 1;
  ReadString (Source, Roster[VoterNumber].Name);
  ReadBoolean (Source, Roster[VoterNumber].Registered, Success);
  if not Success then begin
    WriteLn ('Error in line ', VoterNumber: 1, ' of source file:');
    WriteLn ('Incorrectly formatted value in Registered field')
  end
  else { ... }
The error messages might appear if the character following the voter's name
in the roster file was neither T for true nor F
for false.
The handout on characters says that by adding nine leading zeros to an ASCII code, one gets the equivalent Unicode character. Does this only cover the alphabet, or does it include the entire ASCII table?
It includes the entire 128-character ASCII set.
Does each Unicode script have its own punctuation marks or is there one common set which covers punctuation for all scripts?
Not all scripts use punctuation. Among those that do, a few have simply adopted the punctuation marks used in the Latin script, but most have their own punctuation marks.
It was mentioned that some manufacturers use 8-bit ASCII. Is there a standard for it?
There are various competing versions of eight-bit ASCII, some of which are arguably standard, but none has been accepted as widely as the seven-bit ASCII set.
Does ASCII have any way for dealing with accented letters, such as one would find in most European languages? If not, is there a European equivalent to ASCII or does each language pretty much have to create its own standard?
Various organizations and computer manufacturers have proposed extensions to ASCII in which the eighth bit in each byte is used to make it possible to represent 128 additional characters. Unfortunately, these schemes are not consistent either from language to language, or from country to country, or from manufacturer to manufacturer.
You mentioned that there is no standard for an 8-bit ASCII code. Is ISO-Latin-1 an attempt at a standard, or a stopgap, or just a commonly used option? Does this have anything to do with the general decay of standards recently (e.g. many companies introducing proprietary HTML tags)?
ISO-Latin-1 is a standard (``ISO'' is the International Organization for Standardization), just one that hasn't been accepted as widely as ASCII.
I don't perceive a ``general decay of standards'' -- it's always been difficult to get people to accept and observe them, even when it is demonstrably to everyone's advantage. In computing, at least, standards that have failed to achieve widespread acceptance are far more common than success stories like seven-bit ASCII. Most computer professionals share the cynical attitude expressed in Andrew Tanenbaum's dictum ``The nice thing about standards is that there are so many of them to choose from.''
Does Chr (0) have a standard glyph, or do we always refer
to it as Chr (0)?
There's no glyph for it in the Latin script. Some display devices show it as ^@, reflecting the fact that on many keyboards, including the HP's, you can generate it by pressing <Control/@> (that is, <Control/Shift/2>). Unicode calls it null, and the conventional ASCII abbreviation for it is NUL. I usually call it ``the null character.'' It's a control character, but on many devices it is implemented as having the interesting effect of doing absolutely nothing.
I don't doubt that Unicode is necessary for inclusiveness of the world's many languages, but do you think that it will be difficult to find what you want quickly with such a large set of characters? It seems a bit unwieldy to me.
The theory is that special-purpose editors or editing modes will be developed that facilitate the creation and processing of files of Unicode characters for users of particular scripts. You'll fire up your Tamil editor when you want to write in some language that uses Tamil, put Tamil keycaps on your keyboard, and proceed to type. The editor will look up and store the appropriate Unicode characters in the file that it creates.
What is the purpose of all of those little functions in the assigned handout? What do they demonstrate?
They show how to define names for the non-graphic ASCII characters in a
portable way. For instance, if you want to refer to the ASCII
escape character, the expression Escape is a lot
clearer than Chr (27); but you can't write
const
Escape = Chr (27);
in standard Pascal, because function calls are not allowed in constant
definitions. You could make Escape a variable of type
Char, but then you have to remember to initialize it before it
is used and never to change it. Writing a zero-argument function with the
name Escape allows you to write things like
if Source^ = Escape then Get (Source)
just as if Escape were a constant, but without violating the
Pascal standard.
What is a negative acknowledge character used for?
It might be used in a system for sending data over a noisy line. After transmitting some fixed number of bytes, the sender inserts a checksum; the receiver also generates a checksum from the data as received and compares it with the sender's checksum. If they match, the receiver sends back the acknowledge character, and the sender proceeds to the next group of bytes; if they don't match, the receiver sends negative-acknowledge, and the sender repeats the same group of bytes.
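The receiver's side of that exchange might be sketched as follows. The block size, the routine names, and the checksum itself (the sum of the bytes modulo 256) are assumptions that I have made for the sake of a short example; real protocols generally use stronger checks. The values 6 and 21 are the ASCII codes of acknowledge and negative-acknowledge.

```pascal
program AckOrNak (Output);

const
  BlockSize = 4;            { bytes per group (hypothetical) }
  Acknowledge = 6;          { ordinal of ASCII acknowledge }
  NegativeAcknowledge = 21; { ordinal of ASCII negative-acknowledge }

type
  Block = array [1 .. BlockSize] of Integer; { byte values, 0 .. 255 }

var
  Sent, Received: Block;
  Position: Integer;

{ A deliberately simple checksum: the sum of the bytes, modulo 256. }
function Checksum (var Data: Block): Integer;
var
  Slot, Sum: Integer;
begin
  Sum := 0;
  for Slot := 1 to BlockSize do
    Sum := (Sum + Data[Slot]) mod 256;
  Checksum := Sum
end;

{ The receiver's reply, given the data as received and the checksum that
  the sender transmitted. }
function Reply (var Data: Block; SenderChecksum: Integer): Integer;
begin
  if Checksum (Data) = SenderChecksum then
    Reply := Acknowledge
  else
    Reply := NegativeAcknowledge
end;

begin
  for Position := 1 to BlockSize do begin
    Sent[Position] := 60 + Position;
    Received[Position] := Sent[Position]
  end;
  WriteLn ('Clean line: reply ', Reply (Received, Checksum (Sent)): 1);
  Received[2] := Received[2] + 1; { simulate noise corrupting one byte }
  WriteLn ('Noisy line: reply ', Reply (Received, Checksum (Sent)): 1)
end.
```

On seeing the second reply, the sender would transmit the same group of bytes again.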
On the HPs, does the <Enter> key return two nonprintable characters, like form-feed and carriage-return? I remember you saying different machines do different things when the <Enter> key is pressed.
Strictly speaking, the HP keyboard doesn't generate ASCII characters at all; the signals produced by pressing and releasing the keys are in a completely different code at the level of hardware. (For example, the <Enter> key on the typewriter-layout part of the keyboard and the <Enter> key on the numeric keypad have different hardware keycodes.) The software that mediates the keyboard's interactions with a running Pascal program converts hardware keycodes into ASCII characters.
The HP operating system uses the ASCII line-feed character as a line terminator when text files are stored. In displaying text on screen, the HP terminal emulator places the two-character sequence carriage-return line-feed at the end of each line -- the carriage-return to move the cursor to the left edge of the display and the line-feed to drop it down a line. Still other characters are generated if the text already displayed has to be scrolled upwards in order to make room for the new line.
How can I generate an ASCII form-feed character from the keyboard? How can I insert one into a text file?
From the keyboard, press <Control-L>. In XEmacs, press
<Control-Q> <Control-L>. You may want to write a small
Pascal program just to generate appropriate test data; to have Pascal write
a form-feed to a text file, use Page (DataFile) or
Write (DataFile, Chr (12)).
In the handout you say that Pascal can accommodate EBCDIC, which has gaps
between consecutive alphabetical letters. When you have a case such as
this, what happens to the Succ and Pred
functions? Do they just produce a garbage character?
Exactly. On an EBCDIC machine, for instance, Pred ('J') is
'}' (the right brace or right-curly-bracket).
Why does EBCDIC have gaps between letters? This seems silly to me.
Because the character codes were chosen to make the translation between the pattern of punches representing a character on a punched card and the bit pattern stored in memory as simple, straightforward, and efficient as possible.
A punched card of the era in which EBCDIC was designed could be punched in any of 960 positions, arranged in a rectangle of eighty columns and twelve rows. The rows were conventionally numbered (from top to bottom) as 12, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9; usually the row numbers 0 through 9 were displayed at every punch position across the card, but rows 12 and 11 were left unprinted.
To store a character in one column of a punched card, one would punch out a particular combination of rectangular holes in that column. Digit-zero through digit-nine were represented by single punches in the correspondingly numbered row. Capital letters were represented by pairs of punches, one in row 12, 11, or 0 (at the top of the card) and the other in one of the rows 1 through 9. (Choosing rows that were far apart made it less likely that the card would be torn or otherwise damaged between punches in adjacent rows.) Capital-letter-a was 12-1 -- that is, holes were made in rows 12 and 1 of one column on the card to represent this letter. Capital-letter-b was 12-2, capital-letter-c was 12-3, and so on to capital-letter-i, which was 12-9. Then capital-letter-j was 11-1, capital-letter-k was 11-2, and so on to capital-letter-r, 11-9. Finally, capital-letter-s was 0-2 (avoiding 0-1 because they were adjacent rows), capital-letter-t was 0-3, and so on to capital-letter-z, 0-9.
When these punches were translated into EBCDIC character codes, the
designers decided to have the last four bits of the EBCDIC character
indicate which of the rows 1 through 9 was punched out, with
0001 for a punch in row 1, 0010 for a punch in
row 2, 0011 for a punch in row 3, and so on up to
1001 for a punch in row 9. The first four bits would be set
to 1100 for a punch in row 12, to 1101 (less
logically) for a punch in row 11, and to 1110 for a punch in
row 0. So the EBCDIC code for capital-letter-i is
11001001 -- 1100 for the 12-punch,
1001 for the 9-punch.
However, this leaves a numerical gap between capital-letter-i
(11001001) and capital-letter-j
(11010001), and another between capital-letter-r
(11011001) and capital-letter-s
(11100010). The designers of EBCDIC had a mildly plausible
scheme for filling in these positions with characters that had more exotic
punch combinations. For instance, right-curly-bracket was punched
(on some IBM card punches, anyway) as 11-0, so it was natural to put it
right before capital-letter-j (11-1).
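The translation from punches to character codes described above is simple enough to sketch in a few lines. The function name and its parameters are hypothetical, and the sketch covers only the capital letters.

```pascal
program PunchToEBCDIC (Output);

{ Returns the EBCDIC code for a capital letter punched as ZoneRow-DigitRow.
  ZoneRow must be 12, 11, or 0, and DigitRow must be in the range 1 .. 9. }
function LetterCode (ZoneRow, DigitRow: Integer): Integer;
var
  HighBits: Integer;
begin
  case ZoneRow of
    12: HighBits := 12; { 1100 }
    11: HighBits := 13; { 1101 }
    0:  HighBits := 14  { 1110 }
  end;
  LetterCode := HighBits * 16 + DigitRow { the row fills the last four bits }
end;

begin
  WriteLn ('I (12-9): ', LetterCode (12, 9): 1);
  WriteLn ('J (11-1): ', LetterCode (11, 1): 1);
  WriteLn ('R (11-9): ', LetterCode (11, 9): 1);
  WriteLn ('S (0-2): ', LetterCode (0, 2): 1)
end.
```

The codes come out as 201, 209, 217, and 226, so the positions 202 through 208 and 218 through 225 are the gaps in question.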
This is probably more than you wanted to know. The real point is that the designers had a sensible motive that became obsolete when punched cards ceased to be a viable input medium.
I'm not quite clear on what the difference is between the letters being arranged in traditional alphabetical order and the letters being adjacent. How can they be in alphabetical order if they have other characters thrown in between?
In EBCDIC, the letters within either case (capitals or lower case) are in alphabetical order with respect to one another -- that is, if one letter precedes another alphabetically, then that letter precedes the other in EBCDIC as well. This is enough to guarantee that lexicographic sorting will work, provided that it involves only letters (of the same case).
If you wish to convert an ASCII character into a 7-bit binary representation, what is the most efficient method in Pascal? It seems to me, since Pascal is so isolated from the hardware and the operating system, that one would have to use a case statement:
case ch of: 'A' : write "1000001"; 'B' : write "1000010"; ...
Is this the most efficient way to accomplish this? And is it necessary to add the parity bit onto the representation -- and if so, is the parity bit bit 0 or bit 7?
The parity bit is bit 7. Whether it's necessary to display the parity bit depends on what you're trying to achieve -- showing ASCII code equivalents or exhibiting the contents of memory.
The better way to write out the bitwise representation of an ASCII character is to exploit the fact that it can be deduced from the character's ordinal value:
procedure WriteCharAsBits (Ch: Char);
var
  OrdinalValue: Integer;
  Bits: packed array [0 .. 6] of Char;
  BitNumber: Integer;
begin
  OrdinalValue := Ord (Ch);
  for BitNumber := 0 to 6 do begin
    if Odd (OrdinalValue) then
      Bits[BitNumber] := '1'
    else
      Bits[BitNumber] := '0';
    OrdinalValue := OrdinalValue div 2
  end;
  for BitNumber := 6 downto 0 do
    Write (Bits[BitNumber] : 1)
end;
Is it possible for me to read in whatever is entered from the keyboard as
a Char? If it is, then it seems like I can always avoid a
crash caused by entering data in a type which is different from what it
should be, by reading it in as characters and then transforming it into
the type we need (or sending out an error message, if it's not in the right
form). Is that right?
Yes, it is. The only reason for using Read and
ReadLn to read in values of any type other than
Char is that the transformation is sometimes rather difficult.
Recovering a value of type Real from its string representation
is especially tricky and requires a lot of thought about all the cases that
can arise.
What are the 32-bit binary representations of ASCII characters?
When a full thirty-two-bit word is set aside for an ASCII character -- for example, when it is loaded into a register -- usually bits 31 through 7 of the word are cleared (that is, turned off, made zero) and the seven bits of the ASCII character proper, as described in the handout on characters, are stored in bits 6 through 0.
I am curious about how EOLn and EOF are stored
in relation to one another in a text file that is ended with a carriage
return in comparison to one that is not.
The byte-by-byte format of a text file varies from one computer and
operating system to another; Pascal provides EOLn and
EOF functions precisely to impose a layer of abstraction
between the programmer and these differing implementations.
I'll give three examples of text file formats: the one the HPs use, the one the academic VAX uses, and the one that is used on IBM PCs and clones.
On the HPs, a text file is stored as an unstructured sequence of ASCII characters, and the ASCII line-feed is used as a line terminator. In this system, it is possible for a text file to end in the middle of a line: this simply means that the last character in the text file is something other than line-feed. There is no special character to signal the end of the file; instead, the operating system keeps track of the exact number of bytes stored in each file and refuses requests for additional characters after that number of bytes has been released.
On the VAX, a text file is stored as a sequence of ``line records,'' each beginning with a four-byte integer that indicates how many characters of text appear on the line. The text characters are then lined up after this initial count. No character acts as a line terminator or as a file terminator. It is impossible for a file to end in the middle of a line.
On PCs, a text file is stored as a sequence of ASCII characters, with the two-character sequence carriage-return, line-feed as a line terminator. When a text file is created, the bytes between the end of the file and the end of the storage block on the disk are all conventionally filled with the ASCII substitute character, and many text-file applications treat substitute as an end-of-file signal. In that case, a text file can in principle end in the middle of a line, if the last two characters preceding substitute are not carriage-return and line-feed.
Is there a way to peek at the next character of standard input? This
feature seems like it would have made the last assignment much easier to
code. A classmate told me that Input^ accomplishes this task,
but he always encountered problems in his program when he tried to
utilize this feature. If you can peek at standard input, what are the
limitations/perils of doing that?
Your classmate is correct; Input^ returns the next character
from standard input, without actually advancing past it. (A subsequent
call to Read will still pick up that character.)
There are two main limitations of this method. (1) If input is arriving
from the keyboard, the evaluation of the expression Input^ may
``hang'' the program until the user actually types in the character that
has to be inspected in order for the expression's value to be determined.
(2) You cannot detect the end of the file by checking to see whether the
value of Input^ is the end-of-file character. There
is no end-of-file character in ASCII. (In an interactive Pascal
program under HP-UX, the user can signal the end of interactive input by
pressing <Control/D> at the beginning of a line, but there is
absolutely no way for the program to detect this character -- it is
stripped out before the input is submitted to the program. If input is
redirected so that it comes from a file, <Control/D> is not involved
at all; instead, HP-UX just keeps track of the number of characters in each
file and refuses to give the program a character when they have all been
read. So, in particular, <Control/D> is not an end-of-file
character.)
Do you think enumerated types are more trouble than they are worth?
No, I like enumerated types and believe that they prevent more trouble than they cause. The alternative, which is used in languages that have neither enumerated types nor symbol types, is to define a lot of numerical constants, like this:
const
  Juggler = 0;
  HighPriestess = 1;
  Empress = 2;
  { ... }
  Universe = 21;
type
  MajorArcanum = Juggler .. Universe;
This is a little cumbersome, but the real problem with it is that one can
then do arithmetic on the values of the data type, and it's too easy to
start making mistakes by performing arithmetic operations that make no
sense.
It is a lot of trouble to write input and output procedures for enumerated types, if one uses symbolic names for them, but one would have exactly the same trouble with the input and output of symbolic names if one used integer constants instead.
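For comparison, here is a sketch of the enumerated-type version, cut down to the first three cards to keep it short, together with a hand-written output procedure of the kind just mentioned. The procedure name WriteArcanum is hypothetical.

```pascal
program ArcanaDemo (Output);

type
  MajorArcanum = (Juggler, HighPriestess, Empress);
    { only the first three cards, to keep the sketch short }

var
  Card: MajorArcanum;

{ Standard Pascal cannot write an enumerated value directly, so the
  symbolic name has to be produced by hand. }
procedure WriteArcanum (Card: MajorArcanum);
begin
  case Card of
    Juggler: Write ('Juggler');
    HighPriestess: Write ('HighPriestess');
    Empress: Write ('Empress')
  end
end;

begin
  for Card := Juggler to Empress do begin
    WriteArcanum (Card);
    WriteLn (' has ordinal ', Ord (Card): 1)
  end
  { Standard Pascal rejects an expression like Card + 1, because arithmetic
    is not defined on enumerated values; Succ (Card) is the sanctioned way
    to move to the next card. }
end.
```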
Is there an easier way to change the character '1' to the integer 1 than:
Ord('1') - 48 { 48 = Ord ('0') }
or, say, to change the string '12345' to the number 12345?
No. But if you write a procedure to do this, it will come in very handy in a lot of the Pascal programs you write. Start building a library of useful procedures and functions that can be reused with little or no change in many programs.
What kind of process does Pascal use to translate a series of digits into an integer value?
It examines the digit characters one by one, from left to right. It
recovers the value of each digit character as indicated in the previous
question, by applying Ord to the digit character and
subtracting Ord ('0') from the result. It combines these
individual digit values by multiplying each one by the appropriate power of
ten. The basic loop looks roughly like this:
Value := 0;
while not AtEndOfNumeral do begin
  Ch := NextCharacterOfNumeral;
  Value := Value * 10 + (Ord (Ch) - Ord ('0'))
end;
That is: Adding one digit to the end of an integer involves multiplying its
previous value by ten and adding the digit.
The actual Read procedure is more complicated than this,
because it has to deal with leading spaces, the possibility of a sign
before the numeral, the possibility that the numeral will be greater than
MaxInt, and so on.
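To give an idea of what those complications look like, here is a sketch of a conversion procedure that works on a numeral already held in a fixed-width buffer. It is not the code of any actual Read procedure; the buffer width, the names, and the decision to report failure through a Success parameter are all assumptions of mine. It accepts leading spaces, an optional sign, and trailing spaces, and it refuses any numeral that would exceed MaxInt.

```pascal
program NumeralDemo (Output);

const
  Width = 12; { length of the fixed-size numeral buffer (hypothetical) }

type
  Numeral = packed array [1 .. Width] of Char;

var
  Buffer: Numeral;
  Value: Integer;
  Success: Boolean;

procedure Convert (Source: Numeral; var Value: Integer; var Success: Boolean);
type
  Phase = (BeforeNumeral, AfterSign, InDigits, AfterNumeral);
var
  Position, Digit: Integer;
  Negative, Valid: Boolean;
  Current: Phase;
  Ch: Char;
begin
  Value := 0;
  Negative := False;
  Valid := True;
  Current := BeforeNumeral;
  for Position := 1 to Width do begin
    Ch := Source[Position];
    if ('0' <= Ch) and (Ch <= '9') then begin
      if Current = AfterNumeral then
        Valid := False { digits resumed after a trailing space }
      else begin
        Current := InDigits;
        Digit := Ord (Ch) - Ord ('0');
        if Value <= (MaxInt - Digit) div 10 then
          Value := Value * 10 + Digit
        else
          Valid := False { the numeral would exceed MaxInt }
      end
    end
    else if Ch = ' ' then begin
      if Current = InDigits then
        Current := AfterNumeral
      else if Current = AfterSign then
        Valid := False { a sign must be followed at once by a digit }
    end
    else if ((Ch = '+') or (Ch = '-')) and (Current = BeforeNumeral) then begin
      Negative := Ch = '-';
      Current := AfterSign
    end
    else
      Valid := False { a character that cannot occur in a numeral }
  end;
  Success := Valid and ((Current = InDigits) or (Current = AfterNumeral));
  if Success and Negative then
    Value := -Value
end;

begin
  Buffer := '      -12345';
  Convert (Buffer, Value, Success);
  if Success then
    WriteLn ('Converted: ', Value: 1)
  else
    WriteLn ('Not a valid numeral');
  Buffer := '999999999999';
  Convert (Buffer, Value, Success);
  if Success then
    WriteLn ('Converted: ', Value: 1)
  else
    WriteLn ('Not a valid numeral')
end.
```

The overflow test is made before the multiplication, by comparing Value with (MaxInt - Digit) div 10, so that the procedure never computes a value that is too large to store.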
In the handout on numeration, how do the Evaluate and
Express procedures work?
Evaluate works by accumulating the numeric value of the part
of the numeral that it has so far inspected, processing each digit by
multiplying the previous value of the accumulator by the base of numeration
and adding in the numeric value of the new digit.
Express works by recovering the last digit of a given integer,
expressing it as a character to be placed at the end of the numeral, and
then using recursion to deal with an appropriately reduced integer that
contains all but the last digit of the original.
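The following sketch shows the shape of both routines. It is not the code from the handout: the fixed base, the fixed-width numeral, and the decision to have Express write its digits directly are simplifications of mine, and only non-negative integers are handled.

```pascal
program Numeration (Output);

const
  Base = 2;  { base of numeration; any value from 2 to 10 would work }
  Width = 6; { number of digits in the sample numeral }

type
  Numeral = packed array [1 .. Width] of Char;

var
  Sample: Numeral;

{ Accumulates the value of the numeral from left to right. }
function Evaluate (Source: Numeral): Integer;
var
  Position, Accumulator: Integer;
begin
  Accumulator := 0;
  for Position := 1 to Width do
    Accumulator := Accumulator * Base + (Ord (Source[Position]) - Ord ('0'));
  Evaluate := Accumulator
end;

{ Writes the numeral for a non-negative integer, using recursion to write
  everything but the last digit first. }
procedure Express (Number: Integer);
begin
  if Base <= Number then
    Express (Number div Base); { all but the last digit }
  Write (Chr (Ord ('0') + Number mod Base)) { the last digit }
end;

begin
  Sample := '101101';
  WriteLn (Evaluate (Sample): 1);
  Express (45);
  WriteLn
end.
```

Since the binary numeral 101101 denotes forty-five, the program writes 45 and then 101101.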
What are some of the practical applications of understanding machine data types, for programmers?
If you understand the machine representations of the data types, you know their limitations and can program around them. Here are two examples of what can go wrong if you don't, from Peter G. Neumann's book Computer-related risks (Reading, Massachusetts: Addison-Wesley Publishing Company, 1995), pages 34 and 169:
During the Persian Gulf war, the Patriot [anti-missile defense] system was initially touted as highly successful. In subsequent analyses, the estimates of its effectiveness were seriously downgraded, from about 95 percent to about 13 percent (or possibly less, according to MIT's Ted Postol; see SEN [Software Engineering Notes] 17, 2). The system had been designed to work under a much less stringent environment than that in which it was actually used in the war. The clock drift over a 100-hour period (which resulted in a tracking error of 678 meters) was blamed for the Patriot missing the [S]cud missile that hit an American military barracks in Dhahran, killing 29 and injuring 97. ... A later report stated that the software used two different and unequal versions of the number 0.1 -- in 24-bit and 48-bit representations (SEN 18, 1, 25). (To illustrate the discrepancy, the decimal number 0.1 has as an endlessly repeated binary representation 0.0001100110011.... Thus two different representations truncated at different lengths are not identical -- even in their floating-point representations.) ...
In this section, we consider accidental financial mishaps ... One of the most dramatic examples was the $32 billion overdraft experienced by the Bank of New York (BoNY) as the result of the overflow of a 16-bit counter that went unchecked. (Most of the other counters were 32 bits wide.) BoNY was unable to process the incoming credits from security transfers, while the New York Federal Reserve automatically debited BoNY's cash account. BoNY had to borrow $24 billion to cover itself for 1 day (until the software was fixed), the interest on which was about $5 million. Many customers were also affected by the delayed transaction completions (SEN 11, 1, 3-7).
Learning about machine representations makes it less likely that you will be the unfortunate programmer responsible for mistakes like these.
Could you explain how to multiply binary numbers and give an example?
Sure. The multiplication table is very easy, of course: 0 * 0 = 0, 0 * 1 = 0, 1 * 0 = 0, 1 * 1 = 1. If you're calculating on paper, you can set out the work just as if you were working in decimal numeration. For instance, to multiply 42 by 23, you could write
    101010
x    10111
----------
    101010
   101010
  101010
      0
101010
----------
1111000110
The only hard part is to get the carries right when adding up a long series
of partial products.
A computer can do much the same thing, except that it's likely to keep a running total of the partial products instead of adding them all up at the end. Here's how the algorithms might look if they were written out in Pascal:
const
  WordSize = 32;
  WordSizeMinusOne = 31;

type
  Bit = 0 .. 1;
  { Bit 0 is the rightmost (least significant) bit of a word. }
  Word = packed array [0 .. WordSizeMinusOne] of Bit;

{ Turn every bit of W off. }
procedure MakeZero (var W: Word);
  var
    BitNumber: Integer;
begin
  for BitNumber := 0 to WordSizeMinusOne do
    W[BitNumber] := 0
end;

{ Add column by column, from right to left.  A carry out of the
  leftmost column is discarded. }
procedure Add (Augend, Addend: Word; var Sum: Word);
  var
    BitNumber: Integer;
    Carry: Bit;
    ColumnSum: Integer;
begin
  Carry := 0;
  for BitNumber := 0 to WordSizeMinusOne do begin
    ColumnSum := Augend[BitNumber] + Addend[BitNumber] + Carry;
    Sum[BitNumber] := ColumnSum mod 2;
    Carry := ColumnSum div 2
  end
end;

{ Move every bit one place to the left (that is, multiply by two);
  the leftmost bit is lost. }
procedure ShiftLeft (var W: Word);
  var
    BitNumber: Integer;
begin
  for BitNumber := WordSizeMinusOne downto 1 do
    W[BitNumber] := W[BitNumber - 1];
  W[0] := 0
end;

{ Keep a running total of the partial products.  Multiplicand is a
  value parameter, so shifting it here does not affect the caller. }
procedure Multiply (Multiplicand, Multiplier: Word; var Result: Word);
  var
    BitNumber: Integer;
begin
  MakeZero (Result);
  for BitNumber := 0 to WordSizeMinusOne do begin
    if Multiplier[BitNumber] = 1 then
      Add (Result, Multiplicand, Result);
    ShiftLeft (Multiplicand)
  end
end;
Actual processors use a lot of short cuts to speed up the process -- in
particular, they do various parts of the computation in parallel instead of
sequentially.
Why is division the most time-consuming operation?
Because almost none of the steps can be done in parallel; the computation of each bit of the quotient depends critically on the outcome of the computation of the previous bit.
For the Negative and Positive procedures in
the integers handout, what happens if the integer is zero?
Both procedures return False in that case.
In the Zero function, you defined the result as comparing
the argument to 0.0. This seems a bit silly to me -- it seems that
round-off errors could easily make a result that's supposed to be zero
return a value of false. Wouldn't a construction like Zero :=
Operand < Tolerance be more appropriate, where Tolerance is an
appropriately defined (small) constant?
That's not a bad suggestion. It would be still better to make
Tolerance an input to the function, and to allow for rounding
errors in either direction:
near-zero
Input: operand and tolerance, both real
numbers.
Output: result, a Boolean.
Preconditions: tolerance is not negative.
Postcondition: result is true if operand
differs from 0.0 by an amount less than or equal to
tolerance (in either direction), false if it differs by
more.
function NearZero (Operand: Real; Tolerance: Real): Boolean;
begin
  { Assert (0.0 <= Tolerance); }
  NearZero := (Abs (Operand) <= Tolerance)
end;
A similar NearEqual function, for determining whether two real
values are equal, to within a specified tolerance, would also be useful.
Whether NearZero and NearEqual should
replace Zero and = as implementations of
the zero and equal operations, or whether it would be better
to add them as additional primitive operations, is not so clear. I think
that programmers would find it frustrating to discover that some values
were both ``positive'' and ``zero,'' while others were both ``negative''
and ``zero,'' so I guess I'd favor the latter alternative.
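A NearEqual function along the same lines might look roughly like this. The parameter names and the tolerance used in the demonstration are my own choices for the sketch, and whether the exact comparison in the main program succeeds depends on how the implementation represents Real values.

```pascal
program NearEqualDemo (Output);

var
  Sum: Real;
  Count: Integer;

{ True if First and Second differ by no more than Tolerance,
  in either direction.  Assumes Tolerance is not negative. }
function NearEqual (First, Second, Tolerance: Real): Boolean;
begin
  { Assert (0.0 <= Tolerance); }
  NearEqual := (Abs (First - Second) <= Tolerance)
end;

begin
  { Adding 0.1 ten times need not give exactly 1.0, because 0.1
    has no exact binary representation. }
  Sum := 0.0;
  for Count := 1 to 10 do
    Sum := Sum + 0.1;
  WriteLn ('Exactly equal to 1.0: ', Sum = 1.0);
  WriteLn ('Equal within 1.0e-5:  ', NearEqual (Sum, 1.0, 1.0e-5))
end.
```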
Some of the functions described in the reading don't seem that useful --
the Zero function, for example. It would be just as clear to
write
if A = 0 ...
as to write
if Zero (A) ...
so it seems to me that the function takes up unnecessary space without adding much clarity. Is there some other reason for this sort of function?
If the compiler recognizes the function, it may be able to generate more
efficient code for the special case that the function deals with than for
the expression it replaces. In the case of Zero, for
instance, many processors have a special instruction that determines
whether all the switches in a given register are off; a compiler can direct
such a processor to place the argument A in a register and
then execute the all-switches-off test. This may be more efficient than
placing A in one register, zero in another, and testing
whether the results are equal.
However, current Pascal compilers will actually generate more efficient
code for the equality test than for a call to Zero, so you're
probably right in thinking that the function is a little pointless. It's
actually there just for pedagogical reasons: I want people to think about
the abstraction first and the implementation second, rather than trying to
guess prematurely (while designing the data type) what the target machine
will do.
In the integers handout, what does Modulo do? How does this
differ from the mod operation?
Modulo extends the mod operation. It is an error
for the second operand of mod to be negative; if the second
argument to Modulo is negative, it still returns a value
between zero (inclusive) and the modulus (exclusive).
What exactly does the modulo operation do?
It determines which residue class the moduland is a member of. A residue class for a given modulus is a set of integers, all differing from one another by multiples of the modulus. The set of integers can be exhaustively partitioned into a number of residue classes equal to the absolute value of the modulus; for instance, if the modulus is 3, the residue classes are {..., -10, -7, -4, -1, 2, 5, 8, ...}, {..., -9, -6, -3, 0, 3, 6, 9, ...}, and {..., -8, -5, -2, 1, 4, 7, 10, ...}.
Each residue class contains exactly one integer in the range from zero (inclusive) to the modulus (exclusive), which uniquely identifies the residue class and is the value returned by the modulo operation.
Note that it is negative only if the modulus is negative and does not evenly divide the moduland.
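One way to get this behavior is sketched below. It is an illustration written for this answer, not necessarily the handout's definition. It assumes that the modulus is not zero, and it allows for implementations whose mod operator yields a negative result for a negative first operand.

```pascal
program ModuloDemo (Output);

{ Returns the member of Moduland's residue class that lies between
  zero (inclusive) and Modulus (exclusive).  Assumes Modulus <> 0. }
function Modulo (Moduland, Modulus: Integer): Integer;
  var
    Residue: Integer;
begin
  Residue := Moduland mod Abs (Modulus);
  { Some implementations give a negative result here when Moduland
    is negative; bring it into the range 0 .. Abs (Modulus) - 1. }
  if Residue < 0 then
    Residue := Residue + Abs (Modulus);
  { For a negative modulus, the representative is negative or zero. }
  if (Modulus < 0) and (Residue <> 0) then
    Residue := Residue + Modulus;
  Modulo := Residue
end;

begin
  WriteLn (Modulo (-7, 3) : 1);    { writes 2 }
  WriteLn (Modulo (7, -3) : 1);    { writes -2 }
  WriteLn (Modulo (6, -3) : 1)     { writes 0 }
end.
```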
Why doesn't the specification for the integer module include a
DeallocateInt procedure?
It's not in the description of the abstract data type because it's an
operation on storage, not on integers. It's not listed on the assignment
sheet because perhaps not everyone will want to define Int as
a pointer type. However, you should certainly add it if you do use a
pointer type.
Which method do the HPs use to store integers?
Twos-complement representations in thirty-two bits.
In the 2's complement method of storing integers, how does the computer distinguish between positive and negative? It looks like any byte could be either a small positive integer or a large negative integer, or vice versa.
The leftmost bit of the representation indicates the sign. If it is zero (off), then the number represented is either positive or zero; if the sign bit is one (on), then the number represented is negative.
I was reading a book on microprocessors of the late seventies, and it explained twos-complement encoding as such: Ones-complement is the result of switching all the bits in sign-magnitude encoding, and twos-complement is the result of taking ones-complement and adding one (doing any necessary carries). I don't remember it being this simple - is it, or is the author oversimplifying something he doesn't think is important?
Well -- the statement of it that you've cited here is a little confused. I'll try to clear it up.
Signed-magnitude, ones-complement, and twos-complement representations are
three different systems for storing integer values that may be positive,
zero, or negative. All three of them use identical bit patterns for every
positive integer (up to MaxInt, which is the same in all three
systems).
However, they store negative numbers differently, and one can explain the difference concisely by describing the way the negative (or additive inverse) of an integer is computed. In the signed-magnitude system, toggle the sign bit. In the ones-complement system, toggle every bit. In the twos-complement system, toggle every bit and then add one, carrying as necessary -- or, equivalently, toggle every bit to the left of the rightmost 1-bit.
When I explained this in class, I skipped over ones-complement representations, because they are now very rarely encountered and in my opinion just get in the way of the explanation. So it may have sounded as if finding the negative of a twos-complement integer was more complicated than ``taking ones-complement and adding one.'' But it's not complicated.
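The second formulation (toggle every bit to the left of the rightmost 1-bit) can be sketched with the same kind of bit array used in the multiplication example. The Negate and WriteWord procedures are invented for this illustration, and the word is only eight bits wide to keep the output short.

```pascal
program NegateDemo (Output);

const
  WordSizeMinusOne = 7;    { eight-bit words, just for the demonstration }

type
  Bit = 0 .. 1;
  Word = packed array [0 .. WordSizeMinusOne] of Bit;

var
  W: Word;
  BitNumber: Integer;

{ Replace W with its twos-complement negative by toggling every bit
  to the left of the rightmost 1-bit. }
procedure Negate (var W: Word);
  var
    BitNumber: Integer;
    SeenOne: Boolean;
begin
  SeenOne := False;
  for BitNumber := 0 to WordSizeMinusOne do
    if SeenOne then
      W[BitNumber] := 1 - W[BitNumber]
    else if W[BitNumber] = 1 then
      SeenOne := True
end;

{ Write the bits of W, leftmost (most significant) first. }
procedure WriteWord (W: Word);
  var
    BitNumber: Integer;
begin
  for BitNumber := WordSizeMinusOne downto 0 do
    Write (W[BitNumber] : 1);
  WriteLn
end;

begin
  for BitNumber := 0 to WordSizeMinusOne do
    W[BitNumber] := 0;
  W[1] := 1;
  W[2] := 1;
  WriteWord (W);    { writes 00000110, the representation of 6 }
  Negate (W);
  WriteWord (W)     { writes 11111010, the representation of -6 }
end.
```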
Also, this book discusses doing math on 32-bit representations of real numbers using eight-bit busses and registers. This seems to me so difficult as to not be worth doing. If doing math with real numbers was so important, why not make the tradeoff and use a bigger, more expensive, 32-bit chip?
Because if you had put a thirty-two-bit processor in an Apple II in 1983, you would have increased its cost by a factor of a hundred.
Why is it that VAXen allocate storage with the half-words ``backwards''? What advantages does this lead to?
The main advantage was that data storage on the VAX was more nearly compatible with data storage on earlier Digital machines. This is why that representation was chosen.
Has the method of representing integers changed much in the last ten years? Is it likely to change in the next ten?
Twos-complement representations were very common -- clearly in the majority -- ten years ago, are overwhelmingly common now, and will be even more common ten years from now. By then it will be difficult to find a functioning computer that does not use twos-complement representations.
However, the number of bits in a word changes more frequently. Ten years ago, if I recall correctly, every computer at the College used either a sixteen-bit word or an eight-bit word. Today, machines with thirty-two-bit words are commonplace. Ten years from now, I expect sixty-four-bit words to be standard.
On page 600 of the Walker textbook, he discusses word length and its effect on integer ranges. I understand the restrictions on the range for 16 bit machines, i.e. -32768 to 32767 or something similar, but I am confused by the next paragraph, when he states that a 32-bit machine can encode about 5 x 10^9 numbers. If I raise 2 to the 32nd power, I only get 4,294,967,296, which is not 5 x 10^9. Also, if one of those bits is a positive/negative flag, then the integer range would be even more limited. I do understand that he is not necessarily talking about integer ranges, but rather storage locations ... but I am wondering where 5 x 10^9 came from.
Your figure is correct; Walker's was a rough estimate. (It came from reflecting that 2^10 is a little more than 10^3, so 2^30 should be a fair amount more than 10^9, so 2^32, or 4 * 2^30, should be a fair amount more than 4 * 10^9; Walker made a guess about the ``fair amount'' that turned out to be a little high.)
Using one of the bits to indicate the sign does not reduce the range of
representable values of type Integer, provided that none of
the integers that are marked as ``negative'' is equal to any of the
integers marked as ``positive.'' For instance, in the twos-complement
representation used on the HPs, there are 2147483648 negative values and
2147483648 ``positive'' ones (including zero); since they are all
different, the total number of values represented is still 4294967296.
In class on Monday you mentioned a datatype that can store an infinite amount of integers. How?
By allocating storage for an integer dynamically, using pointers to link together enough blocks of fixed size to hold all of the digits of the integer.
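Here is a small sketch of the idea. To keep it short, it assumes that each dynamically allocated block holds a single decimal digit (a real implementation would pack many digits into each block), and the type and procedure names are invented for the example. It computes 2^100, which is far greater than MaxInt, by doubling 1 a hundred times.

```pascal
program BigIntegerDemo (Output);

type
  DigitPointer = ^DigitNode;
  DigitNode = record
    Digit: 0 .. 9;
    Next: DigitPointer    { the next more significant digit, or nil }
  end;

var
  Number: DigitPointer;
  Count: Integer;

{ Double the integer whose least significant digit is in N^,
  allocating a new block if the result needs one more digit.
  Assumes N is not nil. }
procedure DoubleValue (N: DigitPointer);
  var
    Carry, Total: Integer;
    Last: DigitPointer;
begin
  Carry := 0;
  Last := nil;
  while N <> nil do begin
    Total := 2 * N^.Digit + Carry;
    N^.Digit := Total mod 10;
    Carry := Total div 10;
    Last := N;
    N := N^.Next
  end;
  if Carry > 0 then begin
    New (Last^.Next);
    Last^.Next^.Digit := Carry;
    Last^.Next^.Next := nil
  end
end;

{ Write the digits, most significant first. }
procedure WriteNumber (N: DigitPointer);
begin
  if N <> nil then begin
    WriteNumber (N^.Next);
    Write (N^.Digit : 1)
  end
end;

begin
  New (Number);
  Number^.Digit := 1;
  Number^.Next := nil;
  for Count := 1 to 100 do
    DoubleValue (Number);
  WriteNumber (Number);    { writes 1267650600228229401496703205376 }
  WriteLn
end.
```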
Are there physically little switches inside the computer turning on and off?
Yes, but the switches don't have moving parts like light switches. Modern computers are completely electronic, which means among other things that the two states of each switch are distinguished not by the physical positions of its components but by some electrical characteristic (low and high capacitance, for instance). Some early computers, however, were electro-mechanical and used mechanical switches operated by relays.
Why are bits in a byte or word numbered from right to left rather than left to right?
Because when a sequence of bits is used to represent a natural number, each bit position corresponds to a power of two (just as in decimal numeration each digit position corresponds to a power of ten). Bit 0 corresponds to 2^0, or 1 (it's the ``units place''), bit 1 to 2^1, or 2 (the ``twos place''), bit 2 to 2^2, or 4 (the ``fours place''), and so on. The bit number matches the exponent.
Walker talks about data being stored in separate bytes or words because computers often have an easier time dealing with whole bytes or words. The trade-off is wasted space. With memory being so cheap, is this a general trend in computing?
Yes. Packed structures are used much less now than when I began teaching computer science in 1983. Programmers now generally think of arranging the fields of a record to conserve storage as an optimization that one performs only on programs that are memory-intensive; formerly it was usual to take alignment problems into consideration whenever one wrote the definition for a record type.
Why is it that some machines can't easily access the individual bits as opposed to just a whole word and others can? Is overall speed greater if this accessing is made trickier? Or is this just an older design?
Partly it is a question of efficiency. If it takes the same amount of time to transfer one bit, eight bits, or thirty-two bits from memory to the processor or vice versa, why bother having a separate address for each bit or even each byte? Instead, just bring in the word containing the relevant byte and have the processor recover the part of the word that is needed.
A second consideration is the number of bits required for an address. Suppose you're designing a machine that can be equipped with 2^k bytes of internal memory. If each byte has a separate address, the addresses will themselves contain at least k bits. If each bit has a separate address, addresses will have to contain k + 3 bits apiece; if the machine is word-addressable, k - 2 bits will suffice (if a word is four bytes). This can make a difference in the complexity and cost of the circuitry.
Is there a way to procedurally determine MaxInt without
already knowing how many bits are in that computer's word?
Of course:
program FindMaxInt (Output);
begin
  WriteLn ('MaxInt = ', MaxInt : 1)
end.
What exactly am I looking at when I view a file of integer using a pager
such as more or less? (It looks like gibberish to me.)
When the file is created, each integer is copied to the file exactly as it exists in memory, as a thirty-two-bit twos-complement representation. The bits are stored in the file exactly as they are in memory.
When a text-oriented tool such as a pager takes hold of this file, it tries
to deal with it as a sequence of lines, each consisting of ASCII characters
and terminated by a line break -- on our systems, the
line-feed character, Chr (10). It picks up bits
from the file in groups of eight and identifies the ASCII character in each
one. If the ASCII character happens to be a graphic, it displays that
character; if it is a control character, the window performs the
appropriate control operation -- e.g., 00000111 causes a beep,
00001010 causes the cursor to move to the beginning of the
next line, and so on. The result is gibberish, because the bit
patterns that were stored have nothing to do with ASCII characters.
Does the hexadecimal system of numeration serve us any useful purpose? Will we ever need to use this in programming, or is this just an example to help us understand what the computer goes through in processing numbers?
You'll need it. For example, you'll eventually be learning to use debugging tools that can display the values of pointers; conventionally, they are written out in hexadecimal numerals (because the bit pattern of an address enables the programmer to determine whether the storage accessed through the pointer is aligned on a word boundary). Also, programmers occasionally want to read assembly-language listings of their programs, to find out what exactly a particular machine is doing in a frequently executed loop; base-16 numeration is heavily used in such listings.
Are you allowed to enter a number in scientific notation if it is of the
type Real in Pascal, or do you have to set up some procedure
to convert the notation first?
The Read and ReadLn procedures can cope with
scientific notation when reading in a value for a Real
variable -- one more reason why one would prefer to use those built-in
procedures instead of defining one's own.
Can you define a long real in Pascal like you can in C?
You can in HP Pascal; the LongReal type is equivalent to the
double type in C, the Real type to
float. Of course, LongReal is non-standard.
In implementations of Pascal that provide only one floating-point type, it
is usually equivalent to C's double rather than C's
float.
Would a program using fractions of inches be more accurate than a program using metric measurements, since metric is decimal and inches are commonly expressed in fractions of base two (1/16, 1/2, etc.)?
Yes, it would, if it never performed any operation that resulted in a fraction in which the denominator was not a power of two.
Is there a real number equivalent to MaxInt? What would
happen, for example, if I tried to read in a real number whose exponent
component would require more than eight bits of storage?
The nearest analogue of MaxInt is the greatest number that has
an exact IEEE single-precision representation, 2^128 - 2^104; let's call
this number BigReal. If the HP Pascal Read
procedure encounters a numeral for a value slightly larger than
BigReal, it will round the value down to BigReal;
if it encounters a numeral for a value much greater than
BigReal, it will ``round it up'' to the IEEE positive
infinity. Other implementations of Pascal may crash when trying to read in
the numeral for an outsized value.
Is real number storage standardized, or are there many ways of doing this, too?
Unfortunately, there are even more ways of representing real numbers than of representing integers. There is a standard, or rather a small family of closely related standards, that the Institute of Electrical and Electronics Engineers labored over for years. They have had some success in getting computer manufacturers to adopt this standard, but it still isn't as popular as ASCII for characters.
The Java programming language, which is rapidly increasing in popularity, requires that real numbers be represented according to the IEEE standard; if the hardware for a machine that runs Java programs does not represent reals in this way, it is supposed to provide a software simulation of IEEE reals when running Java programs. Possibly this will lead to greater acceptance of the IEEE standard in the next generation of machine designs.
How does the machine store a real number into memory? Does it just round it to a certain number of decimals and then store each digit of the number as an integer, or is there something more complicated that it does?
The story is more complicated; the handout on IEEE representations of reals gives the full account, but I can sum it up briefly by saying that a variant of scientific notation is used: A real number is represented as a coefficient times a power of two, a * 2^b. Part of the storage allocated for a real value is used for the coefficient, a, and the rest for the exponent b. Neither a nor b is stored in exactly the way an integer value is, though, because it turns out to simplify the circuitry that performs arithmetic operations on real numbers if slightly different conventions are used.
After reading Appendix B.2 in Walker's book, I am still unclear about how exponents in real numbers are stored -- in particular, how the machine differentiates between a negative and a positive exponent.
The numeration system that is used for storing exponents is a
``biased-magnitude'' system. Suppose, for example, that we're storing an
HP Pascal Real in the IEEE single-precision format. The
exponent for such a value can be any integer in the range from -126
to 127. A fixed ``bias'' of 127 is added to the exponent,
and then the eight-bit binary numeral for the result is actually stored.
So, for instance, the exponent 5 would be stored as
10000100, which is the eight-bit binary numeral for 132
-- the true exponent, 5, plus the bias, 127.
The bias in this case happens to have been chosen in such a way that the
leftmost bit of the representation of the exponent is 1 if the
exponent is positive, 0 if it is zero or negative, so you
could use that bit to determine the sign of the exponent. In practice, it
is almost never important to know the sign of the exponent without knowing
its actual value.
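The arithmetic is easy to check with a short program. This sketch only illustrates how the biased exponent field is formed; the procedure name is invented, and it assumes the single-precision bias of 127 and an eight-bit field, as described above.

```pascal
program BiasDemo (Output);

const
  Bias = 127;          { IEEE single-precision exponent bias }
  FieldWidth = 8;      { number of bits in the exponent field }

{ Write the eight-bit numeral that is stored for the given exponent.
  Assumes -126 <= Exponent <= 127. }
procedure WriteBiased (Exponent: Integer);
  var
    Stored, Position, Weight: Integer;
begin
  Stored := Exponent + Bias;
  Weight := 128;       { place value of the leftmost bit }
  for Position := 1 to FieldWidth do begin
    Write ((Stored div Weight) mod 2 : 1);
    Weight := Weight div 2
  end;
  WriteLn
end;

begin
  WriteBiased (5);       { writes 10000100 }
  WriteBiased (0);       { writes 01111111 }
  WriteBiased (-126)     { writes 00000001 }
end.
```

The leftmost bit is 1 only in the first case, where the exponent is positive.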
Is the leftmost bit the least significant bit in the mantissa, or is it the mantissa's rightmost bit?
The rightmost, bit 0, is the least significant.
Why is the greatest number in exact IEEE single precision 01111111011111111111111111111111 and not 01111111111111111111111111111111?
When all the bits in the exponent field are turned on, the IEEE convention
is that no real number is represented; instead, such a bit pattern
indicates that an operation on real numbers was attempted when one of its
preconditions was not met. The second of the bit patterns shown above, for
instance, might arise as the result of the evaluation of the Pascal
expression 0.0 / 0.0.
My question is about the range of IEEE double-precision real numbers. I understand the origin of the lower limit, 2^-1074, since the exponent comes from -1022 and the 52 bits stored in the mantissa. I don't understand why the upper limit is not 2^1075.
Single-precision reals have a similar restriction on the upper limit (2^128 - 2^104). Why is it not 2^150?
If you consider only normalized representations of real values --
those that don't exploit the all-zeroes setting of the exponent field, and
so have an implicit 1. at the left of the mantissa field --
the lower bound is actually 2^-1022 for double-precision representations
and 2^-126 for single-precision ones. You can get closer to zero only by
using unnormalized representations, with all zeroes in the
exponent field and an implicit 0. at the left of the mantissa
field. There is no analogue of unnormalized representations at the high
end of the system, so you have to stop when you get to the largest
available setting of the exponent and mantissa fields -- there is no way to
shift the implicit binary point any farther to the right.
How widely accepted are the IEEE standards for real-number representation?
I'd guess that more than half of the machines that have Pascal compilers use some variation of IEEE single- or double-precision reals -- fewer if you consider Turbo Pascal to be a form of Pascal (I don't).
Why don't you consider Turbo Pascal to be a form of Pascal?
In the language of the Pascal standard, it is not ``a processor complying with the requirements of this standard'' -- it does not provide all of the standard procedures, does not recognize all of the standard syntax, and does not detect all of the violations that the standard requires it to report.
Also, the object-oriented extensions that Turbo Pascal now incorporates have fundamentally changed the computational model that it embodies. A Pascal computation is a sequence of operations performed by a processor on inert data; a Turbo Pascal computation is an interaction among active objects, each embodying a state. So successful programming in Turbo Pascal is quite different from successful programming in Pascal -- one designs objects rather than procedures, functions, and data types.
Is it very inefficient for the computer to have all of the special cases
in the IEEE real representation system? I know that plugging things like
error messages in so as not to waste the bit combinations when the
exponent is 11111111 saves bit space, but does this slow the
computer down at all? It seems like it would be easier to be able to
perform the same operations on all patterns and to be able to treat them
all as legal values. Is this slowdown at all comparable with the greater
number of combinations?
It's not obvious when you look at it for the first time, but the IEEE floating-point representations are actually quite cleverly designed so that detecting and handling these special cases hardly slows the computation down at all. If you're going to use an exponent-and-mantissa representation at all, you have to treat 0.0 as a special case anyway; the marginal cost of handling unnormalized numbers as well is small. Similarly, in order to ``perform the same operations on all patterns and ... to treat them all as legal values,'' you need to have some way of handling the results of division by 0.0; reserving an easily recognized pattern of exponent bits for them is actually the most efficient way to do it.
Why do IEEE representations of real numbers use a biased exponent? Is there some way in which this improves efficiency at the circuit level?
Yes. Since all the biased exponents are positive, you can find out whether one of them is greater than, equal to, or less than another without paying any attention to signs. Twos-complement and signed-magnitude representations do not have this property and therefore require more complicated algorithms (and hence circuits) for comparisons.
You subtracted one from the cardinality of the Real data type because -0 and 0 are the same. Why do we not do the same for the Integer data type?
Because only one of the 2^32 possible bit patterns is used to represent
0 as an integer; there is no separate bit pattern for a ``negative
zero.'' In the Real type, the distinct bit patterns
00000000000000000000000000000000 and
10000000000000000000000000000000 are both used to represent
0.0.
Is the only thing that is unusual about longreal numbers the fact that they use L instead of E? That hardly seems worthy of calling them a different style of numeration.
That's the only difference in the numeration system that is used to
represent LongReal values in HP Pascal source code. The
internal representation of such values -- the fact that the
IEEE double-precision representation rather than the IEEE single-precision
representation is used -- is more consequential: It means that the range
of LongReal values is much larger than the range of
Real values, and that most LongReal values are
stored to a precision of fifty-three significant binary digits rather than
twenty-four.
Do real-number approximation problems (e.g., round-off error) become more or less pronounced as the number of bits used to represent them increases? In other words, are 32-bit reals less susceptible to round-off error than 16-bit reals?
You're probably meaning to contrast ``double-precision'' representations of real numbers (typically, sixty-four bits) with ``single-precision'' representations (typically, thirty-two bits). I don't know of any system that represents real numbers in sixteen bits.
Rounding errors are just as frequent with double-precision real numbers as with single-precision ones, but they tend to be much smaller, in the sense that the absolute value of the difference between the correct value and the value that is actually stored is less. For instance, the exact value of the single-precision representation of 7/5 is 11744051/8388608, which is too small by 1/41943040; the double-precision representation is 6305039478318694/4503599627370496, which is too small by 1/11258999068426240. A rounding error occurs in either case, but the distortion is less extreme if double-precision representations are used.
How significant can the errors which accumulate as the result of rounding errors be, when the lower limit of a real is 2^-149 and a longreal far, far greater? In what types of applications would such precision be required?
2^-149 is the least positive representable value of type
Real, not the amount of rounding error in the representation
of a typical real. The magnitude of the rounding error is proportional to
the magnitude of the number represented; a normalized value of type
Real can differ from the number it is trying to express by as
much as 2^-24 times the magnitude of the value. So, for instance,
if you're expressing the national debt of the United States (as I write,
$5230766368737.51) as a real number of dollars, the value that is actually
stored is 5230766325760.00 -- an error of almost forty-three thousand
dollars.
In the postcondition for the round operation on real numbers, you
say that if there are two integers that differ from operand by
0.5, result is the even one. In my understanding of rounding,
one would round an operand differing from two integers by 0.5, not to the
even integer, but to the larger.
Different authorities recommend different rounding policies. Standard
Pascal's predefined Round function always rounds the halfway
values away from 0.0, which may be what you had in mind as well.
(If the operand is -2.5, is -3 or -2 the ``larger''
value? Standard Pascal rounds it to -3.) I see that I got this
exactly wrong in the handout, so I'm glad to have the opportunity to make
the correction.
The reason I recommend a round-to-even policy in the abstract data type is that I think that it is less likely to produce accumulations of rounding errors that are all in the same direction. In many applications, positive values with a fractional part of 0.5 occur frequently; rounding a long series of such values and then finding the sum can produce a very large rounding error, exaggerated by the fact that all the rounding is in the same direction (upwards). The round-to-even policy will produce rounding errors that tend to cancel each other out in such circumstances.
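A round-to-even operation could be sketched like this. It is an illustration written for this answer, not the handout's code, and it assumes that the operand is small enough in magnitude for Trunc to succeed. The test values are chosen because they have exact binary representations.

```pascal
program RoundToEvenDemo (Output);

{ Rounds X to the nearest integer; when X is exactly halfway between
  two integers, the even one is chosen. }
function RoundToEven (X: Real): Integer;
  var
    Lower: Integer;
    Fraction: Real;
begin
  { Lower is the greatest integer not exceeding X. }
  Lower := Trunc (X);
  if X < Lower then
    Lower := Lower - 1;
  Fraction := X - Lower;
  if Fraction < 0.5 then
    RoundToEven := Lower
  else if Fraction > 0.5 then
    RoundToEven := Lower + 1
  else if Odd (Lower) then
    RoundToEven := Lower + 1
  else
    RoundToEven := Lower
end;

begin
  WriteLn (RoundToEven (0.5) : 1);     { writes 0 }
  WriteLn (RoundToEven (1.5) : 1);     { writes 2 }
  WriteLn (RoundToEven (2.5) : 1);     { writes 2 }
  WriteLn (RoundToEven (-2.5) : 1);    { writes -2 }
  WriteLn (RoundToEven (2.75) : 1)     { writes 3 }
end.
```

Rounding 0.5, 1.5, 2.5, and 3.5 this way gives 0, 2, 2, and 4, whose sum is 8, the same as the sum of the original values; rounding all four upwards would give 10.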
Walker's textbook suggests that to write a binary number in scientific notation we must move the radix point as far to the left as possible, so that it is just to the left of the first 1. This is different from what you lectured on in class as the IEEE representation -- you said that they should be written as 1. something rather than 0.1 something. Is this just a peculiarity of IEEE representation?
It's more a difference in perspective. Walker's book places the notional binary point to the left of the first 1, but claims that the exponent has a bias of 128. The handout I wrote places the binary point to the right of the first 1, but claims that the exponent has a bias of 127. You get the same result in either case, but I think it's easier to add the explanation of how unnormalized numbers work if you present the numeration system as I did.
All this binary stuff we've been doing makes me wonder...is it possible to create machines that are more than bistable (tristable, quadstable, etc.)? If there were three memory states in each bit, the processing power would be much greater. Is this kind of device even remotely feasible?
Yes, but there are two problems with multistable devices: speed and reliability. Binary switches can change state very quickly, and it's also comparatively easy and fast to determine the state of a binary switch. Some mechanical and electromechanical computers of the forties and early fifties used decimal numeration internally, like old-fashioned adding machines and desk calculators.
According to Knuth, a ternary system of numeration ``was given serious consideration along with the binary system'' during the development of early electronic computers in 1945 and 1946, at the Moore School of Engineering; the somewhat greater complexity of the arithmetic circuitry is partially offset by the greater concision of the representation. (On the average, the number of ``trits'' in the ternary representation of a number is about 63% of the number of bits in its binary representation.) Knuth concludes, ``Perhaps the symmetric properties and simple arithmetic of this [balanced ternary] number system will prove to be quite important someday -- when the `flip-flop' is replaced by a `flip-flap-flop' '' (Seminumerical algorithms, volume 2 of The art of computer programming, p. 192).
What is a structured variable? The definitions of different types of variables, such as buffered variables, confused me too.
A structured variable is a variable that has other variables as components -- an array variable or a record variable. In other words, the region of a computer's memory that an array occupies can be divided up into smaller regions, one for each element of the array, and the region that a record occupies can be divided into smaller regions, one for each field of the record.
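The discussion of the several kinds of variables on pages 69-71 of the Cooper book boils down to the fact that there are only five ways to refer to storage locations in Pascal:
by writing the name of a declared variable (an entire-variable);
by writing an array variable followed by an index in brackets (an indexed-variable, one kind of component-variable);
by writing a record variable followed by a dot and a field name or, inside a with-statement, using the field name by itself (in which case it is a field-designator, the other kind of component-variable);
by writing a pointer variable followed by ^ (an identified-variable);
by writing a file variable followed by ^ (a buffer-variable).
Assuming the declarations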
type
  IntArray = array [1 .. 10] of Integer;
  Direction = (North, NorthEast, East, SouthEast, South, SouthWest,
               West, NorthWest);
  Weather = record
    Temperature: Real;
    WindSpeed: Real;
    WindDirection: Direction
  end;
  Access = ^Weather;
var
  Alpha: Integer;
  Vec: IntArray;
  Today: Weather;
  Tomorrow: Access;
  Target: Text;
and that Target has been opened for output, here is an example
of each of the five kinds of variable:
Alpha (entire-variable)
Vec[3] (indexed-variable)
Today.WindSpeed (field-designator)
Tomorrow^ (identified-variable)
Target^ (buffer-variable)
Note that these are exactly the kinds of expressions that can appear on the left-hand side of an assignment statement -- in that position, you're referring to a storage location, so you need a ``variable-access'' expression.
Note also that although a set in Pascal is a structured value, a set variable is not a structured variable -- there's no way to refer to any part of the storage location that is occupied by a set, or to adjust a single element of a set without touching the rest of it. Another way of stating the same point is to say that not all data structures correspond to storage structures.
In what kind of situation would using arrays of more than two dimensions be appropriate? I could see using a three-dimensional array to model a 3-d board game. What about four-dimensional arrays? Are there any real-world problems that use this?
The number of dimensions in an array reflects the number of independent ways of classifying the objects that the array elements count or describe. For instance, a clothing store's inventory program might keep track of the number of men's slacks on hand by updating the elements of a five-dimensional array of the type defined below:
type
  Waist = 30 .. 42;
  Inseam = 28 .. 36;
  Weight = (Light, Medium, Heavy);
  Fly = (Button, Zipper);
  Color = (Blue, Brown, Black, Gray, Green, Tan);
  SlacksInventory = array [Waist, Inseam, Weight, Fly, Color] of Integer;
Is there some sort of general rule for when it's best to use arrays, as opposed to linked lists, for data storage?
Yes. It's better to use an array when you need ``random access'' to elements of the structure -- that is, when the order in which the program will examine those elements is unpredictable -- and when you either know in advance how many elements the structure will contain or can at least fix an upper bound that is likely to be approximately correct and certain not to be exceeded. It's better to use a linked list when the elements of the structure will usually be accessed sequentially, from first to last, or when almost all the accesses will involve a few identifiable elements (which can be moved to the front of the list); and when the number of elements in the structure will not be known, even approximately, until the program is running.
Why not implement arrays in such a way that subscripts could be added to them after they are created initially? That way they could grow in the same way as pointer structures, but wouldn't have pointers flying all over.
That's a good idea, and there are several languages -- C++, Java, Common Lisp, and Icon for instance -- in which you can do exactly that. Of course, they all use dynamically allocated structures hooked together with pointers internally, but the details aren't visible to the programmer. But such a data type goes against Pascal's design philosophy of staying close enough to the real machine that the student programmer is directly aware of the mechanics of storage allocation.
The reading for today discussed memory storage of arrays. However, isn't the storage scheme different for different implementations of Pascal? More importantly, isn't it different for different computers?
Some of the details (such as alignments) are different. The general scheme of using base addresses and computed offsets is surprisingly uniform.
Why isn't there a term for ``word-aligned''? It seems useful to define that to me, just because it would be less cumbersome to say that something is word-aligned than 4-byte or 8-byte aligned, depending on machine architecture.
The phrase `word-aligned' is sometimes used. The HP documentation uses `2-byte-aligned', `4-byte-aligned', and `8-byte-aligned' so as not to mislead programmers who are trying to migrate onto HPs from machines on which the word size is different.
What's the point of calling something bit-aligned? You can't set half a bit.
True, but that just means that bit-alignment is the least restrictive of all alignment possibilities, not that it's a pointless concept. It still contrasts with byte-alignment, 2-byte-alignment, and so on. (A value in storage is bit-aligned if it can begin at any bit position within any byte; sometimes this is an important thing to know about it.)
I suppose the complaint is that there's nothing that a bit-aligned value need really be aligned with and no possibility for it to be somehow misaligned (overlapping the boundaries between bits). OK, so perhaps it's a misnomer.
What factors make a single bit in memory easier or harder to access? You gave examples about what makes single bytes easier and harder to access in class on Friday; is the answer to this just an extension of those factors?
Not quite. To recover the value of a single bit from memory, a processor will transfer the contents of the smallest independently addressable unit of memory that contains that bit into a register, then ``mask off'' (that is, set to zero) all the other bits in the word, and finally shift the surviving bit rightwards in the register until it becomes the least significant bit. The masking operation takes the same amount of time regardless of where the bit is within the register; on some machine architectures, however, the shifting operation takes longer if the bit starts out farther to the left, so that the leftmost bits are the hardest ones to access.
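Standard Pascal has no bitwise operators, but the same effect can be imitated with arithmetic on non-negative integers. The following is only a sketch (BitAt is a name invented here): halving with div stands in for a one-position right shift, and mod 2 stands in for masking off everything but the lowest bit. It shifts first and masks afterwards, the reverse of the order described above, which recovers the same bit. Notice that the number of halvings grows with the bit's distance from the right-hand end, which is the effect described above for machines with slow shifts.

```pascal
program ExtractBit (Output);
{ Recovers single bits from a non-negative integer.  Bit 0 is the least
  significant bit.  BitAt is a hypothetical helper for this example. }
var
  BitNumber: Integer;

function BitAt (Datum, Position: Integer): Integer;
var
  Shifted, Count: Integer;
begin
  Shifted := Datum;
  for Count := 1 to Position do
    Shifted := Shifted div 2;   { shift right by one bit position }
  BitAt := Shifted mod 2        { mask off all but the lowest bit }
end;

begin
  Write ('22 in binary: ');
  for BitNumber := 7 downto 0 do
    Write (BitAt (22, BitNumber) : 1);
  WriteLn
end.
```

Since 22 = 16 + 4 + 2, the program prints 00010110.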
When declaring variables in our programs, is it important to be thinking about byte alignment and about in what order we declare our variables?
It wouldn't hurt. The commonest situation in which it's really important to think about alignment is when you're writing the type definition for a large array of records; laying out the record in such a way as to minimize padding can make a big difference in the number of bytes required for the array.
How exactly are packed arrays implemented? I was reading your code to get the machine representation of a datum, and you declared a packed array of bits for the machine's unit. Does this mean that packing crams things together as close as possible, and you have to do bitwise operations on registers to recover the part of the machine's word that you want?
Different implementations of Pascal handle the packed keyword
differently. As I mentioned in class, Sun Pascal simply ignored it and
stored ``packed'' arrays in exactly the same way as any other kind of
array. HP Pascal tries to place several elements of a packed array into a
single byte, if they will all fit, but will not store any element in such a
way that it crosses a boundary between two bytes of memory; it will leave
some of the bits in a byte unused if too few bits remain in a byte to
accommodate another element. (However, HP Pascal also offers ``crunched
packing,'' in which not even one bit may be left between elements, even if
some elements have to be stored across byte boundaries.)
Array elements that have been packed more than one to a byte must indeed be extracted by means of bitwise operations before they can be operated on.
I believe I understand how normal arrays are stored in computers, but what about packed arrays? I imagine the answer is implementation-dependent, since some machines are byte-addressable and others word-addressable. Does one simply calculate the offset using the number of bits in an element rather than number of elements?
As I mentioned in class, some implementations of Pascal simply ignore the
keyword packed and store all arrays in the same way. Under HP
Pascal, however, it is possible for two or more small array elements to be
packed into the same word or even into the same byte. The compiler decides
how many elements to pack into one unit of storage (the packing
factor). When the compiler then translates a reference to an element
of the array, its computation of the offset to be added to the base address
includes an extra step; after figuring the difference between the value of
the array index and the lower bound of the index type, the compiler divides
by the packing factor, keeping both the quotient and the remainder. The
quotient is the offset; it is added to the base address to obtain the
address of the unit of storage that contains the array element. The
remainder indicates the position of the element within that storage
location; the compiler uses the remainder to figure out how, after moving
the contents of the storage location into a storage register, it should
mask and shift the bits to strip away all the irrelevant data stored along
with the desired element.
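The quotient-and-remainder step can be sketched as follows. The layout is hypothetical: four-bit elements packed two to a byte, in an array indexed from 1. These numbers are assumptions chosen for the illustration, not a description of what HP Pascal or any other compiler actually does, and the bit positions are simply counted from one end of the storage unit.

```pascal
program PackedOffset (Output);
{ Hypothetical layout: 4-bit elements packed two to a byte, in an array
  whose index type starts at LowerBound. }
const
  LowerBound = 1;
  PackingFactor = 2;      { elements per unit of storage }
  BitsPerElement = 4;
var
  Subscript, Distance, Offset, Slot: Integer;
begin
  for Subscript := 1 to 6 do begin
    Distance := Subscript - LowerBound;
    Offset := Distance div PackingFactor;   { which unit of storage }
    Slot := Distance mod PackingFactor;     { which element within it }
    WriteLn ('element ', Subscript : 1, ': unit ', Offset : 1,
             ', starting at bit ', Slot * BitsPerElement : 1)
  end
end.
```

Elements 1 and 2 share unit 0, elements 3 and 4 share unit 1, and so on; within each unit the remainder selects the first or the second group of four bits, and that position determines the mask and the shift count.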
Walker makes it seem as though every single word of storage space has an individual address, like an enormous array. Is this true? Why then don't we simply address the array in the same way the computer does, with the offset being something that we must simply know?
In some programming languages, you can do exactly that. In fact, Pascal is one of them -- a pointer in Pascal is simply an index into the memory considered as an array. This is disguised, in Pascal, by the fact that the language doesn't allow you to write out a string representation of a pointer value or to perform any arithmetic operations on it. In some other languages, such as C and Bliss, there are no such restrictions.
However, the picture of memory as an array of bytes or words, indexed by natural numbers, is not quite correct for some computers, such as IBM PCs and their clones. A memory location on the PC is identified by a combination of two indices: a sixteen-bit ``segment'' value, which identifies some contiguous group of 65536 bytes in the PC's memory, and a sixteen-bit ``offset,'' identifying one particular byte within the segment. Unfortunately, the segments are overlapping rather than mutually exclusive, so that a particular byte of memory can have many different but equivalent addresses -- byte 0 of segment 28 is the same physical collection of eight switches as byte 48 of segment 25, for instance.
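The equivalence of those two addresses follows from the way the processor combines the two indices in this addressing mode: it multiplies the segment value by 16 and adds the offset. A minimal sketch (PhysicalAddress is a name invented for the example, and the small values used here fit in any Integer; real segment values would need a wider integer type):

```pascal
program SegmentedAddress (Output);
{ Combines a segment value and an offset into a single byte address.
  PhysicalAddress is a hypothetical helper; only small values are used
  here so that the result fits in an Integer of any size. }

function PhysicalAddress (Segment, Offset: Integer): Integer;
begin
  PhysicalAddress := Segment * 16 + Offset
end;

begin
  WriteLn ('segment 28, byte 0:  ', PhysicalAddress (28, 0) : 1);
  WriteLn ('segment 25, byte 48: ', PhysicalAddress (25, 48) : 1)
end.
```

Both lines report 448, so the two segment-and-offset pairs name the same byte.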
In C, array subscripts must start with 0. I thought that this would save a lot of computing time, but in lecture you showed that arrays with subscripts starting from non-zero numbers are negligibly harder to initialize than those starting with subscripts of zero. Why does C choose to do this? It seems very inconvenient.
If the subscripts for an array start with 0, the base address of the array and its virtual origin are equal and interchangeable. This simplifies all the computations that are done with the addresses of elements of the array. In Pascal, all the address computations are performed inside the compiler and the author of the compiler is expected to deal with any complications. In C, on the other hand, it is possible to determine the address of any variable or array element and store it in a pointer variable, and also to operate arithmetically on pointer values. The designers of C decided that the extra flexibility offered by array types in which the indices started at some non-zero value was not worth the extra trouble that would be caused by making the distinction between base addresses and virtual origins visible to the programmer.
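A small calculation shows the relationship between the two addresses. The sketch below takes the virtual origin to be the address that element 0 would occupy if the array had one, and uses made-up numbers: an array [5 .. 9] of four-byte elements whose first element is stored at address 1000.

```pascal
program VirtualOrigin (Output);
{ Hypothetical numbers: an array [5 .. 9] of 4-byte elements whose first
  element is stored at address 1000. }
const
  BaseAddress = 1000;
  LowerBound = 5;
  ElementSize = 4;
var
  Origin, Subscript: Integer;
begin
  { The address that element 0 would have, if there were one. }
  Origin := BaseAddress - LowerBound * ElementSize;
  WriteLn ('virtual origin: ', Origin : 1);
  for Subscript := 5 to 9 do
    WriteLn ('A[', Subscript : 1, ']: ',
             BaseAddress + (Subscript - LowerBound) * ElementSize : 1,
             ' = ', Origin + Subscript * ElementSize : 1)
end.
```

The virtual origin here is 980, and both formulas give 1000, 1004, ..., 1016 for the five elements. When the lower bound is 0, the subtraction disappears and the virtual origin is simply the base address.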
If I have a packed array and I want to fill it with blanks, can I use the assignment
type
  X = array [1 .. 14] of Char;
var
  Y: X;
{ ... }
Y := '              '
or do I need to specify each character separately?
The only thing standing in the way of the correctness of the assignment is
that the string of spaces is a packed array of characters, while
the variable Y is just an array of characters. Put the
keyword packed into the definition of type X and
the assignment will be correct.
When I converted a string into a value of an enumerated type, I used a
long if-statement. The string is defined to be 14 characters
long. Since some of the strings I wanted to convert were not so long, I
put spaces in the rest of the positions of the array. Since strings can be
compared only if they have the same length, I also added spaces to the
string constants with which I was comparing the input strings. It looked
stupid. So instead I tried to pad the arrays with null characters
(Chr (0)), deleting the space from the string constants in the
if-statement. But this attempt failed. Is my original method
the only way to solve this problem or is there a better solution?
You can compare two strings only if they have the same length; all the characters of a string are included in this length, even null characters, so padding with null characters is no better than padding with spaces -- in fact, it only makes things more difficult, since there is no way to include a null character in a string constant.
The best way to do this in standard Pascal is to write a special comparison
function that allows strings of different lengths to be compared. The
following function returns True whenever it is given two
strings that are alike except possibly for different numbers of trailing
null characters:
function EqualAsStrings (
    LeftOperand: packed array [LeftLow .. LeftHigh: Integer] of Char;
    RightOperand: packed array [RightLow .. RightHigh: Integer] of Char):
  Boolean;
{ Assumes that LeftLow and RightLow are both 1, as they are for string
  types. }
var
  Position: Integer;
  EqualSoFar: Boolean;
  Null: Char;
begin
  Null := Chr (0);
  Position := 0;
  EqualSoFar := True;
  { Compare the two strings over the positions that they share. }
  while EqualSoFar and (Position < LeftHigh) and
        (Position < RightHigh) do begin
    Position := Position + 1;
    if LeftOperand[Position] <> RightOperand[Position] then
      EqualSoFar := False
  end;
  { Whatever is left over in the longer string must be all nulls. }
  while EqualSoFar and (Position < LeftHigh) do begin
    Position := Position + 1;
    if LeftOperand[Position] <> Null then
      EqualSoFar := False
  end;
  while EqualSoFar and (Position < RightHigh) do begin
    Position := Position + 1;
    if RightOperand[Position] <> Null then
      EqualSoFar := False
  end;
  EqualAsStrings := EqualSoFar
end;
This function uses a feature of Pascal called ``conformant array
parameters'' that you may have seen only briefly, in the Cooper book. I'll
discuss conformant array parameters in more detail a little later in the
semester.
How does one call a procedure containing conformant array parameters?
There's nothing distinctive about the syntax of the call. The argument corresponding to the parameter must be an array in which the base type matches the base type of the parameter and the index type matches the type of the index constants in the parameter.
The only restriction is that if a conformant array parameter is also a value parameter (as opposed to a variable parameter), the corresponding argument may not itself be a conformant array parameter.
How does the compiler represent conformant arrays in memory? Or does the conformant array specification apply only to parameters passed to subroutines?
Only parameters can be conformant arrays; declared variables, both global and local, must have a fixed size.
When the compiler translates an invocation of a procedure or function that has a value parameter of a conformant array type, it deduces the size of the array that is needed from the type of the corresponding argument and writes out machine instructions that allocate the necessary amount of storage for the duration of the execution of the procedure or function. Some details of the suggested implementation of this method can be found on page 94 of the Standard Pascal user reference manual.
A variable parameter of a conformant array type is handled like any other variable parameter; during the execution of the procedure or function, it is an alias for the corresponding argument. Only the address of that argument is actually passed to the procedure or function.
The worst-case scenario for the quicksort is O(n^2). Obviously, O(n^2) is bad, but is it an efficient or an inefficient O(n^2) method? (``Efficient'' and ``inefficient'' are relative; O(n^2) is something to be avoided if possible.) How would it compare to the other O(n^2) sorts such as the selection sort or the bubble sort?
The worst case of quicksort is bad even for an O(n^2) sort -- worse than selection sort (same number of comparisons, more data movements), though probably not quite as bad as bubble sort.
How does choosing a value that should be the middle speed up a quick sort?
Quicksort examines an element only once and moves it no more than once during each partitioning step on the part of the array that contains that element, so the number of partitions performed on any part of the array is the critical quantity in determining the algorithm's performance. If the pivot element is always in the middle of the range of values, so that the partitioning step always divides the array segment into two equal parts, no part of the array of size n will be partitioned more than lg n times. But if the pivot is always at one end or the other, then there will be some part of the array that always winds up in the larger partition and is therefore exposed to n partitioning steps.
When I was taught quicksort last year in CS1, we were told that one of
the things that made quicksort so quick was the small number of swaps. In
the Walker text, the example of Quicksort includes a number of calls to the
Swap procedure in the two procedures on p. 421,
CheckUp and CheckDown. Instead of
Swap which presumably takes three assignment statements each
time executed, wouldn't it be much faster (especially in a very long list)
to instead store one extra Temp variable of type
DataType, store the first element in the list in this, and
have CheckUp and CheckDown make one assignment
statement?
It might be somewhat faster. Of course, you'd also have to set up a variable of the array's index type to keep track of the ``hole'' in the array -- the position into which the next out-of-place item is to be moved.
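For the curious, here is one way the single-assignment idea could be worked out. It is only a sketch, not the code in Walker's text: PartitionWithHole and all of its variable names are invented for this example, and the elements are plain integers. The pivot is lifted out of the first position, leaving a hole; each out-of-place element is then moved into the hole with one assignment, which leaves a new hole where that element used to be. The hole is always at Left or at Right, so no separate variable is needed to track it until the end.

```pascal
program HolePartition (Output);
{ A sketch only: PartitionWithHole and its variable names are made up for
  this example and are not taken from Walker's text. }
const
  Size = 8;
type
  ElementArray = array [1 .. Size] of Integer;
var
  Data: ElementArray;
  K, Resting: Integer;

procedure PartitionWithHole (var Arr: ElementArray; Start, Finish: Integer;
                             var Hole: Integer);
var
  Pivot, Left, Right: Integer;
begin
  Pivot := Arr[Start];     { lifting the pivot out leaves a hole at Start }
  Left := Start;
  Right := Finish;
  while Left < Right do begin
    { Scan down for an element that belongs below the pivot. }
    while (Left < Right) and (Arr[Right] >= Pivot) do
      Right := Right - 1;
    Arr[Left] := Arr[Right];     { one assignment; the hole is now at Right }
    { Scan up for an element that belongs above the pivot. }
    while (Left < Right) and (Arr[Left] <= Pivot) do
      Left := Left + 1;
    Arr[Right] := Arr[Left]      { one assignment; the hole is now at Left }
  end;
  Arr[Left] := Pivot;            { drop the pivot into the final hole }
  Hole := Left
end;

begin
  Data[1] := 5;  Data[2] := 3;  Data[3] := 8;  Data[4] := 1;
  Data[5] := 9;  Data[6] := 2;  Data[7] := 7;  Data[8] := 4;
  PartitionWithHole (Data, 1, Size, Resting);
  for K := 1 to Size do
    Write (Data[K] : 2);
  WriteLn;
  WriteLn ('pivot rests at position ', Resting : 1)
end.
```

On this input the array ends up as 4 3 2 1 5 9 7 8, with the pivot 5 at position 5, everything smaller to its left, and everything larger to its right.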
Actually, the rendition of quicksort that I usually show to people who are encountering the algorithm for the first time uses a partitioning method that involves even more swaps but is simpler. It looks like this:
{ This procedure runs through the elements in a specified segment of an
  array, collecting those that precede a specified pivot at the
  low-subscript end of the segment and shifting the rest to the
  high-subscript end.  The 'divider' parameter keeps track of the
  position of the last element in the low-end partition; if there are
  no elements in that partition, its value is set to one less than the
  lower boundary of the entire array segment. }
procedure Partition (var Arr: ElementArray; Start, Finish: Integer;
                     Pivot: Element; var Divider: Integer);
var
  Position: Integer;
    { counts off the positions in the array segment, from Start to
      Finish }
  Temporary: Element;
    { temporary storage for an element being moved from one position to
      another }
begin
  Divider := Start - 1;
  for Position := Start to Finish do
    if Precedes (Arr[Position], Pivot) then begin
      Divider := Divider + 1;
      Temporary := Arr[Position];
      Arr[Position] := Arr[Divider];
      Arr[Divider] := Temporary
    end
end;

{ This procedure sorts a specified segment of an ElementArray, using a
  recursive quicksort, with Arr[Start] as the pivot of the main
  partition. }
procedure QuickSort (var Arr: ElementArray; Start, Finish: Integer);
var
  Divider: Integer;
    { the highest-numbered position occupied by an element that precedes
      Arr[Start]; if there is no such position, Divider = Start }
  Temporary: Element;
    { temporary storage for an element being moved from one position to
      another }
begin
  if Start < Finish then begin
    Partition (Arr, Start + 1, Finish, Arr[Start], Divider);
    Temporary := Arr[Start];
    Arr[Start] := Arr[Divider];
    Arr[Divider] := Temporary;
    QuickSort (Arr, Start, Divider - 1);
    QuickSort (Arr, Divider + 1, Finish)
  end
end;
Does assigning a value to a variable take longer than comparing two
values?
It depends on the particulars of the architecture of the machine on which the program is running and on how many times the processor has to access memory (as opposed to just operating on its internal registers) in order to complete the operation. Often the two operations take about the same amount of time.
Is there ever a case where an O(n^2) sort (e.g., insertion sort) will run faster than an O(n lg n) sort (e.g., heap sort)?
Sure. This can happen if n is quite small, for instance. The O(n lg n) sort may achieve its superior order by making fewer passes over the data, while possibly doing much more work on each pass; if n is small, there may not be enough passes over the data for the O(n lg n) sort to pay off.
The O(n^2) sort may also be more efficient on certain arrangements of the input values. For instance, the insertion sort actually runs in O(n) time if the array is already sorted or almost sorted.
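Counting comparisons makes the difference concrete. The sketch below is a generic insertion sort written for this example (ComparisonsToSort is an invented name, and the code is not taken from any of the course texts); it sorts an array of one hundred integers and reports how many comparisons between elements it made.

```pascal
program InsertionCounts (Output);
{ Counts the element comparisons made by a plain insertion sort.
  ComparisonsToSort is a hypothetical helper for this example. }
const
  Size = 100;
type
  ElementArray = array [1 .. Size] of Integer;
var
  Data: ElementArray;
  K: Integer;

{ Sorts Arr into ascending order and returns the number of comparisons. }
function ComparisonsToSort (var Arr: ElementArray): Integer;
var
  Next, Position, Item, Count: Integer;
  Placed: Boolean;
begin
  Count := 0;
  for Next := 2 to Size do begin
    Item := Arr[Next];
    Position := Next;
    Placed := False;
    while (Position > 1) and not Placed do begin
      Count := Count + 1;               { one comparison of elements }
      if Arr[Position - 1] > Item then begin
        Arr[Position] := Arr[Position - 1];
        Position := Position - 1
      end
      else
        Placed := True
    end;
    Arr[Position] := Item
  end;
  ComparisonsToSort := Count
end;

begin
  for K := 1 to Size do
    Data[K] := K;                       { already sorted }
  WriteLn ('sorted input:     ', ComparisonsToSort (Data) : 1);
  for K := 1 to Size do
    Data[K] := Size - K + 1;            { descending order }
  WriteLn ('descending input: ', ComparisonsToSort (Data) : 1)
end.
```

On the sorted input each new element needs only one comparison, for a total of 99; on the descending input the total is 99 + 98 + ... + 1 = 4950.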
In table 10.3 (p. 423 of Walker's text), what exactly do the numbers mean? Bubble sort compares Descending Data 4950 times, and assigns 14850 times. How does this translate out to actual efficiency? And do some platforms or compilers end up running different sorting algorithms at different speeds?
Processors, operating systems, compilers, and execution environments differ
so widely that the most generally useful way to measure an algorithm's
actual efficiency is to count the number of operations it performs when
given various inputs. The table you're referring to assembles evidence of
this kind for six different array-sorting algorithms and three different
inputs: For instance, the bubble sort algorithm, when given an array of
one hundred integers in descending order and asked to sort it into
ascending order, makes 4950 comparisons between array elements (that is,
it evaluates the condition Info[J - 1] > Info[J] -- see
Walker, p. 403 -- 4950 times) and executes 14850 assignment statements to
copy array elements from one storage location to another.
The running time of an algorithm will of course vary greatly depending on whether it's being executed by a mighty supercomputer or your ten-year-old TRS-80. On a time-sharing system like the HP, the running time will also be affected by the number of other processes that are executing simultaneously. But operation counts are still a useful way to compare different algorithms with the same specification, because in general such environmental differences will speed up or slow down all operations equally, so that the ratios of the running times of various algorithms will be consistent across environments and will be approximately equal to the ratios of their operation counts.
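Operation counts of this kind are easy to collect by adding counters to the algorithm. The following sketch is a plain bubble sort with no early exit, written for this example rather than copied from Walker's text, so it is an assumption that its loop structure matches his; it does reproduce the two figures quoted above for one hundred integers in descending order.

```pascal
program BubbleCounts (Output);
{ Counts comparisons and assignments of array elements in a simple
  bubble sort.  This is not Walker's code, only a similar algorithm. }
const
  Size = 100;
var
  Info: array [1 .. Size] of Integer;
  Last, J, Temp, Comparisons, Assignments: Integer;
begin
  for J := 1 to Size do
    Info[J] := Size - J + 1;            { descending data }
  Comparisons := 0;
  Assignments := 0;
  for Last := Size downto 2 do
    for J := 2 to Last do begin
      Comparisons := Comparisons + 1;
      if Info[J - 1] > Info[J] then begin
        Temp := Info[J - 1];
        Info[J - 1] := Info[J];
        Info[J] := Temp;
        Assignments := Assignments + 3  { a swap costs three assignments }
      end
    end;
  WriteLn ('comparisons: ', Comparisons : 1);
  WriteLn ('assignments: ', Assignments : 1)
end.
```

With descending data every comparison leads to a swap, so the program reports 4950 comparisons (99 + 98 + ... + 1) and 3 x 4950 = 14850 assignments.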
Is it possible to prove the maximum efficiency for a sorting algorithm, and if so, where could the proofs be found?
It's possible to prove that any algorithm that sorts an array by comparing elements must perform on the order of n lg n comparisons, at least, in the worst case. The idea is to record on a list the value (true or false) of each comparison that is done in the course of the sort. For any two different initial arrangements of the array, the two lists of comparison outcomes must differ; otherwise, the algorithm would have performed exactly the same operations on both initial arrangements and so could not have correctly sorted both of them. So there must be as many different possible lists of comparison outcomes as there are possible initial arrangements of the array. But an array of n distinct elements can be arranged in any of n! ways initially. So n! different lists of comparison outcomes must be possible. But there are only two possible outcomes of each comparison, so a sort that never makes more than k comparisons can produce at most 2^k different lists. This means that the longest of the possible lists of comparison outcomes must have at least lg (n!) entries, and it represents a program run that involves at least lg (n!) comparisons. But lg (n!) is of order n lg n.
The proof is developed more stylishly in section 9.1 of Introduction to algorithms, by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest (Cambridge, Massachusetts: The MIT Press, 1990), and more exactly in section 5.3 of Sorting and searching, volume 3 of The art of computer programming, by Donald E. Knuth (Reading, Massachusetts: Addison-Wesley Publishing Company, 1973).
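If you would like to see how close lg (n!) comes to n lg n, a few lines of Pascal will tabulate both. This is just a numerical illustration, not part of the proof; it computes lg (n!) as the sum of lg k for k from 2 to n, using the predefined Ln function to obtain base-two logarithms.

```pascal
program SortingBound (Output);
{ Tabulates lg (n!) next to n lg n for n = 10, 100, and 1000. }
var
  N, K, Step: Integer;
  LgFactorial: Real;
begin
  N := 10;
  for Step := 1 to 3 do begin
    { lg (n!) = lg 2 + lg 3 + ... + lg n }
    LgFactorial := 0.0;
    for K := 2 to N do
      LgFactorial := LgFactorial + Ln (K) / Ln (2);
    WriteLn ('n = ', N : 4,
             '   lg (n!) = ', LgFactorial : 1 : 1,
             '   n lg n = ', N * Ln (N) / Ln (2) : 1 : 1);
    N := N * 10
  end
end.
```

The first column of results is always somewhat smaller than the second, but the ratio of the two creeps toward 1 as n grows, which is what it means for lg (n!) to be of order n lg n.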
Is there any way to tell, when using big-O notation, which type of sort will take the least overhead time to run? For example, I know that heapsort has a larger overhead than quicksort. Does big-O notation tell you anything about this?
To classify an algorithm as, say, O(n^2) or O(n lg n) is to place it in a very large group. Given two algorithms that meet the same specifications but have different orders, one can rely on the conclusion that whichever one has the smaller order will run faster on large inputs. But if the two algorithms have the same order, big-O analysis won't tell you anything about their relative performance.
Both heapsort and quicksort (in its average case) are O(n lg n) algorithms. This implies that the ratio of their running times doesn't change much as you try inputs of different (large) sizes, but it doesn't tell you what that ratio is.
Would the computer have to allocate more storage for an eight-element array of integers or for a record with eight fields, all of which are either integers or characters? What all does storage need to be allocated for in each case?
The storage allocation for an array of eight integers would be thirty-two
bytes, four for each element. The storage allocation for a record
containing eight fields, each of type Integer, would also be
thirty-two bytes. If some of the fields are of type Char
instead, then the storage allocation for the record will be smaller if any
two Char fields are adjacent. If all of the fields are of
type Char, then only eight bytes will be allocated for the
record.
The only reason for allocating more storage than is used to hold the values of the elements of an array or record is to insert padding between elements so that alignments are observed.
Why does Pascal require the use of () to define an empty
field in a record?
Actually, () specifies an empty variant in a record
type specification; a variant can include any number of fields, including
0.
The parentheses are there to prevent the occurrence of a subtle ambiguity.
What actually goes between the parentheses in the specification of a
variant is a field-list, with the same syntactic options as the
field-list that occurs between record and end.
In particular, it is possible to have subvariants within one variant.
That is, each variant can have its own list of semi-fixed fields (fixed
within that variant), its own tag field, and its own subvariants. For
example:
type
  Medium = (Book, AudioCD, Periodical, LaserDisc);
  KindOfPeriodical = (Magazine, Journal, Series);
  CatalogEntry = record
    Author: NameString;
    case Tag: Medium of
      Book: (PublicationDate: Integer);
      AudioCD: ();
      Periodical: (Publisher: NameString;
                   case SubTag: KindOfPeriodical of
                     Magazine: (IssuesPerYear: Integer);
                     Journal: (FoundingDate: Integer);
                     Series: (SeriesTitle: NameString)
                  ); { end of Periodical variant }
      LaserDisc: (Diameter: Real)
  end;
Without the parentheses, the compiler would not be able to tell whether,
say, Journal was a subvariant of Periodical or an
additional variant of CatalogEntry.
Where is it possible to find examples of practical uses of a variant record?
There's one in the upcoming handout on complex numbers. A complex number can be stored either as a pair of rectangular coordinates or as polar coordinates; the two variants are distinguished by a tag field.
Other popular applications include library catalogues (fixed fields for author, date, and place of publication, variant fields for works in various media -- books, records, tapes, microfiches, etc. -- to accommodate their differing attributes) and vehicle-licensing systems (fixed fields for owner, model, license number, etc., and variant fields to accommodate the differences between trucks, cabs, cars, motorcycles, etc.).
What real use are variant records, considering you could do the equivalent of this by simply not assigning irrelevant data to the records that you'd ordinarily be denying access to with the variant record? Is it merely to save space?
That's the most common reason. The other reason is that variant records provide a loophole by which Pascal's strong type-checking system can be defeated by unscrupulous programmers who don't mind committing what the standard calls errors. For instance, the following program reads in and stores a datum as an integer, then inspects it and prints it out as an array of bit values, thus exposing the internal representation of integer values:
program Sneaky (Input, Output);
const
  WordSizeMinusOne = 31;
type
  Bit = 0 .. 1;
  ThirtyTwoBits = packed array [0 .. WordSizeMinusOne] of Bit;
  Transformer = record
    case Boolean of
      True:
        (Number: Integer);
      False:
        (Representation: ThirtyTwoBits)
  end;
var
  Proteus: Transformer;
  Position: Integer;
begin
  Write ('Please specify an integer value: ');
  ReadLn (Proteus.Number);
  Write ('Its internal representation is ');
  for Position := 0 to WordSizeMinusOne do
    Write (Proteus.Representation[Position] : 1);
  WriteLn ('.')
end.
Technically, the expression Proteus.Representation is an
error, since the False variant of the record
Proteus is not active at the point where the reference occurs.
But HP Pascal does not detect or report this error -- which is a useful
feature, if you're trying to find out how integer values are represented
inside the machine.
In a variant record, what effect does removing the `tag' part after the
keyword case have?
If the full tag declaration is present, storage is allocated inside every variable of the record type for a value of the tag type. The tag is an actual field of every record of that type, and the programmer can assign a value to the tag field and subsequently inspect that value. If the tag declaration is reduced to a mere reference to a type, no tag field is allocated. This conserves space, but it means that there is no way to keep track inside the record itself of which variant is active; the programmer has to keep track of that information in some other way.
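The difference can be seen side by side in a small sketch. The type names Tagged and Untagged and the variable UHoldsNumber are made up for illustration; the point is only that the second record has no field in which to record the active variant, so the program keeps that information elsewhere.

```pascal
program TagSketch (Output);
type
  Tagged = record
    case IsNumber: Boolean of   { full tag declaration: IsNumber is a stored field }
      True: (Number: Integer);
      False: (Letter: Char)
  end;
  Untagged = record
    case Boolean of             { type only: no tag field is allocated }
      True: (Number: Integer);
      False: (Letter: Char)
  end;
var
  T: Tagged;
  U: Untagged;
  UHoldsNumber: Boolean;        { bookkeeping kept outside the record }
begin
  T.IsNumber := True;
  T.Number := 42;
  if T.IsNumber then
    WriteLn (T.Number : 1);

  U.Number := 7;
  UHoldsNumber := True;
  if UHoldsNumber then
    WriteLn (U.Number : 1)
end.
```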
How is it that variant records use a case-statement in the
type definition? Could you also use if-statements in type
definitions?
Although the reserved word case appears in the definition of a
variant record type, it does not begin a case-statement there.
Statements are executable; parts of definitions are not. In the
definition, case is a mere mark, separating the fixed fields
of the record from the variant selector. No other reserved word can be
substituted for it.
If the use of case in type definitions is so different, why
call it case in the first place?
Niklaus Wirth, the designer of Pascal, perceived a loose analogy between
variants -- alternative ways of structuring storage -- and the alternatives
in a case-statement. The value of the tag field of a variant
record determines which of the variants is active, just as the value of the
case-expression in a case-statement determines
which of the alternatives will be executed.
However, Wirth's decision was entirely a matter of convention. It would
have been equally possible to make, say, variant a reserved
word and use that word to separate the fixed fields of the record from the
variant selector.
Why do the different variants of a single variant record type have to occupy the same number of bytes in memory? If a record was defined as:
type
Example = record
case Space: Boolean of
True: (Num1, Num2, Num3: Integer);
False: (LessMem: Boolean);
end;
If Space is True, then the record will take up 12
bytes (correct?). If Space is False, then the
record will only take up 1 byte. That seems like a lot of efficiency
could be gained. Why does Pascal not do this?
Actually, you neglected to save space for the tag field. A value of type
Example would occupy sixteen bytes on the HPs (one byte for
the Boolean, three bytes of padding to get the alignment right, and four
bytes for each field of the largest variant).
When determining the size of a variable of a variant record type, the compiler must always reserve enough storage for the largest variant, because it usually has no way to determine which variant will be active when the program is executed.
When the storage for a variant record is allocated dynamically,
using the New procedure, it is possible to indicate which
variant you want to allocate by giving an additional argument to
New, as follows:
type
Pointer = ^Example;
var
P, Q, R: Pointer;
begin
New (P, True); { allocates the larger variant }
New (Q, False); { allocates the smaller variant }
New (R); { allocates the larger variant, by default }
However, the economical use of heap storage thus achieved has a price: Once
storage has been allocated in this way, you cannot change it from one
variant to another. You're not allowed to assign a new value to the tag
field of such a record, and you cannot use it as an argument to a procedure
or function or as the left- or right-hand side of an assignment statement.
For instance, after the three calls to New that are shown
above, the assignment P^ := R^ would be an error (and so would
R^ := Q^).
How easy is it to lie to Pascal about a variant record? It seems like it would be extremely easy to do, intentionally or accidentally. How does this fit in with Pascal's strong typing in other circumstances?
It depends on the compiler, but I've seen only one implementation of Pascal that detects the error of accessing a field of a variant that is not currently active. Most compilers allow the programmer to lie freely.
Of course, this does not agree at all with Pascal's strong-typing policy, so ruthlessly enforced in other parts of the language. It's a stylistic inconsistency in the design of Pascal.
If you try to make an allocation to a field when that case of the
variant record isn't selected, will that give a compiler error or a
run-time error? Also, would pretty much all calculations and/or
assignments to fields in variant records have to be nested in some
if- or case-statement to avoid this problem?
That seems rather tedious.
Since the compiler doesn't pre-execute the program, it is impossible for the compiler to determine which variant is active at a given point in the program. Accessing a field of an inactive variant is a run-time error.
Yes, it is usual to enclose code that deals with fields of variants in a
conditional construction that checks the tag field first. This is no more
tedious than checking that a pointer isn't nil before
dereferencing it. In both cases, the point of the test is to avoid
performing an operation that is nonsensical, an operation for which the
precondition isn't satisfied.
Is using a variant-record type more efficient than declaring two record
types and using a case-statement in one's program or procedure
to distinguish them? Which setup takes more storage?
Accessing or changing a field of a variant within a variant record is exactly as fast as accessing or changing a fixed field; there's no running-time penalty for using variant records. The alternative approach of using two different record types is not a good idea, since usually one wants to write procedures and functions to operate on such records, or to set up arrays of them, or both. You can't do this unless all the records are of the same type.
Does the Text type have any kind of sub-structure to hold
the characters?
Text files are structured as lines of arbitrary length, but the only
pre-defined operations that deal with those lines as units are
ReadLn, WriteLn, and EOLn; if you
want to perform any other operation on them, you have to write your own
procedures and functions; and if you want to pass lines around as data
structures, you have to define your own type (presumably a string type).
Is there any advantage to storing more complex data types like strings in binary files instead of text files? It seems that there would be some point at which the binary file doesn't make the most efficient use of memory for storing data, but I was wondering if this intuition is valid.
If the same data are stored in a binary file and in a text file, the binary file is almost always smaller, regardless of the type of the data. Input from and output to a binary file may also be faster, because the conversion of the data between the internal representation and the human-readable character representation is avoided (although the time required for this conversion is usually swamped by the time required for the data transfer to or from the hard disk anyway).
There are only two advantages in storing data in text files: (1) Human beings can read text files more easily. (2) It's usually easier to port text files from one machine to another.
Suppose that I access a text file. The read head is in front of the
first character in the file. I start a recursive loop that involves the
Get procedure on the text file. After the recursive loop is
complete, is the read head in front of the first character of the
file?
No. The recursive procedure accesses the file either through a global
variable or through a variable-parameter that is passed from one level of
recursion to the next. In either case, the side effect of the
Get procedure changes the state of the original file.
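A minimal sketch of the situation, assuming a compiler that supports the standard Get procedure on text files (the procedure name SkipLine is hypothetical): the file travels down the recursion as a variable-parameter, so every call to Get advances the one and only file window.

```pascal
program GetSketch (Input, Output);

{ Recursively advances the window of F to the end of the current line. }
procedure SkipLine (var F: Text);
begin
  if not EOF (F) then
    if not EOLn (F) then begin
      Get (F);          { side effect: moves the window of the caller's file }
      SkipLine (F)
    end
end;

begin
  SkipLine (Input);
  if EOF (Input) then
    WriteLn ('The window has moved to the end of the file.')
  else if EOLn (Input) then
    WriteLn ('The window has moved to the end of the first line.')
end.
```

When the recursion unwinds, nothing restores the earlier position; the window stays wherever the deepest call left it.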
When a text file is stored on the disk, how much space does each character take up? The handout seems to indicate that it is 32 bits, but this seems wasteful to me.
Each character occupies eight data bits on the disk.
Do most implementations of Pascal (HP, for instance) provide any non-standard extensions for the use of random-access files?
A lot of them, including HP Pascal, do provide such extensions. In HP Pascal, the most useful repertoire consists of the following group of procedures and functions, which are all predefined:
procedure Open (var LogicalFile: FileOfElement; PhysicalFile: String);
function Position (var LogicalFile: FileOfElement): Integer;
procedure ReadDir (var LogicalFile: FileOfElement; Index: Integer; var Legend: Element);
procedure WriteDir (var LogicalFile: FileOfElement; Index: Integer; Scribend: Element);
function LastPos (var LogicalFile: FileOfElement): Integer;
procedure Close (var LogicalFile: FileOfElement);
The Open procedure opens a binary file in a way that allows
both reading and writing operations to be performed. If the physical file
(that is, the file as seen by the operating system) already exists, it is
not erased; if it does not, it is created (empty).
The Position function returns the ``current position index,''
a non-negative integer that indicates how many elements of the file precede
the position at which the file window is currently placed.
The ReadDir procedure moves the file window to the specified
position index and copies the value that then appears in the window into
the variable-parameter Legend.
The WriteDir procedure moves the file window to the specified
position index and copies the value of the parameter Scribend
into the file window.
The LastPos function returns the position index of the last
component of the file.
The Close procedure breaks the connection between the logical
and physical files, in effect announcing that all operations on that
logical file have been completed and that the operating system can resume
responsibility for the physical file.
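Here is a tentative sketch of how these pieces might fit together. It relies only on the headers and behavior described above, is specific to HP Pascal, and assumes that a file with the hypothetical name scores.dat already exists and contains at least one integer.

```pascal
program LastElement (Output);
type
  FileOfInteger = file of Integer;
var
  Scores: FileOfInteger;
  Last: Integer;
begin
  { Non-standard HP Pascal calls; 'scores.dat' is a made-up file name. }
  Open (Scores, 'scores.dat');
  { Jump straight to the last component instead of reading sequentially. }
  ReadDir (Scores, LastPos (Scores), Last);
  WriteLn ('Last stored value: ', Last : 1);
  Close (Scores)
end.
```

Using LastPos as the index avoids any assumption about where position numbering starts.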
Is there any way to flush a buffer in standard Pascal?
No.
Is it possible to read into a record data from a file in one command if all the information in the file is set up in a certain manner?
Yes, but not from a text file. The file must have been created in the first place by a program that included some such definitions and declarations as
const
NameStringLength = 13;
type
NameString = packed array [1 .. NameStringLength] of Char;
County = record
Name: NameString;
Population: Integer;
Area: Real
end;
CountyFile = file of County;
var
IowaFile: CountyFile;
Current: County;
If our source file had been created in that way, you would be able to use
the same definitions and declarations in your program, and then write
Read (IowaFile, Current)
to recover a complete record from the file. Alas,
/u2/stone/datasets/Iowa-counties.dat was not so constructed,
so you'll have to fill in the fields of the record one by one.
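To illustrate what "created in that way" means, here is a small sketch that writes one County record to a scratch file and then recovers it with a single Read. The definitions are the ones above; the variable name Scratch and the field values are placeholders invented for the example, not real data.

```pascal
program CountyRoundTrip (Output);
const
  NameStringLength = 13;
type
  NameString = packed array [1 .. NameStringLength] of Char;
  County = record
    Name: NameString;
    Population: Integer;
    Area: Real
  end;
  CountyFile = file of County;
var
  Scratch: CountyFile;    { not in the program header, so a temporary file }
  Current: County;
begin
  Current.Name := 'Example      ';   { exactly 13 characters }
  Current.Population := 1000;        { placeholder value }
  Current.Area := 2.5;               { placeholder value }

  Rewrite (Scratch);
  Write (Scratch, Current);          { the whole record goes out as one component }

  Reset (Scratch);
  Read (Scratch, Current);           { and comes back in one operation }
  WriteLn (Current.Name, Current.Population : 8, Current.Area : 8 : 1)
end.
```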
I have a textfile and I want to read in the next 30 characters. If I
have an array declared as packed array [1 .. 30] of Char, can
I use the following command?
Readln(text, Arrayname:30)
Under HP Pascal, you can read in a string as a unit; the command would be
Read (TextFile, ArrayName)
But this is non-standard; in standard Pascal you have to write
for Position := 1 to 30 do Read (TextFile, ArrayName[Position])
If I have the declaration
var G: Text;
how can I read in text from a file using a standard compiler?
It depends on the implementation. Under HP Pascal, you can use the
standard, one-argument form of the Reset statement:
Reset (G);
but only if the file from which you want to read in data is actually named
G. The implementation of Pascal that the original
designer of the language developed required the user to give the operating
system's name for the file on the command line when activating the program,
and then matched that command-line name to the ``program parameter'' that
appears in the header:
program Whatever (G);
Under what circumstances can one use a file without mentioning it in the program header?
A variable of a file type need not be listed in the program header unless it is global; if a variable is used only inside one procedure or function and is a local variable of that procedure or function, it need not (and indeed may not) appear in the program header.
In standard Pascal, one seldom wants to declare a local variable of a
file type, because such files can only be used as ``scratch files'' -- they
cannot exist before the procedure is entered and are discarded when the
procedure is exited, so they serve only as temporary storage for data that
won't fit in memory. If you're using the non-standard two-argument forms
of Reset and Rewrite, however, you can use these
local variables to access or to create permanent files, which are of course
much more generally useful.
If I use the command rewrite(G, 'bats.dat') in a program,
but there already exists a file by that name, will it automatically save
over the old copy?
Yes -- that is, the old version of the file will be erased.
HP Pascal provides two other ways to open an existing file, depending on
exactly what you want to do with it. The predefined HP Pascal procedure
Append opens a file and positions the file buffer at the end
of it; subsequent calls to Write and WriteLn will
add to the file. Append normally takes two arguments, just
like Rewrite: the Pascal file variable and a string giving the
operating system's name for the file you want to append to, thus:
Append (Foundation, 'frogs.dat')
There is also a predefined HP Pascal procedure
Open that is a
sort of combination of Reset and Rewrite; it
opens up a file and positions the file buffer at the beginning of the file.
The program can then perform any sequence of calls to Read,
Write, Get, and Put on this file.
In other words, the file is open both for input and for output. Not
surprisingly, this will not work if the file is of type Text,
because arbitrary Write operations into the middle of an
existing file would destroy its line structure. Open can be
applied only to files of programmer-defined types. It takes the same
arguments as Append.
You talk about programs being portable a lot. Should I not use my favorite way of opening files?
ReadLn (FileName); Reset (F, FileName)
There are two questions here: (1) Should you use the two-argument forms of
Reset and Rewrite, even though they are
non-standard and hence not portable? (2) Should you read in the name of an
input or output file during program execution?
My answer to question (1) is yes: It is practically impossible to ensure
the portability of calls to Reset and Rewrite
anyway, since there is so much variety in the mechanisms that are used to
attach files to Pascal file variables, so you might as well use
whatever syntax your local compiler prescribes.
The answer to question (2) depends on the specification for the program. If the file names are fixed once and for all in the specification, as in exercise #1, it's better to define them as string constants at the top of the program rather than relying on the program user to type them in correctly whenever the program is run. But if the file names are allowed to vary from one execution of the program to another, you should indeed read them in from the keyboard as you suggest. Some implementations of Pascal make it possible to acquire a file name from the command line used in invoking the program or even from a Finder-like pop-up window; these extensions too are non-standard and non-portable, but may be worth looking into if you're doing a lot of development for a single machine and windowing environment.
If the file implementation and handling procedures are so difficult to deal with in Standard Pascal, then what kinds of solutions have been suggested or used on a proprietary basis? And why haven't revisions of the Pascal standard improved this area of the language -- because of the difficulty of agreeing on one particular solution out of many possibilities?
The usual solution is to look at the facilities that are provided by the operating system under which a particular Pascal system is designed to run, and then to provide non-standard pre-defined procedures and functions that are simply Pascal interfaces to these facilities.
The main reason that the Pascal standard has not been changed to lift some of the limitations having to do with files is that operating systems differ so widely that it would be difficult to graft a universally applicable solution onto the existing language, so that existing Pascal code would still work. Since Pascal is designed mainly as a teaching language, people have been very conservative about changing it since the appearance of the ANSI/ISO standard.
There is a language called Extended Pascal that has a slightly better
implementation of files (it includes, for instance, procedures analogous to
the Append and Open procedures mentioned in the
previous answer), but it has never become popular, and the College doesn't
have a compiler for it.
A question was submitted earlier about the necessity of having files listed
in program declarations. I didn't realize that one could declare, read, and
write files locally! The same should also be true for Input
and Output. These are predefined identifiers; can they be
used locally? --
procedure Foo (Input, Output);
var
GetDatum: Integer;
begin
Read(GetDatum);
Write(GetDatum)
end;
Am I not able to use the Read command if I have not specified
Input in my header? and the same with Write and
Output? It would seem to me that (Input) is a
call to a rather extensive procedure, which defines, among other things,
Read and ReadLn commands. Is this true?
From Pascal's point of view, the identifiers Input and
Output are just variables of type Text that are
initialized before the program proper starts to run. You are required to
put Input in the program header if your program ever refers to
it, either explicitly or implicitly. (Any call to Read,
ReadLn, EOLn, or EOF with no file
argument counts as an implicit reference to Input.)
Similarly, you are required to put Output in your program
header if your program ever refers to Output or calls
Write, WriteLn, or Page with no file
argument. However, batch-mode programs often don't do any of these things
and so need not list Input or Output as resources
in the program header.
However, your procedure won't work as shown, since its parameter list is
not well-formed -- you didn't specify the type for the parameters
Input and Output. And if you do specify a type
for them, the local definitions will supersede the pre-definitions.
Moreover, the no-file-argument versions of Read,
ReadLn, EOLn, EOF,
Write, WriteLn, and Page will still
refer implicitly to the global values of Input and
Output, so that you'll have to list them in the program header
even if you use these procedures and functions only locally.
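For comparison, a well-formed way to get the effect the question seems to be after is to pass the files explicitly as variable-parameters of type Text. This is only a sketch; the parameter names Source and Sink are invented, chosen so as not to hide the predefined identifiers.

```pascal
program LocalIO (Input, Output);

{ Reads one integer from Source and echoes it to Sink. }
procedure Foo (var Source, Sink: Text);   { file parameters must be var }
var
  Datum: Integer;
begin
  Read (Source, Datum);
  Write (Sink, Datum : 1)
end;

begin
  Foo (Input, Output);
  WriteLn
end.
```

Input and Output still appear in the program header, because the main program refers to them.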
I have to say I'm surprised that random-access files were omitted from Pascal ... was this perhaps because tape drives were the most common form of storage when the language was originally introduced?
Roughly. My guess is that it was because the designer of Pascal wanted to ensure that standard Pascal could be implemented correctly even on machines that used mainly sequential-access storage; the issue was not so much whether such machines were common, but simply whether to build into Pascal a feature that not all machines could implement.
Why is it that files are stored in double-word alignment?
Because the base type of a binary file might be any sort of value,
including one like LongReal that itself requires double-word
(eight-byte) alignment. Requiring the file variable itself to be
eight-byte-aligned ensures that the file buffer that contains one value of
the base type will also be eight-byte-aligned (since the size of the
control block, 320 bytes, is a multiple of eight bytes).
Is the file input/output buffer set up by the Pascal compiler transparently to the user? If so, why is it necessary, given the way in which modern machines aggressively cache disk data into RAM?
Most (IDE) hard disks I see advertised tout the built-in cache (typically 32K to 128K). Why is this important, if the operating system caches disk data? Or why should the operating system bother if the hardware is caching? It seems at first blush as if the same pieces of data could reside in multiple buffers, which would be wasteful.
When a Pascal program moves data from a hard disk into the processor or vice versa, the data may actually be buffered several times along the way -- in processor registers, in the processor cache, in an operating system's buffer, and in the disk's hardware cache. This speeds up the program, but it is indeed wasteful of other resources, particularly memory -- not just the memory occupied by the buffered data, but also memory used for keeping track of the current position in the buffer and of which elements of a cache have valid, up-to-date values and which ones may need to be updated to reflect changes made since the data was placed in the buffer.
The reason why the tradeoff is a good one is that transferring data to or from a hard disk is an impossibly slow operation from the point of view of the processor, which can execute thousands of instructions while the hard disk is dragging the mechanical arm carrying the read-write heads out to the correct track and waiting for the correct sector to come around to the position at which it can be accessed. Anything that can be done to avoid making the trip all the way out to the hard disk for data is advantageous.
The symbol for the buffer is the same as the symbol for pointer-types,
^. Are these two ideas somehow related? Is the computer
somehow using the same function for pointers and file buffers?
No, the two uses of circumflex-accent are independent of one another, and the compiler translates them in completely different ways. This is a flaw in the design of the language.
Why are there two representations of the file buffer variable,
f^ and f@?
When Pascal was designed, there were a lot of computers around that didn't
have circumflex-accent in their character set, and even some of the
computers that did have it were accessed through input devices (card
punches, rebuilt teletype machines) that could not generate that character.
To accommodate such environments, the Pascal standard specifies that the
commercial-at character, @, can be used in place of the
circumflex-accent, anywhere it occurs -- not merely in file buffer
variables, but also in definitions of pointer types and in
pointer-dereferencing expressions.
Can one use the buffer variable (^) to look out for the end
of a file of integers?
No. If you're at the end of the file, inspecting the buffer is an error
and may crash your program. You must call the EOF function
before inspecting the buffer.
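The safe pattern can be sketched as follows, using a scratch file of integers whose contents are made up for the example. EOF is tested before every inspection of the buffer variable, so the buffer is never examined at the end of the file.

```pascal
program BufferSketch (Output);
var
  Numbers: file of Integer;   { scratch file: not listed in the program header }
  Total: Integer;
begin
  { Build a small file to read back. }
  Rewrite (Numbers);
  Write (Numbers, 10);
  Write (Numbers, 20);
  Write (Numbers, 12);

  { Walk through it with the buffer variable, checking EOF first each time. }
  Reset (Numbers);
  Total := 0;
  while not EOF (Numbers) do begin
    Total := Total + Numbers^;   { safe: EOF is known to be False here }
    Get (Numbers)
  end;
  WriteLn ('Total: ', Total : 1)   { 10 + 20 + 12 }
end.
```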
Within the Turbo Pascal environment, which I used last year, it was
possible to write and compile ``units,'' or a list of procedures and
functions that could be called as such -- for, say, a unit named
strings:
PROGRAM stoopid (input, output);
uses strings;
BEGIN
{ ... }
END.
While I recognize that this is non-standard, it is very useful. Is there any
way this can be done in HP Pascal?
Yes. The HP Pascal analogue of a Turbo Pascal unit is a module. You can read about modules in chapter 2 of the HP Pascal / HP-UX programmer's guide, or wait for them to be covered in lecture on October 7.
I am not very comfortable with the whole module concept yet. Is it basically a group of procedures that are exported and implemented that you can call on in another program if you import and compile correctly?
Yes, that's right -- procedures, functions, types, constants, and variables, actually. From the application programmer's point of view, having a module available is a little like having an extended version of Pascal to work with -- a version in which there are some extra predefined identifiers. Using the imported procedures, etc., application programmers can write shorter, simpler, more powerful programs.
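A very small sketch of the arrangement, in HP Pascal only. The module name Arithmetic, the function Twice, and both file names are hypothetical, and the exact layout should be checked against the HP Pascal / HP-UX programmer's guide; the keywords export, implement, and import and the $search directive are the ones discussed in these notes.

```pascal
{ ---- file arithmetic.p (hypothetical name): the module ---- }
module Arithmetic;
export
  function Twice (N: Integer): Integer;   { the interface seen by importers }
implement
  function Twice (N: Integer): Integer;   { the hidden implementation }
  begin
    Twice := 2 * N
  end;
end.

{ ---- file main.p (hypothetical name): a program that imports it ---- }
$search 'arithmetic.o'$
program Main (Output);
import Arithmetic;
begin
  WriteLn (Twice (21) : 1)   { Twice behaves like an extra predefined function }
end.
```

The two parts are separate compilation units: the module is compiled first, and the program is then compiled with the resulting .o file available to the compiler.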
How does having the program broken up into modules save compilation time? Doesn't the compiler still have to compile the same amount of stuff?
Yes, but it doesn't have to re-compile it all when the programmer changes only one module. During the development of a program that includes many modules, most of the modules won't change at all (because they will be taken from well-tested, thoroughly debugged libraries). The compiler won't spend any time recompiling these.
What relationship do modules have to objects in object-oriented programming? They seem to be structured rather similarly (class/local variables, etc.)
An HP Pascal module that gives the interface and implementation of an abstract data type is quite a bit like a C++ or Java class definition and can be used in the same way during software development -- namely, to establish and enforce a contract between the application programmer who uses the abstract data type and the library programmer who implements it. The most important difference is that in an object-oriented language one class can be derived from another, ``inheriting'' most of the other's operations. HP Pascal modules can only import from one another and must therefore implement all their operations.
Does using modules aid in anything other than compiler run-time and reusability?
By accommodating and encouraging an intelligible division of labor among members of a programming team, the use of modules also increases the team's productivity and the quality of the software they produce.
When using a module, do you always have to use $search 'filename.o'$? What exactly does it do?
Yes, the compiler directive is required. Besides the machine instructions for the procedures and functions exported by the module, the .o file includes symbol-table information about the identifiers that the module defines, and the compiler examines it to determine, for instance, the types of parameters to procedures and functions.
If you're importing a module that builds upon a module that builds upon a module...how many of those modules do you actually have to search for with the compiler directive? In other words, back to which layer must you search?
All the way to the bottom. According to the HP Pascal / HP-UX reference
manual, ``Pascal requires that lower level modules be included in the
$SEARCH path, even if the higher level modules do not use
them.''
What would happen if you made two programs into .o files and linked them together?
The linker would complain that the (compiler-supplied) identifier that indicates the point at which execution is supposed to begin is defined twice. This is an insuperable error as far as the linker is concerned.
What is the problem that is addressed with an opaque type? More
specifically, in the Sequences module, what is the rationale
behind declaring the pointer type in the export section and
the record type in the implement section? Are there other
ways to address the same problem?
The pointer type, Sequence, has to be exported from the
module; otherwise, application programmers wouldn't be able to declare
variables of that type or invoke any procedures with parameters of that
type. The record type, SequenceRecord, should not be
exported, because it's used only in the implementation -- none of the
procedure or function headers refer to it, and a different implementation
of sequences might use a completely different structure.
The module would still work the same way if SequenceRecord
were exported, but then application programmers would be able to inspect
and assign to fields of such records, tinkering with the internals of the
data structure. This is undesirable, both because they might introduce
errors (disposing prematurely of dynamically allocated storage, for
instance), and because if they take advantage of the privilege of seeing
these details of the implementation, their programs will be
implementation-dependent. When the author of the Sequences
module then decides to rewrite it, using different field names or a
different internal structure, all the application code that refers to the
old field names will break. If the type is opaque and the author of the
Sequences module does not change its interface, the
application programs will continue to work even if the
implement section of the Sequences module is
completely rewritten.
In standard Pascal, there's no way at all to address this problem effectively. In HP Pascal, using a pointer type is the only possibility, since there's no way to export the type without exporting its definition, and only a pointer type effectively conceals the internal structure of the object to which it is pointing. Other programming languages have a variety of more selective mechanisms, different levels of opacity, and so on.
Does using modules affect the compiled code in a positive or negative way?
It makes no difference, when the program is running, whether it was built from one compilation unit or from several. Compiling in modules does not affect either the running time or the use of memory.
What's the best way to take advantage of modules? It seems to me in doing exercise 5 that connecting to preexisting modules takes a lot of time and effort and usually ends up giving you more functions than you want or need. Am I just not planning well, or do modules take a lot of work to import properly?
Setting up to use HP Pascal modules actually goes pretty quickly after you get the hang of it, though getting the hang of it seems to take a long time if you're doing it by trial and error. However, the HP Pascal module design is clearly an alien structure grafted onto Pascal; it's not elegant. There have now been many programming languages -- unfortunately, not the most popular ones -- that included easier-to-use modules successfully in their original design.
When HP Pascal generates an executable file, do unused functions from a module get included? Is the file bigger or more complex than it needs to be? Could I write a file of 100 unused functions and compile it to nothing?
Yes, when you link a compiled module to a main program, the resulting
executable contains all of the functions and procedures from the module,
even those that are never invoked. This is an argument for writing lean,
mean modules (like the Queues module in the handout on radix
sorting) instead of the encyclopedic structures that appear in most of the
handouts.
A smarter compiler could sift through a library module and take only the functions it needs. So far very few compilers are that smart.
I have a question about the heap_dispose directive. Does
it need to be turned on in all parts of a program, or just the parts that
have calls to Dispose in them? For example, in my last
program, I included the queue library. Do I need to include the
heap_dispose directive in my main program to recycle queue
memory, or does the directive in the queue module suffice?
Let's say that a procedure or function ``tries to recycle'' if it invokes
Dispose directly or if it invokes any other procedure or
function that tries to recycle. Memory will not be recycled unless
$heap_dispose on$ appears in every module containing
a procedure or function that tries to recycle.
Ok, I understand how to write the $search directive and the
import declaration for a program that imports a module that
itself imports another module, but I still don't quite understand why it
needs to be done this way. I'd like a greater understanding of the way
modules function.
OK. The basic idea is that any compilation unit A within a program
can use identifiers defined in a different compilation unit B,
provided that (1) B is mentioned in an import
declaration within A that precedes every occurrence of such
identifiers; (2) the .o file resulting from the compilation of
B is accessible to the compiler when it is compiling A, and
the same is true for every module from which B imports identifiers
that are used in its export declaration, and so on
recursively; (3) the import declaration of B in
A is preceded by a $search directive that mentions all
of those .o files.
Here's how I understand the mechanism behind these rules. When the
compiler encounters an import declaration that names certain
modules, its job is to add all the identifiers exported by those named
modules to the symbol table that keeps track of the meanings of all defined
identifiers. So it locates the .o file for that module and
examines the table of identifiers stored in that object file, merging it
into its own symbol table. But that .o file contains references
to identifiers that were imported from other modules. So the symbol table
will not be complete until the tables of identifiers for those other
modules are merged in as well; hence the compiler consults the .o
files for those modules too. The symbol table won't be complete unless the
compiler can find all of these .o files, so they must all be
mentioned in the $search directive.
In effect, a module B that imports identifiers from another
module C and uses them in its own export declaration is
re-exporting C, so that when A imports B it must also
import C.
Doesn't it seem stupid that Pascal requires one to keep track of which imported modules themselves import other modules? Why doesn't each module take care of its own imports? This would make module implementation much easier, especially in cases when one is using many many modules spread out all over the place, and even more so if one is using modules that someone else has written.
Standard Pascal doesn't use modules at all, of course. The need to keep track of indirect module importations does strike me as a flaw in HP Pascal, but it's an understandable consequence of having to retrofit a module system onto a language that was not designed for it, while maintaining compatibility with a linker that is itself not very sophisticated.
How would one go about constructing a data type in a Pascal program to represent a complex number as coordinates? I see how you can do lots of operations on complex numbers if you enter the real and imaginary parts of them as separate real numbers or real numbers in a record, but I'm having trouble seeing how Pascal could represent an actual coordinate system on a plane.
There's really nothing more to see -- it's just a matter of looking at what you're already seeing in a different light.
Suppose that an application programmer who wants to use complex numbers
receives the .o file compiled from the ComplexNumbers
module, together with a list of the headers for the functions and
procedures in that module. She could perfectly well use all those
functions and procedures without ever knowing that the
ComplexNumber data type was implemented as a record, just as
in your programs using HP Pascal files you didn't know that HP Pascal
represented each variable of type Text as a control block
together with a 254-byte buffer. How would such a programmer think of
complex numbers? She'd treat them as if they were one more built-in data
type, without worrying about the separate real and imaginary parts. She'd
use the ReadComplexNumber and WriteComplexNumber
procedures to read them in and write them out without ever taking them
apart into their constituents.
Why should we use polar coordinates for representation of imaginary numbers? It seems that both intuitively, and due to their form, they would be much easier to work with in rectangular coordinates.
That used to be my intuition as well, and addition and subtraction, which are perhaps the most common operations on complex numbers, are admirably simple in rectangular coordinates. But other operations are actually much easier in polar coordinates. Compare the algorithms for division, for instance:
{ rectangular coordinates }
Denominator := Sqr (Divisor.RealPart) + Sqr (Divisor.ImaginaryPart);
Result.RealPart := ((Dividend.RealPart * Divisor.RealPart) +
                    (Dividend.ImaginaryPart * Divisor.ImaginaryPart))
                   / Denominator;
Result.ImaginaryPart := ((Divisor.RealPart * Dividend.ImaginaryPart) -
                         (Divisor.ImaginaryPart * Dividend.RealPart))
                        / Denominator;
{ polar coordinates }
Result.Magnitude := Dividend.Magnitude / Divisor.Magnitude;
if Result.Magnitude = 0.0 then
  Result.Phase := 0.0
else
  Result.Phase := Dividend.Phase - Divisor.Phase;
To take another example, the exponential function is almost trivial if the
argument is in rectangular coordinates and the value returned is in polar
coordinates; if the value returned has to be in rectangular coordinates,
the most straightforward way to compute it is to find it in polar
coordinates and then convert them.
The Divide function you outline states as a precondition that the divisor is not 0.0 + 0.0i. Why would division work if, say, the number was 0.0 + 1.0i?
It works because the magnitude of the divisor is 1.0 rather than 0.0, and that's what's needed for a successful division. To see that this makes sense, start by considering the definition of the imaginary unit: i^2 = -1. So isn't it plausible that -1/i should be i? Similar reasoning shows that the result of dividing a complex number a + bi by i is always b - ai. (Multiplying this quotient by i gives you the dividend back: bi - ai^2 = bi - a(-1) = bi + a = a + bi.)
The main requirement for a division operation is that it should be the inverse of multiplication; complex multiplication has a unique inverse whenever the multiplier has a non-zero magnitude.
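The claim about dividing by i can be checked mechanically. Here is a minimal sketch that wraps the rectangular-coordinate formulas quoted earlier in a stand-alone program; the record layout, the procedure name Divide, and the parameter name Quotient are assumptions made for this illustration and need not match the handout's module.

```pascal
program DivideByI;
{ Hypothetical stand-alone sketch; field names follow the fragment
  quoted earlier, but the type and procedure are assumptions. }
type
  ComplexNumber = record
    RealPart, ImaginaryPart: Real
  end;

{ Precondition: Divisor is not 0.0 + 0.0i. }
procedure Divide (Dividend, Divisor: ComplexNumber;
                  var Quotient: ComplexNumber);
var
  Denominator: Real;
begin
  { The denominator is the square of the divisor's magnitude. }
  Denominator := Sqr (Divisor.RealPart) + Sqr (Divisor.ImaginaryPart);
  Quotient.RealPart := ((Dividend.RealPart * Divisor.RealPart) +
                        (Dividend.ImaginaryPart * Divisor.ImaginaryPart))
                       / Denominator;
  Quotient.ImaginaryPart := ((Divisor.RealPart * Dividend.ImaginaryPart) -
                             (Divisor.ImaginaryPart * Dividend.RealPart))
                            / Denominator
end;

var
  Z, I, Q: ComplexNumber;
begin
  Z.RealPart := 3.0;  Z.ImaginaryPart := 4.0;   { 3 + 4i }
  I.RealPart := 0.0;  I.ImaginaryPart := 1.0;   { the imaginary unit }
  Divide (Z, I, Q);
  { With a = 3 and b = 4, the rule b - ai predicts 4 - 3i. }
  WriteLn (Q.RealPart:4:1, ' ', Q.ImaginaryPart:4:1)
end.
```

Here the denominator is 0^2 + 1^2 = 1, so the division goes through even though the divisor's real part is zero.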
How large a worry is the loss of precision in successive function calls in the complex number module?
Immense. When I wrote the module, I was much more concerned to keep the functions simple than to avoid losses of precision. Basically, I wrote all of the functions as if operations on real numbers always produced completely accurate results. Since this premise is false, the module's operations on complex numbers can yield very inaccurate answers. Repeatedly converting values from rectangular to polar coordinates and back doesn't help any, either.
Why didn't you define a procedure for writing to a binary file?
Perhaps that would be a useful addition to the module. Of course, you'd
need a procedure for reading from the binary file as well. In the current
implementation, the user could use the built-in Read and
Write procedures for that purpose; but that approach wouldn't
work if I had decided to use a structure accessed through a pointer for the
ComplexNumber type. (The thirty-two-bit pointer could be
copied into a binary file, but the storage accessed through that pointer
would not be copied along with it.)
The only disadvantage I can see is that the module would have to define and
export a ComplexNumberFile data type, since the input and
output procedures would need a parameter of this type.
It's been a long time since I learned about complex numbers. Where in the real world are they used?
In physics and engineering. For instance, the most natural models of many phenomena in fluid mechanics and electromagnetism involve complex numbers.
In computing, symbolic-algebra packages like Maple and Mathematica perform
many of their computations involving trigonometric functions, series, and
polynomials in the domain of complex numbers. Trigonometric identities, in
particular, are much simpler and more manageable when recast as properties
of the exponential and logarithm functions on complex numbers. (Although
the ComplexNumbers module provided on the handout defines
several complex operations in terms of trigonometric functions, it would be
possible to define them directly as the limits of series. Indeed, this
would probably yield more accurate functions.)
What is the difference between ``statically allocated'' and ``dynamically allocated'' data (Walker, page 268)?
``Dynamic allocation'' refers to the use of the New procedure
to set aside storage locations that can be accessed only through pointers.
The region of memory that is reserved for dynamic allocation is sometimes
called the heap.
Strictly speaking, ``static allocation'' means associating variables with storage locations at the beginning of program execution, for the entire duration of the program. Many programming languages use this form of allocation for global variables. As I mentioned in class, the FORTRAN programming language uses it for all variables, including those in subroutines; this is why recursion doesn't work in FORTRAN.
In Pascal implementations, however, storage for procedure and function parameters and local variables is allocated on the run-time stack, within the activation record for the procedure or function. This is kind of like static allocation, except that the association of a variable with a storage location exists only during the execution of the procedure or function, not for the entire duration of the program. Walker uses the term ``static allocation'' for this arrangement as well.
The Pascal run-time system often treats a main program as simply a procedure that is the first one invoked, so that global variables are allocated as part of the activation record for the main program. This is the set-up that Walker describes on pages 268-271.
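As a rough illustration of the three kinds of storage just described, here is a small sketch. All of the names in it (AllocationKinds, Counter, Cell, Bump, Step) are invented for the example, and where exactly a given compiler places the global variables is implementation-dependent, as noted above.

```pascal
program AllocationKinds;
type
  IntPointer = ^Integer;
var
  Counter: Integer;      { global: exists for the whole run of the program }
  Cell: IntPointer;      { the pointer itself is global; its target is not }

procedure Bump;
var
  Step: Integer;         { local: lives in Bump's activation record }
begin
  Step := 1;
  Counter := Counter + Step
end;                     { Step's storage is released on return }

begin
  Counter := 0;
  Bump;
  New (Cell);            { dynamic: storage set aside on the heap }
  Cell^ := Counter;
  WriteLn (Cell^);
  Dispose (Cell)         { dynamic storage must be released explicitly }
end.
```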
Is there a way to check whether or not the call to the New
procedure failed to allocate storage? If so, why haven't any of the
in-class examples checked this? I know it would be cumbersome to check
each time for the success on the procedure for machines with so much
available memory, but it seems like slightly bad style to ignore this.
No, it wouldn't be cumbersome; we would simply define a procedure that
calls New, performs the check, and returns not only the new
value of the pointer but also a Boolean value indicating whether the
allocation succeeded (through an additional variable-parameter).
The problem is that Pascal does not provide any way to perform such a test.
If you call New, and no more memory is available, your program
crashes -- end of story.
Are there a lot of uses for pointers besides the tree-like data structures and chains like we explored last semester?
Most uses of pointers have a family resemblance either to linear structures or to trees, but there are lots of variations that were not mentioned in the 151 course.
What is the difference between sequences and lists?
None of the operations defined for sequences is a mutator, so the size and contents of a sequence are completely fixed for the entire lifetime of the sequence. Many of the operations defined for lists have side effects on their list arguments, so it is usual for the size and contents of a list to change during its lifetime.
Looking at the sequence/list readings, I can't help but think of linear algebra as the most immediate application of these structures. Is this something the designers of Pascal had in mind when they created these?
Perhaps, but they also had in mind a lot of real-world problems for which the programmer needs a homogeneous data structure of a size that is determined during the execution of a program. (Of course, lots of these real-world phenomena are modelled by linear algebra as well.)
What should we consider when deciding whether to use linked lists or arrays? Is it entirely style? Specifically, is it different when dealing with queues and stacks?
No, it's not entirely a matter of style. There are two circumstances in which the linked-list implementation is clearly preferable: (1) when you have no way to know at compilation time what the maximum number of elements in the structure will be, and (2) when the structure will usually contain very few elements, but may sometimes contain many.
In case (1), the linked list gives you some run-time protection against running out of space, whereas with the array you have to make the decision when you define the array type and are stuck with its consequences. (On the other hand, if you do run out of space when using linked lists in standard Pascal, your program crashes; with an array you may be able to report the error in a friendlier manner.)
In case (2), the advantage of using the linked-list implementation is that you save space during most of the execution of the program, since your data structure is small whenever it contains few elements. An array, of course, occupies the same amount of space no matter how many of its positions are actually being used for elements of the data structure.
The advantages and disadvantages of using linked lists are essentially the same whether you're implementing stacks or queues.
What are some uses of the list abstract data type?
Well, you might use it in exercise #1 to keep track of the various counties in Iowa, if you didn't know in advance how many of them there are; in exercise #2, to keep track of the characters in a string as it is read in; in exercise #4, to keep track of the player records or of the ``most similar'' list for a given player; in exercise #5, to keep track of the index entries, or of the page numbers associated with a given entry or, again, of the characters in an index entry; and in exercise #7, to keep track of the voter identification numbers or of a given voter's votes on various propositions.
In general, lists are used when you have a data set that is more often traversed sequentially than accessed randomly, of a size that is usually small but not predictable when the program is written.
In the MergePieces procedure at the top of page 416 in
Walker's text are the lines:
Begin
  If Second > Max + 1
    Then StopFirst := Max
    Else StopFirst := Second - 1;
  If Second + Size > Max + 1
    Then StopNext := Max
    Else StopNext := Second + Size - 1;
  NewItem := First;
Frankly, I don't understand it. What are the variables
Second, StopFirst, and StopNext?
And what does this code do?
The MergePieces procedure is supposed to combine two adjacent
segments of an array, each of which has already been sorted separately,
into one larger sorted segment. The argument First is the
index of the leftmost element of one of the segments, and
Second is the index of the leftmost element of the other. But
it is also helpful to compute the index of the rightmost element of each
segment. StopFirst is the index of the rightmost element of
the segment that begins at First, and StopNext is
the index of the rightmost element of the other segment. The
else-clauses of the two if-statements quoted
above compute these indices: Since the two segments are adjacent, the
rightmost index of the first segment (StopFirst) is one less
than the leftmost index of the second segment (Second), and
the rightmost index of the second segment (StopNext) is one
less than the sum of its leftmost index and its size.
However, the main MergeSort procedure treats the array that is
being sorted as if it were always exactly divisible into segments of sizes
1, 2, 4, 8, 16, and so on -- on each successive pass, the segment size
doubles. Unless the number of elements in the array is a power of 2, this
assumption doesn't always work -- there will be at least one pass in which
the second of a supposed pair of segments would lie beyond the actual end
of the array. The first of the if-statements tests for this
pathological situation; if it finds it, then StopFirst is set
back to be the last true element of the array.
The second if-statement deals similarly with the case in which
the second segment, though it exists, contains fewer elements than the
first. The MergeSort procedure is written as if the second
segment could be allowed to project beyond the real end of the array.
StopNext is set to Max to ensure that it does not
so project. In both cases, the objective is to avoid referring to
non-existent array elements (ones with indices greater than
Max), no matter what mistaken assumptions
MergeSort makes.
The assignment to NewItem doesn't really go with the two
if-statements. NewItem keeps track of the
position in the Temp array to which the next element is to be
copied; the assignment simply initializes it appropriately before starting
the loop that performs the merge.
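A few concrete numbers may make the two tests easier to follow. The sketch below isolates the two if-statements in a hypothetical procedure, ComputeBounds, and tries them on an array of ten elements. The constant Max = 10, the helper Show, and the assumption that Second is always First + Size are mine, not Walker's.

```pascal
program SegmentBounds;
{ Illustrative only: ComputeBounds and Show are hypothetical names, and
  Second = First + Size is an assumption about how the caller works. }
const
  Max = 10;                       { index of the last real array element }

procedure ComputeBounds (Second, Size: Integer;
                         var StopFirst, StopNext: Integer);
begin
  if Second > Max + 1 then
    StopFirst := Max              { first segment runs into the end of the array }
  else
    StopFirst := Second - 1;
  if Second + Size > Max + 1 then
    StopNext := Max               { second segment is short or missing }
  else
    StopNext := Second + Size - 1
end;

procedure Show (First, Size: Integer);
var
  Second, StopFirst, StopNext: Integer;
begin
  Second := First + Size;
  ComputeBounds (Second, Size, StopFirst, StopNext);
  WriteLn ('first: ', First:1, '..', StopFirst:1,
           '   second: ', Second:1, '..', StopNext:1)
end;

begin
  Show (1, 4);    { two full segments: 1..4 and 5..8 }
  Show (9, 4);    { first segment clipped to 9..10; second is empty (13..10) }
  Show (1, 8)     { second segment, 9..10, is shorter than the first, 1..8 }
end.
```

In the second call the "segment" 13..10 has its left end past its right end, so a merge loop that stops when it passes StopNext never touches a non-existent element.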
Is there a spin-off of merge sort that could be used on two lists that aren't sorted, or would it be better to sort each list separately, then merge or combine the lists out of order, then sort?
Merge sort is easily adapted to linked lists. Instead of trying to divide
the list in the middle each time, it is easier to split it into sublists by
transferring alternate elements, as shown in the Split
procedure below.
type
  Link = ^Component;
  Component = record
    Datum: Element;
    Next: Link
  end;

procedure MergeSort (var Info: Link);
var
  First, Second: Link;

  procedure Split (var Info: Link; var First, Second: Link);
  begin
    if Info = nil then begin
      First := nil;
      Second := nil
    end
    else if Info^.Next = nil then begin
      First := Info;
      Second := nil
    end
    else begin
      New (First);
      New (Second);
      First^.Datum := Info^.Datum;
      Second^.Datum := Info^.Next^.Datum;
      Split (Info^.Next^.Next, First^.Next, Second^.Next);
      Dispose (Info^.Next);
      Dispose (Info)
    end
  end;

  procedure Merge (var First, Second: Link; var Merged: Link);
  begin
    if First = nil then
      Merged := Second
    else if Second = nil then
      Merged := First
    else begin
      New (Merged);
      if First^.Datum <= Second^.Datum then begin
        Merged^.Datum := First^.Datum;
        Merge (First^.Next, Second, Merged^.Next);
        Dispose (First)
      end
      else begin
        Merged^.Datum := Second^.Datum;
        Merge (First, Second^.Next, Merged^.Next);
        Dispose (Second)
      end
    end
  end;

begin
  if Info <> nil then
    if Info^.Next <> nil then begin
      Split (Info, First, Second);
      MergeSort (First);
      MergeSort (Second);
      Merge (First, Second, Info)
    end
end;
How does Mergesort compare to Quicksort? Is it faster? Does the number
of items make a difference in choosing between the two? Do you prefer
either one?
Merge sort is usually slower than quicksort, but it has two advantages that would lead one to prefer it in certain cases: (1) Its worst-case behavior is better than quicksort's. In other words, quicksort by comparison is usually better, but when it's bad, it's very bad indeed. Merge sort is more reliable; it takes about the same amount of time no matter what the order of the original data is. (2) The array version of merge sort is stable, in the sense that if two items have equal keys (so that neither precedes the other in the comparison that the sort performs), merge sort preserves their relative order. So, for example, if you take an alphabetized list of student records and sort them by ZIP code, a group of students who all have the same ZIP code will remain alphabetized after a merge sort, but not necessarily after a quicksort. (However, the linked-list version of merge sort, shown above, is not stable.)
The array version of merge sort uses almost twice as much storage as the array version of quicksort, so quicksort should definitely be preferred in cases where the array is so large that one can't easily fit two of them into memory at the same time.
I generally prefer quicksort, since its worst-case behavior is extremely rare, but there are a lot of cases in which either of the two methods would be a good choice.
How does merge sort compare to the other sorting algorithms we have learned? Quicksort did not become particularly efficient unless there was a large number of elements. Is that also true with merge sort?
Merge sort is faster than insertion or selection sorting for large arrays. The arrays have to be somewhat larger to justify changing to merge sort than to justify using quicksort; I'd guess that the turn-around point is around 150 elements.
I notice that the merge sort is very consistent whether it's sorting data in ascending, descending, or random order. Is it also consistent no matter what size the list of data is, or is it like quicksort in that it is less efficient than simpler sorts unless the list is quite large?
The running time of the merge sort on an array of size n is consistently proportional to n lg n (that is, it is O(n lg n)). This means that it takes longer in total as the size of the array increases, and even spends more time per element, but also that it becomes progressively faster in comparison to an O(n^2) sort such as the insertion sort or the selection sort.
I can see that merge sort is very efficient for combining two large stacks, but how efficient is it when you have many stacks of one? Would it be better to start sorting a large list with one sort and finish with a merge sort?
Just as in the case of quicksort, you can get better performance by sorting the sufficiently small segments with insertion sort or selection sort, and then using the merge sort to combine the sorted segments quickly.
I was thinking about the merge sort, and how efficient it is. For sorting a largish unordered array, you can split the array up into lots of very small subarrays (possibly of size 1) and then merge the pieces. This is kind of slow, but would it be possible to use a parallel processor to make a fast sorting algorithm of this type? It seems like it might be useful, since you would be able to merge the subarrays independently and therefore simultaneously.
That's right. On a machine that has, say, half as many processors as the array has elements, a parallel version of merge sort is fairly easy to program and would indeed run much faster than the uniprocessor version. If there are fewer processors, the advantage isn't as great, but you could still take advantage of the parallelism on each pass except the last.
Can a MergeSort be done effectively on more than two lists? If so, how does this affect performance?
It's not difficult to write out the code for a variant of
MergeSort that merges three or more sub-arrays or lists at
each step; however, the advantage one gains by having to make fewer passes
over the data is outweighed by the greater difficulty of comparing three or
more items to determine which should go first in the merged sub-array or
list.
Is there a slick way to recursively deallocate the storage associated
with a linked list using the Dispose procedure?
Yes:
procedure DeallocateList (var Head: Link);
begin
  if Head <> nil then begin
    DeallocateList (Head^.Next);
    Dispose (Head)
  end
end;
When one deallocates a list, what is it that recycles the nodes to which
pointers no longer point? Is that why one writes the $heap_dispose
on$ compiler directive?
Ultimately, the Dispose procedure is invoked to recycle each
node (the DeallocateList procedure in the handout on lists calls DisposeRest,
which calls Dispose). Under HP Pascal, the compiler ignores
calls to Dispose unless the $heap_dispose feature
is turned on, which is why one writes it at the top of any program or
module in which one wants recycling of dynamic storage.
I'm not sure I really understand how one leaks memory. I know that it happens when one does not deallocate a list in certain situations, but I don't really know what those situations are. What are the warning signs one looks for in order to avoid memory leaks and, briefly, how are the leaks caused in the first place?
A memory leak occurs when a pointer that provides the only remaining way of accessing some dynamically allocated storage location is overwritten or itself becomes inaccessible. In such a case, the dynamically allocated storage is not recycled, but its contents can no longer be inspected or modified by the program. If the program makes heavy use of dynamic storage allocation, the effect of a sequence of such leaks is that the program commandeers more and more of the memory available on the machine until the supply is exhausted, at which point the program crashes (or behaves in some other obnoxious, implementation-dependent way).
To avoid memory leaks, the programmer must keep track of every chunk of dynamically allocated storage and remember to deallocate it before overwriting or otherwise losing touch with her last pointer to it.
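The simplest way to leak storage is to reuse a pointer variable without releasing what it currently points to. Here is a minimal sketch of that situation and its remedy; the program and variable names are invented for the example, and the Link and Component types simply mirror the ones used in the merge sort above, with Integer data assumed.

```pascal
program LeakDemo;
type
  Link = ^Component;
  Component = record
    Datum: Integer;
    Next: Link
  end;
var
  P: Link;
begin
  New (P);
  P^.Datum := 1;
  P^.Next := nil;

  { Calling New (P) at this point would overwrite the only pointer to
    the first component, which could then never be reached or recycled:
    that would be a leak. }

  Dispose (P);           { release the first component before reusing P }
  New (P);
  P^.Datum := 2;
  P^.Next := nil;
  WriteLn (P^.Datum);
  Dispose (P)            { nothing is left allocated at the end }
end.
```

The warning sign to look for is any assignment to a pointer variable, or any call to New on it, at a point where that variable might be the last route to a component that has not yet been disposed of.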
Do you have any helpful hints about how to not leak storage? In
order to make the Int data type opaque, I've exported it as a
pointer type to a record which holds the sign and the magnitude. However,
I find myself being extremely concerned with whether or not an existing
storage location would get clobbered. Should I adopt the convention that
functions returning Ints will allocate them fresh? What about
return values from procedures? Should I explicitly deallocate them if
necessary before overwriting their contents?
For functions returning Int, the best solution is indeed to
allocate fresh storage for the value returned, unless the function
always returns one of its arguments unchanged. Values returned
through variable-parameters in procedures should also be freshly allocated.
Procedures should not, however, deallocate their variable-parameters before overwriting them. It would be quite dangerous to do so, since the application programmer may neglect to initialize the corresponding argument before invoking such a procedure, and attempting to deallocate through an uninitialized pointer is likely to produce a bus error.
I don't understand the business about garbage collection. Why is it necessary to recycle pieces of the program? Why has this never come up before?
It's not necessary to recycle program text -- only to recycle the storage
occupied by values that are constructed and used as intermediate steps in a
computation. This problem is particularly acute for the
Natural data type because people tend to write arithmetic
expressions that have lots of operators in them and hence involve the
construction of many intermediate values.
However, the issue of garbage collection also comes up in connection with other data types -- for example, sequences. If one is constructing a long sequence one element at a time, it's annoying to have to say
Temporary := ConstructSequence (NewFirst, Seq);
DeallocateSequence (Seq);
Seq := Temporary
rather than
Seq := ConstructSequence (NewFirst, Seq)
as one would write in a language that provided garbage collection.
I have a question about how to write a procedure to deallocate a linked list. Here are the definitions of the types I used:
SetPoint = ^SetType;
SetType = record
Size: Integer;
Front: SetPoint;
Rear: SetPoint
end;
You can do it either iteratively or recursively:
procedure DeallocateSet (var Delend: Congeries);
var
  Traverser, Trailer: SetPoint;
begin
  Traverser := Delend;
  while Traverser <> nil do begin
    Trailer := Traverser;
    Traverser := Traverser^.Rear;
    Dispose (Trailer)
  end;
  Delend := nil
end;

{ or }

procedure DeallocateSet (var Delend: Congeries);

  procedure DeallocateList (var Pointer: SetPoint);
  begin
    if Pointer <> nil then begin
      DeallocateList (Pointer^.Rear);
      Dispose (Pointer)
    end
  end;

begin { procedure DeallocateSet }
  DeallocateList (Delend);
  Delend := nil
end;
Both of these methods presuppose, however, that the linked list is
correctly put together in the first place, with Delend being
nil for an empty list or pointing to the foremost component of
a non-empty list, the Rear pointer of each component pointing
to the next one in line, and the Rear pointer of the last
component being nil. If any of these preconditions fails, the
procedure won't work.
Concerning programming in general, do you prefer to use linked-list structures to hold data or do you prefer arrays? When I came to Grinnell, I preferred arrays by 300%, but these days, I hardly ever use them at all. What's your personal favorite?
When I'm writing a program to solve a particular problem, the nature of the problem usually determines the kind of data structure that should be used. I prefer arrays when the problem requires random access to a fixed number of data and lists when the problem requires only sequential access or when no upper bound on the size of the data set is known in advance.
If the problem doesn't favor one structure over the other, I tend to favor the structure that is better supported by the programming language in which I'm working. Pascal and C, for example, support arrays better than lists; the opposite is true in, say, Prolog.
Finally, in a language such as Scheme or Common Lisp that supports both structures well, I generally prefer lists because they seem more flexible and conceptually simpler (at some very abstract level).
Is there a type of file structure analogous/equivalent to stacks? Can one, for example, write a file and then read backwards? Or is it simpler to store an entire array or linked list in a file and read it in when I want to use it? Could this be done with a random-access file? Would stacks ever be large enough in a practical application that memory limitations would make using a disk file more desirable?
You could certainly implement a stack using HP Pascal's direct-access file operations. It would be a lot slower than a stack in memory, but it could in principle have more elements. The stack operations would be just as easy to implement with direct-access files as with arrays.
When Walker discusses using arrays for the stack abstract data-type, is it just for fun, or could there be advantages to using arrays and not pointers?
In some programming languages, you don't have any pointers, so you may need to know how to implement stacks with arrays.
In addition, the array implementation is usually faster than the linked-list implementation and uses less space. The linked-list implementation is preferable only if (1) it's impossible to determine the largest number of elements that will ever be in the stack, or (2) the stack will usually contain many fewer elements than the theoretical maximum.
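For comparison, here is a minimal sketch of what an array-based stack might look like. It is not Walker's code: the capacity MaxStack, the field names Count and Items, and the choice of Char as the Data type are all assumptions made for the illustration. The operation names and argument order follow the ones used elsewhere in these answers.

```pascal
program ArrayStackDemo;
const
  MaxStack = 100;        { assumed capacity, fixed at compilation time }
type
  Data = Char;
  StackType = record
    Count: Integer;                        { number of elements in use }
    Items: array [1 .. MaxStack] of Data   { Items[Count] is the top }
  end;

procedure InitializeStack (var S: StackType);
begin
  S.Count := 0
end;

function Empty (var S: StackType): Boolean;
begin
  Empty := (S.Count = 0)
end;

function Full (var S: StackType): Boolean;
begin
  Full := (S.Count = MaxStack)
end;

{ Precondition: not Full (S). }
procedure Push (Item: Data; var S: StackType);
begin
  S.Count := S.Count + 1;
  S.Items[S.Count] := Item
end;

{ Precondition: not Empty (S). }
procedure Pop (var Item: Data; var S: StackType);
begin
  Item := S.Items[S.Count];
  S.Count := S.Count - 1
end;

var
  S: StackType;
  C: Char;
begin
  InitializeStack (S);
  Push ('a', S);
  Push ('b', S);
  Push ('c', S);
  while not Empty (S) do begin
    Pop (C, S);
    Write (C)            { elements come off in last-in, first-out order }
  end;
  WriteLn
end.
```

Each operation is a couple of assignments with no call to New or Dispose, which is where the speed advantage comes from; the price is that the full array is reserved even when the stack is nearly empty.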
Are there very many useful applications for a type that can only read from the top like a stack? What are some of them?
Sure -- any time you're modeling a process or structure that is strictly ``last-in, first-out,'' a stack is a plausible choice. Of course, in Pascal, many such processes are modelled by recursive procedures or functions, in which case you usually don't need an explicit stack -- because the run-time stack is doing the job for you.
For example, in expressing an integer as a numeral (a character string), we
found that it was simplest to obtain the digits of the numeral in
right-to-left order (least significant digit first). In the handout on numeration, I compensated for this by
using a recursive procedure, Helper, to stack up the digits.
But it would be possible to use an explicit stack instead:
procedure Express (Value: Integer;
                   var Numeral: packed array [Low .. High: Integer] of Char;
                   Base: Integer);
var
  DigitStack: StackType;
  Length: Integer;
  Position: Integer;
begin
  InitializeStack (DigitStack);
  Length := 0;
  repeat
    Length := Length + 1;
    Push (DigitFor (Value mod Base), DigitStack);
    Value := Value div Base
  until Value = 0;
  for Position := 1 to Length do
    Pop (Numeral[Position], DigitStack)
end;
If stacks are so useful, why weren't they included as a data type in the
Pascal language?
Because the designer of Pascal wanted to keep the language small and to put the programmer (rather than the run-time system) in charge of dynamic storage allocation. Stacks, though useful, are not as fundamental as arrays or records, and can easily be defined in terms of the data types that Pascal provides; and one of the most widely used implementation types (linked lists) uses dynamic storage allocation.
Would it ever be useful to define a function that added something to the middle of a stack instead of just putting it on top?
Sure. But then you'd call the data structure a list rather than a stack.
Is there any pre-defined or definable function that will accept a stack or set variable and return the number of elements in that stack or set? These types aren't like arrays, so I don't see how you can go in and count their elements, but I thought there might be some other way to accomplish this.
Some authors recommend adding a Size function to the interface
for a Stack data type. Theoretically, the application
programmer can define this function in terms of the other primitives:
function Size (S: StackType): Integer;
var
  Tally: Integer;
  OtherStack: StackType;
  Item: Data;
begin
  Tally := 0;
  InitializeStack (OtherStack);
  while not Empty (S) do begin
    Tally := Tally + 1;
    Pop (Item, S);
    Push (Item, OtherStack)
  end;
  while not Empty (OtherStack) do begin
    Pop (Item, OtherStack);
    Push (Item, S)
  end;
  Size := Tally
end;
But it would surely be more efficient for the implementer to provide it.
Wouldn't it be nice to have an EmptyStack, which would pop
off all of the elements?
Some authors recommend this as well (often under the name
Clear), but in this case it's almost as easy for the
application programmer to write it up herself:
procedure Clear (var S: StackType);
var
  Wastebasket: Data;
begin
  while not Empty (S) do
    Pop (Wastebasket, S)
end;
How about a Peek function that returns the top element off
the stack without actually popping it off?
This is exactly what the operation that Walker calls Top
does.
If the queue data type is implemented with pointers, then how would one
carry out the Full function? There is no count variable in
the pointer record, so I can't see how one would keep track of the size of
the list.
There are three possibilities: (1) Have the Full function
always return false, since there's no fixed upper bound to the
number of components one can allocate dynamically. (2) Measure the linked
list every time Full is invoked:
function Full (Queue: QueueType): Boolean;
var
  Traverser: QueuePtr;
  Length: Integer;
begin
  Length := 0;
  Traverser := Queue.Head;
  while Traverser <> nil do begin
    Length := Length + 1;
    Traverser := Traverser^.Next
  end;
  Full := (MaxQueue <= Length)
end;
(3) Add a third field to the QueueType record, initialize it
to 0 in InitializeQueue, increment it in
Insert, decrement it in Delete, and compare it
with MaxQueue in Full.
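Possibility (3) might be sketched roughly as follows. This is only an illustration under stated assumptions: the field names (Item, Next, Head, Tail, Count), the parameter order of Insert and Delete, and the value chosen for MaxQueue are all hypothetical and need not match Walker's declarations.

```pascal
program CountedQueueSketch (Output);

const
  MaxQueue = 3;   { hypothetical capacity, kept small for the demonstration }

type
  Data = Integer;
  QueuePtr = ^QueueNode;
  QueueNode = record
    Item: Data;
    Next: QueuePtr
  end;
  QueueType = record
    Head, Tail: QueuePtr;
    Count: Integer          { the added third field }
  end;

procedure InitializeQueue (var Queue: QueueType);
begin
  Queue.Head := nil;
  Queue.Tail := nil;
  Queue.Count := 0
end;

function Full (Queue: QueueType): Boolean;
begin
  { constant time: no traversal of the linked list is needed }
  Full := (MaxQueue <= Queue.Count)
end;

procedure Insert (Item: Data; var Queue: QueueType);
var
  Fresh: QueuePtr;
begin
  New (Fresh);
  Fresh^.Item := Item;
  Fresh^.Next := nil;
  if Queue.Head = nil then
    Queue.Head := Fresh
  else
    Queue.Tail^.Next := Fresh;
  Queue.Tail := Fresh;
  Queue.Count := Queue.Count + 1
end;

{ Assumes that the queue is not empty. }
procedure Delete (var Item: Data; var Queue: QueueType);
var
  Old: QueuePtr;
begin
  Old := Queue.Head;
  Item := Old^.Item;
  Queue.Head := Old^.Next;
  if Queue.Head = nil then
    Queue.Tail := nil;
  Dispose (Old);
  Queue.Count := Queue.Count - 1
end;

var
  Q: QueueType;
  Value: Data;

begin
  InitializeQueue (Q);
  Insert (10, Q);
  Insert (20, Q);
  Insert (30, Q);
  WriteLn (Full (Q));    { three items, MaxQueue = 3: the queue is full }
  Delete (Value, Q);
  WriteLn (Full (Q))     { two items remain: no longer full }
end.
```

The cost of this choice is one extra integer per queue and one extra assignment per insertion or deletion, in exchange for a Full test that no longer depends on the length of the list.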
As I was standing in line to get my mail at the post office this
morning, I thought of another operation that should be added to the queue
interface. Why didn't Walker include a DeleteItemFromQueue
procedure? That would be useful for simulating a customer who gets fed up
with waiting in line and leaves.
The next thing you'll want is a RandomMillingCrowd data type
to model the line at Cowles Dining Hall.
What would be the advantages of implementing a queue as a doubly-linked list? Wouldn't that change the whole abstract data type? Also, if this would be a great advantage, why not just code it as a doubly linked list with a header all the time?
I don't see any advantage in implementing a queue as a doubly-linked list. Each component would be four bytes larger in order to hold a backwards link which none of the queue operations ever needs, and the queue insertion and deletion operations would have to waste time updating these extra links.
Is there any fundamental difference between queues and stacks other than the order in which they are accessed (which would of course affect the applications)?
No. Considered as abstract data types (that is, apart from their implementation), ``stack'' and ``queue'' don't refer to anything beyond their values and the operations defined on them.
What is the difference between a queue using pointers and a linked list?
You can examine elements in a linked list without first removing them from the list, whereas there are no procedures or functions in the definition of the queue abstract data type for examining elements of the queue. You can insert and delete elements at any position in a linked list, whereas in a queue you can only insert them at the rear and delete them at the front.
The queue abstract data type can be implemented using a linked list structure (just not exercising all the possible ways of operating on such a structure), so as far as how they're laid out in a computer's memory there may be no difference at all.
Why do we need separate objects holding onto both the head and tail of the queue? This seems like a waste. Why not make the tail object a double-linked list that would hold on to both? Sort of like a snake biting its own tail...or rather, a snake's tail biting its own head.
If you used a doubly-linked circular list to implement a queue, you'd have to allocate an extra pointer in every component of the list, since in general a component that is added at the rear does not remain at the rear; other items are added behind it. It's usually more wasteful to allocate four bytes per list component than to perform a one-time allocation of the eight-byte queue header.
Can radix sorting be applied to data types other than records?
Sure. It can be applied whenever the items are to be sorted according to some key that is a short array or can be treated as such. Non-negative integers, for instance, could be sorted by applying the radix sort, starting from the rightmost digit of the decimal representation and working leftwards. You'd have to assume that they were padded with zeroes on the left.
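A minimal sketch of the integer case, under stated assumptions: the ten buckets are plain arrays standing in for the handout's small queues, and the names (Buckets, Digits, Divisor) and the sample data are invented for the illustration. The keys are treated as three-digit decimal numerals, padded with zeroes on the left.

```pascal
program RadixSketch (Output);

const
  Count = 8;     { number of items in the demonstration }
  Digits = 3;    { number of decimal digits in each zero-padded key }
  Base = 10;

type
  ItemArray = array [1 .. Count] of Integer;
  Bucket = record
    Size: Integer;
    Items: ItemArray
  end;

var
  Data: ItemArray;
  Buckets: array [0 .. 9] of Bucket;
  Pass, Divisor, Index, Digit, Next: Integer;

begin
  Data[1] := 170;  Data[2] := 45;  Data[3] := 75;  Data[4] := 90;
  Data[5] := 802;  Data[6] := 24;  Data[7] := 2;   Data[8] := 66;

  Divisor := 1;
  for Pass := 1 to Digits do begin
    for Digit := 0 to Base - 1 do
      Buckets[Digit].Size := 0;

    { Distribute: append each item to the bucket selected by the
      current digit, rightmost digit first. }
    for Index := 1 to Count do begin
      Digit := (Data[Index] div Divisor) mod Base;
      Buckets[Digit].Size := Buckets[Digit].Size + 1;
      Buckets[Digit].Items[Buckets[Digit].Size] := Data[Index]
    end;

    { Collect: empty the buckets in order, keeping first-in,
      first-out order within each bucket. }
    Next := 0;
    for Digit := 0 to Base - 1 do
      for Index := 1 to Buckets[Digit].Size do begin
        Next := Next + 1;
        Data[Next] := Buckets[Digit].Items[Index]
      end;

    Divisor := Divisor * Base
  end;

  { The items now appear in ascending order. }
  for Index := 1 to Count do
    Write (Data[Index] : 4);
  WriteLn
end.
```

Each pass moves every item twice (once into a bucket, once back), and no two keys are ever compared with one another.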
It seems that most things can be interpreted as a key. For example, the number 9845923.2345994 can be interpreted as a string of numbers, with spaces added to the beginning and end as necessary. Why wouldn't one want to use the radix sort generally?
Because the time required to isolate the digits in specified positions of a number might be excessive, or because the number of digits in the number (or, more generally, the number of components of the key) might be so great that the radix sort would spend most of its time examining extremely insignificant digits.
Instead of radix sorting, why not apply quicksort to the queue? (Put the values to be sorted into an appropriate structure such as an array, and apply quicksort.) Isn't radix sorting more useful than quicksort only when dealing with non-numerical characters?
If the key has k components and the queue has n elements, the radix sort requires no comparisons and 2kn data movements; the quicksort requires O(n lg n) comparisons and O(n lg n) data movements in the average case. If k is not too much larger than lg n, therefore, the radix sort can easily be faster than quicksort.
Whether the components of the key are characters or numbers makes no difference to the utility of the radix sort.
Would it be possible to set up a radix sort for a string of characters, provided that the strings were of a fixed length? It seems that if the strings were padded, it could be implemented quite well.
Yes, this would be possible. Remember, though, that you need to set up one of those ``small queues'' for each possible character value, and that the number of passes over the data will be equal to the length of the string.
What size of data set would the radix sort be most efficient at sorting? Does the radix sort vary much in speed if the data is almost in order compared to completely random data?
The relative speed of a radix sort improves as the data set gets larger, since the running time is a linear function of the size of the data set.
As it's implemented in the handout, the original arrangement of the data makes almost no difference in the running time of the sort.
So the radix sort is an O(n) algorithm?
Yes, unless you consider the number k of components in the key to be one of the parameters, in which case it is O(kn).
Why do you make the point of saying "If a key is a short array..."? What's so important about the key being short? I can understand that you don't want lots of queues in memory at any particular point, but is there a more subtle reason for requiring a short array?
The number of queues required depends on the number of alternative values that can be stored in one component of the key, not on the length of the key.
Radix sort works better for short keys than for long ones because it makes fewer passes over the data -- it's that simple. An O(kn) algorithm doesn't stack up nearly so well against, say, an O(n lg n) algorithm if k is much larger than lg n.
What makes the radix sort so fast, from a theoretical standpoint? As implemented by your code in the handout, it only requires a small amount of memory more than necessary to store the entire linked list. The operations of enqueuing and dequeuing are more expensive than simple data swaps. The best I can come up with is that because you know about the structure of the key, it's possible to move things into their position very quickly without comparisons.
Comparison-and-exchange sorts presuppose that the number of possible values for the key is infinite and do not exploit the key's internal structure in any way. The radix sort presupposes that there are only so many different possible values for each component of the key and that the number of components is fixed and finite; this reliance on the key's structure is perhaps the main theoretical difference between the radix sort and the other sorts we've studied.
Does the linked-list implementation of radix sort work for linked lists of variable length?
Linked lists are by definition of variable length -- there's no other kind. The radix sort described in the handout can be implemented with linked lists directly rather than with queues as an abstract data type, and such an implementation would be faster, though perhaps harder to understand.
If sorting by comparison is at best an O(n log n) process, how does the Post Office deliver mail so quickly, aside from having "parallel processing" (lots of workers to work on sorting mail all at once)? Is it that they can divide the work into a subproblem by first sorting on ZIP code and make use of some specialized information about the problem?
The main reason is that the Postal Service doesn't have to completely sort (i.e., arrange in one linear, totally ordered data structure) all of the items in its possession at any one time. Instead, it uses a large number of much smaller sorts, using lots of parallelism and partial sorting.
Also, the principal sorting algorithm that the Postal Service uses is a variant of radix sorting, which is not a comparison-and-exchange sort at all, but works by analysis of the sort key (in this case, the digit-by-digit structure of ZIP codes) and can therefore be faster than an O(n lg n) sort.
How is data sorting implemented in the real world? I've used some large database management programs, and one thing that they all have in common is that upon sorting data, they create an index file specific to that sort (or sometimes, that sort is added to a main index file). Subsequently, if that sort is repeated, they examine the index file, without reperforming the sort. Is such an index file just composed of two integers for each record, one giving its ``unique record number,'' or unsorted position, and one giving its sorted position? Does maintaining such an index make it easier to add records to already-sorted data, either one at a time or in large groups?
The setup you describe is common in situations where the records are very large (so that data movements are extremely slow, while comparisons are fast), perhaps too large or too numerous to be held in memory all at once, and are stored initially in a direct-access file. The index file could have a structure as simple as the one you describe (a list of pairs of integers), or it could be internally structured as a binary tree, a heap, or a hash table (structures that we'll see later in the semester).
What algorithm do database systems like this use? I suspect that it's merge sort, based on your comment about stability, but I can't prove it.
If direct-access operations on files are unavailable or much less efficient than sequential-access operations, some variation of merge sort is almost always used when the files are too large to be copied into memory.
Are cursors just basically pointers pointing at positions that you want to keep track of in a linked list?
If the list is implemented as a pointer structure, as in the handout, then it's natural to implement the cursor as a pointer into the middle of the list.
I don't understand what the find-next-by-test function in the lists with cursors module is supposed to do. Why is it a necessary part of the abstract data type?
Given a list and a Boolean function that can be used to test each element, the find-next-by-test function advances the cursor to point to the next element that passes the test (that is, the next element for which the Boolean function returns true). If none of the remaining elements passes the test, the cursor becomes null.
It's not really a necessary part of the abstract data type, since you could code it using the other features of the module thus:
if NullCursorInList (Operand) then
  CursorToStartOfList (Operand);
Continue := True;
while Continue do
  if NullCursorInList (Operand) then
    Continue := False
  else if Test (ElementAtCursorInList (Operand)) then
    Continue := False
  else
    AdvanceCursorAlongList (Operand)
Why is there no function or procedure in Sequences or
Lists that tests directly whether one sequence is equal to
another?
Laziness, together with a general opinion that such a function would not be invoked often. It would make an attractive addition to either module and would be straightforward to code:
function EqualSequences (LeftOperand, RightOperand: Sequence): Boolean;
begin
  if LeftOperand = TheEmptySequence then
    EqualSequences := (RightOperand = TheEmptySequence)
  else if RightOperand = TheEmptySequence then
    EqualSequences := False
  else if LeftOperand^.Datum = RightOperand^.Datum then
    EqualSequences := EqualSequences (LeftOperand^.Next, RightOperand^.Next)
  else
    EqualSequences := False
end;
I have a simple question about why we don't have a sorting procedure in the
list-with-cursors module. We talked about quite a lot of sorting algorithms
and their efficiency. However, they don't seem quite as useful when applied
to pointer structures. Are there efficient ways to sort linked lists like
these?
This too would be an attractive addition to the module. Insertion sort, quicksort, and merge sort can all be adapted to use list operations:
procedure SortListByInsertion (var Ls: List);
var
  Result: List;
  FirstElement: Element;
  Continue: Boolean;
begin
  Result := MakeEmptyList;
  while not EmptyList (Ls) do begin
    FirstElement := FirstOfList (Ls);
    DeleteFirstOfList (Ls);
    { Advance the cursor past every element of Result that
      precedes FirstElement. }
    CursorToStartOfList (Result);
    Continue := True;
    while Continue do
      if NullCursorInList (Result) then
        Continue := False
      else if PrecedesElement (ElementAtCursorInList (Result),
                               FirstElement) then
        AdvanceCursorAlongList (Result)
      else
        Continue := False;
    InsertAtCursorInList (Result, FirstElement)
  end;
  DeallocateList (Ls);
  Ls := Result
end;
procedure SortListByQuicksort (var Ls: List);
var
  Smalls, Larges: List;
  Pivot: Element;

  procedure Partition (var Ls: List; var Smalls: List; var Pivot: Element;
                       var Larges: List);
  begin
    Smalls := MakeEmptyList;
    Larges := MakeEmptyList;
    Pivot := FirstOfList (Ls);
    DeleteFirstOfList (Ls);
    CursorToStartOfList (Ls);
    while not NullCursorInList (Ls) do begin
      if PrecedesElement (ElementAtCursorInList (Ls), Pivot) then
        PrependToList (ElementAtCursorInList (Ls), Smalls)
      else
        PrependToList (ElementAtCursorInList (Ls), Larges);
      AdvanceCursorAlongList (Ls)
    end;
    DeallocateList (Ls)
  end;

begin
  if 1 < Length (Ls) then begin
    Partition (Ls, Smalls, Pivot, Larges);
    SortListByQuicksort (Smalls);
    SortListByQuicksort (Larges);
    { element first, list second, as in the other calls }
    AppendToList (Pivot, Smalls);
    ConcatenateList (Smalls, Larges);
    DeallocateList (Larges);
    Ls := Smalls
  end
end;
procedure SortListByMerge (var Ls: List);
var
  First, Second: List;

  procedure Split (var Ls: List; var First, Second: List);
  var
    ToFirst: Boolean;
      { indicates whether the next element extracted from Ls should be
        placed in First (True) or Second (False) }
  begin
    First := MakeEmptyList;
    Second := MakeEmptyList;
    ToFirst := True;
    while not EmptyList (Ls) do begin
      if ToFirst then
        PrependToList (FirstOfList (Ls), First)
      else
        PrependToList (FirstOfList (Ls), Second);
      DeleteFirstOfList (Ls);
      ToFirst := not ToFirst
    end;
    DeallocateList (Ls)
  end;

  procedure Merge (var First, Second: List; var Ls: List);
  begin
    Ls := MakeEmptyList;
    while not EmptyList (First) and not EmptyList (Second) do
      if PrecedesElement (FirstOfList (First), FirstOfList (Second)) then begin
        AppendToList (FirstOfList (First), Ls);
        DeleteFirstOfList (First)
      end
      else begin
        AppendToList (FirstOfList (Second), Ls);
        DeleteFirstOfList (Second)
      end;
    ConcatenateList (Ls, First);
    DeallocateList (First);
    ConcatenateList (Ls, Second);
    DeallocateList (Second)
  end;

begin
  if 1 < Length (Ls) then begin
    Split (Ls, First, Second);
    SortListByMerge (First);
    SortListByMerge (Second);
    Merge (First, Second, Ls)
  end
end;
What would you do if you wanted to copy a list and leave the cursor at
the same place in the copied list as it was in the original list? Although
I understand why you would want to set the cursor at nil, it seems that
retaining the cursor position in a copied list would be useful sometimes
also.
In most cases, I suspect that it would turn out to be equally convenient to create the duplicate list before placing the cursor in the original and then to move the cursors of the original and the duplicate in parallel. I'm reluctant to add a cursor-preserving copy operation to the interface, both because it's hard to implement efficiently and because it breaks the three simplifying principles listed at the beginning of the section on implementation in the handout.
One of the main differences in the datatypes described in the sequences module and the lists module is the addition of a size datum in the lists module. Is this only an addition of a procedural nature, or does the nature of the list as a container object necessitate a size datum?
The fact that a container object such as a list can change its size after creation implies that the length operation may be applied to it more frequently than to a sequence.
Why not add or delete from in front of the cursor instead of behind?
How then would one add an element at the beginning of a list, or delete the first element?
In the current version, the addition or deletion is not really ``behind'' the cursor; the cursor is positioned at the exact point at which the addition or deletion is to be made.
Are containers in general different from objects that aren't containers?
Yes. The key difference is that containers are mutable -- you can apply operations to a container that modify its contents without creating a new container. Applying an analogous operation to a non-container object, such as a sequence, results in the creation of a new object.
Are there any differences between container and non-container types, apart from the fact that containers are mutable?
No. There are other ways of describing the difference, though; another way to put it is that sequences (for instance) are like constants, while lists (for instance) are like variables.
Section 5.strings of the HP Pascal/HP-UX programmer's guide reads:
A string is allocated four bytes for its current length (an integer), one byte per character, and one ``housekeeping'' byte. The number of characters is the string's declared maximum length. The ``housekeeping'' byte is only accessible to some of the standard string functions.
What is a ``housekeeping'' byte? And what functions access it?
Some algorithms can be formulated so as to execute faster if an extra
storage location is used. (For instance, a linear-search algorithm can be
speeded up by placing a copy of the value sought in an extra storage
location at the end of the data structure; since the search will always
``succeed,'' it is not necessary to execute the test that detects the end
of the structure in the search loop.) Most likely, the housekeeping byte
in an HP Pascal string is used to make it possible to use such algorithms
in the library of predefined string functions; I notice that one of them,
StrPos, performs a search.
Since the housekeeping byte is used only inside the string functions, one can observe its value only by sneaky methods (inspecting memory through the debugger, cheating on variant records, and so on) -- the idea is that it's protected storage, used in HP Pascal's implementation of strings but supposedly not visible to the HP Pascal programmer, who sees only the abstract interface.
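The sentinel technique mentioned above can be sketched as follows. This is a hypothetical single-character search over a record invented for the illustration (Buffer, Chars, Position are assumed names); it is not HP Pascal's actual string layout or its StrPos routine, whose internals are not documented here.

```pascal
program SentinelSketch (Output);

const
  MaxLength = 10;

type
  Buffer = record
    Length: Integer;
    { MaxLength characters plus one extra slot, which plays the part
      of the housekeeping byte }
    Chars: array [1 .. 11] of Char
  end;

{ Returns the position of the first occurrence of Target in B,
  or 0 if Target does not occur. }
function Position (Target: Char; var B: Buffer): Integer;
var
  Index: Integer;
begin
  B.Chars[B.Length + 1] := Target;     { plant the sentinel }
  Index := 1;
  while B.Chars[Index] <> Target do    { no end-of-data test in the loop }
    Index := Index + 1;
  if Index <= B.Length then
    Position := Index
  else
    Position := 0                      { only the sentinel matched }
end;

var
  Sample: Buffer;

begin
  Sample.Length := 5;
  Sample.Chars[1] := 'h';
  Sample.Chars[2] := 'o';
  Sample.Chars[3] := 'r';
  Sample.Chars[4] := 's';
  Sample.Chars[5] := 'e';
  WriteLn (Position ('s', Sample));    { 's' is the fourth character }
  WriteLn (Position ('x', Sample))     { 'x' does not occur, so 0 }
end.
```

The loop tests one condition per character instead of two, which is where the saving comes from; the price is the extra storage location that must always be available past the last character.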
In Wednesday's class you mentioned that the data structure you used for
the Strings module consists of a linked list of arrays, each
array being eighty elements long. Wouldn't it be less wasteful to use
arrays with fewer elements, say ten, to avoid empty space at the end of the
array when storing small strings?
Certainly, if most of your strings are small. I chose to set
BlockSize to 80 on the assumption that most of the strings
that the module would be used for would be complete lines of text. If your
application deals mainly with single words, it would be a good idea to
change the definition of BlockSize to 8 and then recompile the
module.
In the Strings handout, in reason three as to why Pascal does a poor job of implementing strings, you state:
There is no way to refer to, construct, or operate on the null string.
Is the null string a string of length zero or a string composed entirely of chr(0)? If the null string is a string of length zero, then I see why it is not possible to construct it. However, couldn't a string consisting of all chr(0) be referred to in the same way as a string of length zero, since Pascal's built-in string type allows lexicographic comparisons? In other words, if
nullstring is a
string of length zero and nullstring2 is a string of a defined
length composed of chr(0), wouldn't
if string1 = nullstring then ...
be equivalent to
if string1 = nullstring2 then ...
The null string is a string of length 0.
Strings consisting of repetitions of the character null are distinct
from the null string and from one another. Of the two Boolean conditions
you have described, the first is not standard Pascal (since there is no way
to define a standard Pascal string of length zero), so I don't see how it
could be equivalent to the second one. On the other hand, in the
Strings module, the condition
EqualStrings (NullString, FillString (Len, Chr (0)))
will return False unless Len is zero.
I'm having trouble understanding the ``string pool'' implementation of strings. I can't visualize the abstract type the way it is described in the handout.
Here's a picture of a string pool containing two strings, "horse" and "cat":
                              marker = 9
                                  |
                                  V
-----------------------------------------------------------
| h | o | r | s | e | c | a | t |   |   |   |   |   |   | ...
-----------------------------------------------------------
  ^                   ^
  |                   |
Start = 1           Start = 6
Length = 5          Length = 3
Each string is represented by a handle -- a record containing the
position at which the characters in the string begin and the number of
characters in the string. The marker keeps track of the
lowest-numbered unused position. (The arrows in this diagram don't stand
for Pascal pointers; they're just indicating the positions in the pool that
have the specified subscripts.)
If the program next generated the string "Fargo", the pool would look like this afterwards:
                                                 marker = 14
                                                      |
                                                      V
-----------------------------------------------------------
| h | o | r | s | e | c | a | t | F | a | r | g | o |   | ...
-----------------------------------------------------------
  ^                   ^           ^
  |                   |           |
Start = 1           Start = 6   Start = 9
Length = 5          Length = 3  Length = 5
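In Pascal terms, the picture might correspond to declarations along these lines. This is a rough sketch under assumptions: the names Pool, Marker, Handle, StartString, and AddChar, and the pool capacity, are invented for the illustration and are not taken from the handout. There is no check for a full pool.

```pascal
program StringPoolSketch (Output);

const
  PoolSize = 100;    { hypothetical capacity of the pool }

type
  Handle = record
    Start: Integer;  { position of the string's first character }
    Length: Integer  { number of characters in the string }
  end;

var
  Pool: array [1 .. PoolSize] of Char;
  Marker: Integer;   { lowest-numbered unused position }
  Horse, Cat: Handle;

{ Begins a new, empty string at the marker. }
procedure StartString (var H: Handle);
begin
  H.Start := Marker;
  H.Length := 0
end;

{ Appends Ch to H. Valid only while H is the most recently started
  string, since its characters must end just before the marker. }
procedure AddChar (Ch: Char; var H: Handle);
begin
  Pool[Marker] := Ch;
  Marker := Marker + 1;
  H.Length := H.Length + 1
end;

procedure WriteString (H: Handle);
var
  Index: Integer;
begin
  for Index := H.Start to H.Start + H.Length - 1 do
    Write (Pool[Index]);
  WriteLn
end;

begin
  Marker := 1;

  StartString (Horse);
  AddChar ('h', Horse);  AddChar ('o', Horse);  AddChar ('r', Horse);
  AddChar ('s', Horse);  AddChar ('e', Horse);

  StartString (Cat);
  AddChar ('c', Cat);  AddChar ('a', Cat);  AddChar ('t', Cat);

  WriteString (Horse);
  WriteString (Cat);
  { As in the first picture: Cat starts at 6 and the marker is 9. }
  WriteLn ('Cat starts at ', Cat.Start : 1, ', marker = ', Marker : 1)
end.
```

A handle is just two integers, so copying or passing a string around is cheap; the characters themselves stay put in the pool.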
You suggest a number of different ways to implement the string data type,
none of them completely desirable. How is Pascal's built-in string data
type implemented?
I presume you mean the HP Pascal String types (standard Pascal
strings are fixed-length character arrays). HP Pascal uses the first of
the five alternative implementation types that I proposed: An HP Pascal
variable-length string consists of an integer indicating the string's
current length and an array in which the characters making up the string
are stored. Since the array's size is fixed when the string is declared,
each string has a maximum size that cannot be exceeded.
I've noticed that with both the string and queue modules, you implement
the string type and the queue type as a pointer to a record rather than
just a record. This seems weird to me. The advantage I see is that you
can create and destroy these data types at will, so you're not limited by
the number you declare. However, I can't really think of a situation in
which you'd want to have that kind of dynamic number of queues or
strings--it seems that you would know the number of data structures needed
in advance. Also, there's the problem of having a null pointer. Granted,
it's possible to use the Assert procedure to make sure the
pointer points somewhere, but the procedure that checks to see
whether a queue is empty looks at the head and tail pointers, not the
pointer to the header record. So, my question is why did you choose to
introduce this additional level of complexity? Is there something I'm
missing?
The basic reason is to ensure that the types are opaque. Without the additional level of indirection, the module programmer would have to export the record type, thus making its field names visible to the application programmer, who could synthesize bogus structures not conforming to the invariants that the module imposes and generally write code that depended on implementation details instead of treating queues and strings as abstract data types.
A side advantage is that in most implementations of Pascal it is faster to pass a pointer to a value-parameter than to pass a record. (In either case, a complete copy of the value must be made, and pointers are generally smaller than records.)
In the EveryCharOfString function, you use a goto to escape
from the loop early if a character which fails the test is found, and I
don't understand why. Couldn't you just use a while-loop in the place of
the for-loop, as follows:
Position := 0;
while (Position < StrSize) and (not Found) do begin
  if PositionInBlock = BlockSize then begin
    Traverser := Traverser^.Next;
    PositionInBlock := 0
  end;
  Position := Position + 1;
  PositionInBlock := PositionInBlock + 1;
  if not Test(Traverser^.Data[PositionInBlock]) then begin
    Found := True;
    EveryCharOfString := False
  end
end
Is there an efficiency problem with that, or am I missing something
subtle?
No, you could also do it that way (remembering to initialize
PositionInBlock to 0, Found to
False, Traverser to Str^.Head, and
EveryCharOfString to True before entering the
loop, of course). The version I give is only slightly faster (the
difference being that it doesn't have to check Found every
time through the loop), and your version may be easier to understand. I
just happened to think of the goto version first.
How often will we come across situations in which strings are a useful solution? Obviously, in exercise #5, strings are not too useful as we may need to change the size of space allocated to an entry. In your experience, how often do strings become useful?
If you're talking about standard Pascal strings, they are often useful, but I agree that you have to jump through hoops to work with them in exercise #5. On the other hand, strings as an abstract data type, as described in the handout on strings, are very generally useful -- and would greatly simplify exercise #5 in particular.
In the reading, one of the advantages given of a doubly-linked list is that one can print out the data in reverse order. However, can't this easily be done with a recursive writeout procedure performed on a singly-linked list? I've done that myself on more than one occasion, instead of allocating the extra storage for the second pointer.
Yes, it can. The difficulty is that if the list is extremely long, the number of levels of recursion required can exceed the capacity of the run-time system. Since the output procedure is not ``tail-recursive'' (the whole point is that it must write out the datum in the current component only after the recursive call to handle the rest of the list has been completed), this can happen even when recursive calls are optimized fairly effectively.
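For concreteness, a recursive writeout of the kind the question describes might look like the following minimal sketch. The type and procedure names (ListNode, Prepend, WriteReversed) are assumptions made for the example.

```pascal
program ReverseWriteSketch (Output);

type
  ListPtr = ^ListNode;
  ListNode = record
    Datum: Integer;
    Next: ListPtr
  end;

procedure Prepend (Datum: Integer; var Ls: ListPtr);
var
  Fresh: ListPtr;
begin
  New (Fresh);
  Fresh^.Datum := Datum;
  Fresh^.Next := Ls;
  Ls := Fresh
end;

{ Writes the data in reverse order. The write follows the recursive
  call, so the call is not tail-recursive: one activation record per
  list component stays on the run-time stack until the end of the
  list is reached. }
procedure WriteReversed (Ls: ListPtr);
begin
  if Ls <> nil then begin
    WriteReversed (Ls^.Next);
    WriteLn (Ls^.Datum)
  end
end;

var
  Sample: ListPtr;

begin
  Sample := nil;
  Prepend (3, Sample);
  Prepend (2, Sample);
  Prepend (1, Sample);     { the list is now 1, 2, 3 }
  WriteReversed (Sample)   { writes 3, then 2, then 1 }
end.
```

The depth of recursion equals the length of the list, which is exactly the limitation described above.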
Is there another implementation for doubly-linked lists besides the pointer implementation? Most other abstract data types have had multiple choices for implementation method.
Sure; for instance, you can implement a bidirectional list as a record
consisting of an array (within which all the elements of the bidirectional
list will be stored) and three fields containing indices into the array:
one to indicate where the first element of the list is stored, one to
indicate where the last element is stored, and one to give the location of
the cursor. Adding one to the index of an element moves you in the
Aft direction; subtracting one moves you in the
Fore direction. At the end points, the array ``wraps
around,'' so that the highest-index position is in the Fore
direction from the lowest-index position and the lowest-index position is
Aft of the highest-index position.
Insertions and deletions in the interior of this structure are less efficient than their linked-list equivalents, because all the elements on one side or the other of the point of insertion or deletion have to be moved. Also, of course, this implementation sets a compile-time upper bound on the number of elements that the bidirectional list can hold. For these reasons, the pointer structure is far more commonly used.
Are there uses for doubly-linked lists other than trees?
Trees aren't lists at all, because the elements of a tree aren't arranged in a linear order.
Bidirectional lists have many uses. The handout on bignums illustrates one of them; problem 5.6 in Walker's book (pages 241 through 248) presents another.
After reading the handout on doubly linked lists, and the example of the golf tournament, I was wondering if there were applications for multiply-linked lists (i.e., containing more than two pointers). I thought of one example where it might prove useful.
A police department might find it useful to maintain an address and phone book file capable of almost instantaneous retrieval; they might wish to have a set of data which included first name, last name, street address, phone number and possibly ZIP code. Could each data field then contain five different pointers so as to have the data sorted by each aspect? Or would this prove so problematic in inserting and deleting items, not to mention simply coding the program properly, that it would be more beneficial to simply use a very efficient sorting method and run through the data each time a search is done by a different field than the last search?
It would seem that it might take a considerable amount of time for a police department in a large city, so perhaps the extra trouble of maintaining the pointer lists would be better ... though of course a linked list is not as easily searched as an array.
One could indeed have a component type containing fields for first name, last name, address, telephone number, and ZIP code and five pointers to other components, and use the five pointers in each component to construct five singly-linked lists, all containing the same components, but in different orders. (Though it is clearly a multiply-linked structure, once again I hesitate to call it a list unless one of the orders is thought of as more fundamental than the other four.)
Records containing multiple pointers are very widely used in contemporary programming. The most common arrangements are a linear order (list) and a hierarchy (tree), just because the algorithms for dealing with these relatively well-behaved structures are easier; but ``spaghetti structures'' like the one you described are also frequently encountered.
When would doubly-linked lists be better to use than singly-linked lists, since the extra link costs more bytes? It would depend on whether run time was more important or storage space was more important. It almost seems like doubly-linked lists would be better if the amount of information you were working with is small, but if the information was small the time saved with doubly-linked list would hardly be noticeable when compared to the singly-linked list.
Doubly-linked lists are better when the application calls for traversing the structure in both directions or for moving the cursor back and forth over parts of the structure to scan them repeatedly. I can't think of a case in which one would prefer them to singly-linked lists solely because of the number of elements in the structure.
Is it possible to do binary insertion with a bidirectional linked list?
Not efficiently. There's no way to get to the middle of the list in constant time.
Are doubly-linked lists easier to sort than singly-linked lists?
On the contrary, the extra pointers mean that you do twice as much work when moving a list component from one position to another. And most of the sorting methods in which the data are shifted around from one component to another within a list whose pointers remain fixed work equally well regardless of whether the list is singly or doubly linked. The one exception is the shaker sort, which involves traversing the list alternately from head to tail and from tail to head; but I wouldn't recommend using the shaker sort anyway.
Is it possible to create a very simple long integer package by defining
an array from 1 to, say, 20, and store the individual representations of
each numeral, therefore completely eliminating a need to pay attention to
MaxInt and MinInt?
It's certainly possible to define one's own LongInteger type,
and in fact we'll be doing almost exactly that when we discuss ``bignums''
in November. It's not quite as simple as it sounds; in fact, getting the
division operation to work quickly and correctly is surprisingly
intricate.
If you use an array representation, you'll still have to pay attention to
MaxInt and MinInt; they'll just be farther from
zero than before. Bignums use pointer structures that are allocated from
the heap, so all one has to worry about is running out of memory.
How is a number in base 1290 represented in Naturals? Is it some sort
of binary numeral? Does it have a binary decimal?
A value of the Natural data type is a bidirectional list,
implemented as a doubly-linked list with a header node, in which the
individual components are digits in base 1290 (that is, members of the
subrange type 0 .. 1289). Take, for example, the natural
number one trillion (1000000000000), which is comfortably greater than
MaxInt. 1000000000000 = 465 * 1290^3 + 1075 * 1290^2 + 548 *
1290^1 + 580 * 1290^0, so the natural number one trillion is represented as
a bidirectional list containing four components, thus:
                        +-----------------+
                        |        4        |
                        +-----+-----+-----+
                        |  .  | nil |  .  |
                        +--|--+-----+--|--+
                           |           |
        +------------------+           +------------------+
        |                                                 |
        V                                                 V
+-----+------+--+   +--+------+--+   +--+------+--+   +--+------+-----+
|     | 465  | -+-->|  | 1075 | -+-->|  | 548  | -+-->|  | 580  | nil |
| nil |      |  |<--+- |      |  |<--+- |      |  |<--+- |      |     |
+-----+------+--+   +--+------+--+   +--+------+--+   +--+------+-----+
Within the individual components, the individual digits are stored as
values of the subrange type: two bytes are allocated for a value, and it's
stored as a binary representation.
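The digit values in the example can be checked with a short computation. The following is only a sketch in Python, using an ordinary Python list in place of the doubly-linked list; the function names are invented, and representing zero as the single digit 0 is an assumption made for the sketch, not a statement about the Naturals module.

```python
BASE = 1290

def to_digits(n, base=BASE):
    """Return the digits of a nonnegative integer, most significant first."""
    if n == 0:
        return [0]                 # assumption: zero is written as one digit
    digits = []
    while n > 0:
        n, d = divmod(n, base)     # peel off the least significant digit
        digits.append(d)
    return digits[::-1]

def from_digits(digits, base=BASE):
    """Evaluate a numeral given as a list of digits, most significant first."""
    value = 0
    for d in digits:
        value = value * base + d
    return value

trillion = to_digits(10**12)       # [465, 1075, 548, 580]
```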
I understand why using a large base would save space, but I don't
understand why it can't be larger than the cube root of MaxInt +
1. Why?
The division algorithm presupposes that we can use Pascal's built-in
div and mod functions to operate on numbers
expressed by numerals containing as many as three digits in whatever base
of numeration we settle on. (For example, to estimate the first digit of
the quotient accurately, the algorithm sometimes divides the first three
digits of the dividend by the first two digits of the divisor.) So,
in whatever base b of numeration we pick, every three-digit numeral
must express a number less than or equal to MaxInt. The
largest number expressed by a three-digit numeral is b^3 - 1; since
b^3 - 1 <= MaxInt, b <=
(MaxInt + 1)^(1/3).
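To see where 1290 comes from, one can simply search for the largest base that satisfies the inequality. This sketch is in Python and assumes that MaxInt is 2^31 - 1, as it would be for a 32-bit signed Integer type; the function name is invented.

```python
MAX_INT = 2**31 - 1    # assumed value of MaxInt (32-bit signed integers)

def largest_base(max_int=MAX_INT):
    """Largest b such that every three-digit numeral in base b,
    that is, every value up to b**3 - 1, fits within max_int."""
    b = 1
    while (b + 1) ** 3 - 1 <= max_int:
        b += 1
    return b
```

With the assumed MaxInt the result is 1290, since 1290^3 = 2146689000 fits and 1291^3 = 2151685171 does not.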
How much slower are the operations on natural numbers than on the integer built-in data type?
I haven't measured them, but I imagine that they are about a hundred times
slower for values in the Integer range.
Is it possible to prove that there is a best-case division algorithm?
I don't know of any such proof, and I suspect that the optimal division algorithm has not yet been discovered, let alone proven to be optimal. I implemented the Brinch Hansen algorithm because it is pretty fast and retains some resemblance to the pencil-and-paper method. Knuth, however, claims (on p. 264 of Seminumerical algorithms) that for very large operands a variation of Newton's method is much faster than the classical algorithm.
Is division of the built-in integer data type implemented in software by the Pascal compiler or in hardware by the microprocessor?
In software. The PA-RISC instruction set includes a ``divide step'' instruction that apparently performs a combination of elementary operations that is useful as a step in a full division, but this instruction must be combined with others and repeated in order to complete a full integer division.
I'm still a bit unclear on why 1290 is the maximum base we can use for the bignums module. Is it just that we have to be able to enter three-digit numbers of any given base into the computer without first converting them to that base, in which case I see why 1290 is the max? Once we perform addition or multiplication operations on any numbers, then the three-digit rule no longer applies. When is it that we have to limit ourselves to only three-digit numbers of any given base?
The only point in the source code for the Naturals module at
which the rationale for the three-digit limit is apparent is at lines
644-651, inside the DivideNatural procedure, in the local
function FindTrialDigit:
    FirstThree := 0;
    CursorToStartOfBidirectionalList (Residue);
    for Position := 1 to 3 do begin
      FirstThree := FirstThree * BaseOfNumeration +
                    ElementAtCursorInBidirectionalList (Residue);
      AdvanceCursorAlongBidirectionalList (Residue)
    end;
    Trial := FirstThree div FirstTwo;
In the last statement in this passage, FirstThree, which is an
ordinary variable of the standard Pascal Integer type, is
divided by another integer using the built-in div operation,
in order to get an estimate of the next digit of the quotient in the full
division operation. But, as the immediately preceding statements show, the
value of FirstThree is constructed by evaluating a numeral
comprising the first three digits of the natural number
Residue, in the chosen base of numeration. So this method of
constructing a digit of the quotient will not work unless we can evaluate
any three-digit numeral in the chosen base of numeration without exceeding
MaxInt.
How big can a bignum get before using up the available memory on the HP?
In other words, what's the MaxInt for the bignum type?
It depends on how much memory and swap space the workstation you're on has, how much of it is being used by other programs, how many other objects your program has dynamically allocated, whether they've been recycled correctly, and so on. I've seen Pascal programs on one of our 712/60s allocate as much as fifty megabytes of dynamic memory, which would be room enough for a bidirectional list containing four million components, so I suppose a plausible estimate for the number you're looking for would be 1290^4000000, which is about 10^12442359.
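The conversion from a power of 1290 to a power of 10 in that estimate is just a logarithm: 1290^4000000 = 10^(4000000 * log10(1290)). Here is the arithmetic as a Python sketch (the variable names are mine):

```python
import math

BASE = 1290
COMPONENTS = 4_000_000

# log10(BASE ** COMPONENTS) = COMPONENTS * log10(BASE)
exponent = COMPONENTS * math.log10(BASE)

# Number of decimal digits needed to write BASE ** COMPONENTS out in full
decimal_digits = math.floor(exponent) + 1
```

The exponent comes out a little under 12442359, which is where the rounded figure 10^12442359 comes from.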
Why all this trouble with bignums? Why not just set MaxInt
higher or lower?
Because the limitation that is symbolized by the value of
MaxInt is actually a limitation of the underlying machine
hardware: The arithmetic operations that are performed by the processor
take operands of thirty-two bits and no more -- that's just how the
processor was designed. Adding more bits to the processor for some
particular program is not an option.
Besides, how do you propose to do it? One can set MaxInt
to a lower value by redefining it, though that will not affect the
arithmetic operations in the least. But a definition such as
const
MaxInt = 18446744073709551615; { = 2^64 - 1 }
will simply be rejected by the compiler:
0 3.000 0 MaxInt = 18446744073709551615; { = 2^64 - 1 }
^
**** ERROR # 1 INTEGER OVERFLOW (007)
The problem with the standard integer data type seems to be size and the
problem with the real data type seems to be precision. Do
precision-conscious organizations design improved data types, in the spirit
of bignums, to handle real numbers?
Yes. You can see such structures in action in programs like Maple and
bc, both of which allow the user to specify any number of
significant digits to be retained in computations involving real numbers.
(For instance, executing the assignment statement Digits :=
50; in Maple causes all subsequent computations involving reals to
be carried out to fifty significant figures, in decimal numeration; the
statement scale = 40 in bc directs the program to
keep forty digits after the decimal point in every computation.)
The handout on ratios describes another approach to the same problem.
Is the difference between mod and modulo that
when you have a negative moduland, the modulo will be the negative of the
modulus?
The result of the modulo operation always has a smaller absolute value than the modulus and so can never be its negative.
The difference between standard Pascal's mod operation and the
modulo operation is that it is an error for the second operand to
mod to be negative.
One way of describing the difference between the remainder operation and the modulo operation is that if the operands differ in sign and the division doesn't come out even, then the result of the remainder operation has the same sign as the first operand (the dividend), while the result of the modulo has the same sign as the second operand (the modulus).
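The sign rule can be tried out with a small sketch. It is written in Python rather than Pascal, the function names are invented, and it assumes the usual definitions: remainder goes with a quotient truncated toward zero, modulo with a quotient rounded toward negative infinity.

```python
def remainder(a, b):
    """Remainder of truncating division: the sign follows the dividend a."""
    q = abs(a) // abs(b)           # magnitude of the truncated quotient
    if (a < 0) != (b < 0):
        q = -q
    return a - b * q

def modulo(a, b):
    """Result of floored division: the sign follows the modulus b."""
    return a - b * (a // b)        # Python's // rounds toward negative infinity
```

For example, remainder(-7, 3) is -1 while modulo(-7, 3) is 2, and remainder(7, -3) is 1 while modulo(7, -3) is -2. When both operands have the same sign, or the division comes out even, the two operations agree.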
Something confuses me in the naturals.p module. You have two
separate import clauses; one is:
  $search 'natural-elements.o, bidirectional-lists.o'$
  import
    Elements, BidirectionalLists;
and later:
  $search 'natural-elements.o, stacks.o'$
  import
    StdErr, Stacks;
Why do this twice?
The idea is that identifiers from the Elements and
BidirectionalLists modules are used in the export
section of the Naturals module, whereas identifiers from the
StdErr and Stacks modules are used only in the
implement section and so need to be imported only into that
section.
When using the Naturals module, must I also do the search
for the stacks.o and do the additional imports of
StdErr and Stacks?
No, you don't have to mention stacks.o in the
$search clause or Stacks in the
import clause unless you actually use identifiers from that
module in your own code. However, you must include StdErr in
the program header, and you must include stacks.o as one of the
object files to be linked when you compile the application that imports
Naturals.
The situation is different for the BidirectionalLists module,
because Pascal actually needs the identifiers found there to help it
complete the linking -- it can't confirm that the arguments to procedure
and function calls in the main program have the right types unless it knows
all the identifiers in the corresponding procedure and function headers,
and it has to search the .o file for the type definitions.
How does Maple compute numbers raised to a large power -- for example,
2^5678? If I execute this operation in Maple I get an answer back
immediately. If I try to perform this computation in our Ints
or Naturals module I have to wait a very long time, if it ever
finishes.
Maple probably uses a faster but more complicated algorithm to perform
exponentiation and multiplication than the one I proposed for the
Naturals module, but I would imagine that the greatest
increase in speed comes from the fact that Maple does not recycle storage
as it goes along, but rather waits until a convenient moment (when it's not
in the middle of a computation, for instance) to perform garbage
collection.
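I don't know which exponentiation algorithm Maple actually uses, but one well-known way to speed up exponentiation is repeated squaring, which needs a number of multiplications proportional to the number of bits in the exponent rather than to the exponent itself. The following is a sketch in Python (with Python's built-in big integers standing in for Naturals), not a description of Maple's internals:

```python
def power(base, exponent):
    """Exponentiation by repeated squaring, for a nonnegative integer exponent."""
    result = 1
    while exponent > 0:
        if exponent % 2 == 1:      # this bit of the exponent is set
            result *= base
        base *= base               # square for the next bit
        exponent //= 2
    return result
```

For the exponent 5678, which has thirteen binary digits, the loop performs about twenty multiplications, where multiplying by 2 over and over would need 5677.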
In the Ratios module, what happens if the denominator value is 0?
It is impossible to construct a ratio with a zero denominator.
MakeRatio traps the attempt to construct such a ratio with its
initial Assert statement, and there are similar traps in both
of the arithmetic operations that might otherwise yield such a result
(DivideRatio and ReciprocalOfRatio).
In reviewing the code, however, I perceive that I failed to put the
appropriate assertion into the ReadRatio function; since
Debug is True in the current version of the
module, the user gets a report saying that ReadRatio is
returning an incorrectly constructed ratio, but the program proceeds
anyway. Well, that's easily fixed, and it will give me an opportunity to
fix a memory leak at the same time. Replace the lines
    ReadNatural (Source, D, Success);
    if not Success then
      goto 99
in the Ratios module with
    ReadNatural (Source, D, Success);
    if ZeroNatural (D) then begin
      Success := False;
      DeallocateNatural (D)
    end;
    if not Success then begin
      DeallocateNatural (N);
      goto 99
    end
This causes Success to be set to False if a fraction with denominator 0 is read.
How much of a concern is it that ratios eventually get larger and larger and larger? I know the workstations we use have lots of memory and the ability to swap large pieces of it out to disk, but wouldn't a ratio-intensive program cause trouble eventually?
Not all applications of ratios encounter this problem. For instance, if you're using ratios to represent quantities of money (in American units), none of the operations that you're likely to perform will result in ratios with denominators larger than 100. Even if you have to perform an occasional tax or exchange-rate computation that produces outsize ratios, it's easy to convert back to the denominator 100 by rounding.
One of the things that I was hoping to put into the handout on ratios is an ingenious algorithm for finding good and simple approximations to ratios. It implements an operation that I'll call simplify, which takes a ratio (usually one with gigantic numerator and denominator) and a tolerance and returns the simplest ratio differing from the given ratio by an amount not exceeding the tolerance. The result is simplest in the sense that both its numerator and its denominator are at least as small as those of any other ratio lying within the range specified by the tolerance. I ran out of time before inserting this operation, but here's what the handout really should contain:
floor
  Input: operand, a ratio.
  Output: result, a ratio.
  Preconditions: None.
  Postcondition: result is the greatest ratio with denominator 1 not exceeding operand.
simplify
  Inputs: approximand and tolerance, both ratios.
  Output: approximation, a ratio.
  Preconditions: None.
  Postconditions: approximation is in the range bounded (inclusively) at one end by the difference between approximand and tolerance and at the other end by their sum. approximation is simpler than any other ratio in that range, in the sense that both its numerator and its denominator are less than or equal to those of any other ratio in the range.
function FloorRatio (Operand: Ratio): Ratio;
var
  Quot: Natural;
    { the (whole-number) quotient of the numerator and the denominator of
      the ratio }
  Rem: Natural;
    { the remainder resulting from that division }
  Result: Ratio;
begin
  Assert (Operand <> nil, UninitializedRatioException,
          RatioExceptionHandler);
  DivideNatural (Operand^.Numerator, Operand^.Denominator, Quot, Rem);
  if (Operand^.Sign = Nonnegative) or ZeroNatural (Rem) then
    Result := BuildRatio (Operand^.Sign, Quot,
                          PascalIntegerToNatural (1), True)
  else begin
    Result := BuildRatio (Negative, SuccessorOfNatural (Quot),
                          PascalIntegerToNatural (1), True);
    DeallocateNatural (Quot)
  end;
  DeallocateNatural (Rem);
  if Debug then
    Assert (ValidRatio (Result), InvalidRatioException,
            RatioExceptionHandler);
  FloorRatio := Result
end;
function SimplifyRatio (Approximand: Ratio; Tolerance: Ratio): Ratio;
var
  LowerBound, UpperBound: Ratio;
    { the boundaries of the range within which the result must lie }
  Result: Ratio;
  function PositiveSimplest (LowerBound, UpperBound: Ratio): Ratio;
  var
    FloorOfLower, FloorOfUpper: Ratio;
      { the greatest integers not exceeding LowerBound and UpperBound,
        respectively }
    Result: Ratio;
      { the simplest ratio lying between LowerBound and UpperBound,
        inclusive }
    FractionOfLower, FractionOfUpper: Ratio;
      { the fractional part of LowerBound and UpperBound, respectively }
    ReciprocalOfFOL, ReciprocalOfFOU: Ratio;
      { the reciprocal of FractionOfLower and FractionOfUpper,
        respectively }
    SimplestBetweenReciprocals: Ratio;
      { the simplest ratio in the range bounded by ReciprocalOfFOL and
        ReciprocalOfFOU }
    ReciprocalOfSBR: Ratio;
      { the reciprocal of SimplestBetweenReciprocals }
    One: Natural;
      { the natural number 1 }
    UnitRatio: Ratio;
      { the ratio 1/1 }
  begin
    One := PascalIntegerToNatural (1);
    FloorOfLower := FloorRatio (LowerBound);
    FloorOfUpper := FloorRatio (UpperBound);
    if EqualRatios (FloorOfLower, LowerBound) then begin
      AssignRatio (Result, LowerBound);
      PositiveSimplest := Result
    end
    else if EqualRatios (FloorOfLower, FloorOfUpper) then begin
      FractionOfLower := SubtractRatio (LowerBound, FloorOfLower);
      ReciprocalOfFOL := ReciprocalOfRatio (FractionOfLower);
      DeallocateRatio (FractionOfLower);
      FractionOfUpper := SubtractRatio (UpperBound, FloorOfUpper);
      ReciprocalOfFOU := ReciprocalOfRatio (FractionOfUpper);
      DeallocateRatio (FractionOfUpper);
      SimplestBetweenReciprocals :=
        PositiveSimplest (ReciprocalOfFOU, ReciprocalOfFOL);
      DeallocateRatio (ReciprocalOfFOL);
      DeallocateRatio (ReciprocalOfFOU);
      ReciprocalOfSBR := ReciprocalOfRatio (SimplestBetweenReciprocals);
      DeallocateRatio (SimplestBetweenReciprocals);
      PositiveSimplest := AddRatio (FloorOfLower, ReciprocalOfSBR);
      DeallocateRatio (ReciprocalOfSBR)
    end
    else begin
      UnitRatio := BuildRatio (Nonnegative, One, One, False);
      PositiveSimplest := AddRatio (FloorOfLower, UnitRatio);
      DeallocateRatio (UnitRatio)
    end;
    DeallocateNatural (One);
    DeallocateRatio (FloorOfLower);
    DeallocateRatio (FloorOfUpper)
  end;
  function Simplest (LowerBound, UpperBound: Ratio): Ratio;
  var
    Result: Ratio;
      { the simplest ratio in the range between LowerBound and UpperBound,
        inclusive }
    NegatedLower, NegatedUpper: Ratio;
      { the additive inverses of LowerBound and UpperBound }
    NegatedResult: Ratio;
      { the additive inverse of the correct approximation }
  begin
    if LessRatio (UpperBound, LowerBound) then
      Simplest := Simplest (UpperBound, LowerBound)
    else if EqualRatios (UpperBound, LowerBound) then begin
      AssignRatio (Result, LowerBound);
      Simplest := Result
    end
    else if PositiveRatio (LowerBound) then
      Simplest := PositiveSimplest (LowerBound, UpperBound)
    else if NegativeRatio (UpperBound) then begin
      NegatedLower := NegateRatio (LowerBound);
      NegatedUpper := NegateRatio (UpperBound);
      NegatedResult := PositiveSimplest (NegatedUpper, NegatedLower);
      Simplest := NegateRatio (NegatedResult);
      DeallocateRatio (NegatedLower);
      DeallocateRatio (NegatedUpper);
      DeallocateRatio (NegatedResult)
    end
    else
      Simplest := BuildRatio (Nonnegative, PascalIntegerToNatural (0),
                              PascalIntegerToNatural (1), True)
  end;
begin
  Assert ((Approximand <> nil) and (Tolerance <> nil),
          UninitializedRatioException,
          RatioExceptionHandler);
  LowerBound := SubtractRatio (Approximand, Tolerance);
  UpperBound := AddRatio (Approximand, Tolerance);
  Result := Simplest (LowerBound, UpperBound);
  DeallocateRatio (LowerBound);
  DeallocateRatio (UpperBound);
  if Debug then
    Assert (ValidRatio (Result), InvalidRatioException,
            RatioExceptionHandler);
  SimplifyRatio := Result
end;
The algorithm that I've adapted here is due to Alan Bawden. I first
encountered it in the IEEE standard for the Scheme programming language,
which has a predefined procedure equivalent to simplify.
By judiciously timed applications of simplify, one can often keep the ratios from getting out of hand without losing too much accuracy.
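To watch the simplify operation at work without the storage management, here is a compact sketch of the same idea in Python, using the standard fractions module in place of the Ratios module. It is a transcription made for illustration only; the function names are mine, and it is not the code from the handout.

```python
from fractions import Fraction
from math import floor

def positive_simplest(lower, upper):
    """Simplest ratio in [lower, upper], assuming 0 < lower < upper."""
    floor_lower = floor(lower)
    if floor_lower == lower:                 # lower is itself a whole number
        return Fraction(floor_lower)
    if floor_lower < floor(upper):           # a whole number lies in the range
        return Fraction(floor_lower + 1)
    # Same whole part: recur on the reciprocals of the fractional parts,
    # which swaps the roles of the two bounds.
    inner = positive_simplest(1 / (upper - floor_lower),
                              1 / (lower - floor_lower))
    return floor_lower + 1 / inner

def simplest(lower, upper):
    """Simplest ratio between the two bounds, inclusive, in either order."""
    if upper < lower:
        lower, upper = upper, lower
    if lower == upper:
        return lower
    if lower > 0:
        return positive_simplest(lower, upper)
    if upper < 0:
        return -positive_simplest(-upper, -lower)
    return Fraction(0)                       # the range contains zero

def simplify(approximand, tolerance):
    return simplest(approximand - tolerance, approximand + tolerance)
```

For instance, simplify(Fraction(314159, 100000), Fraction(1, 100)) yields 22/7, and simplify(Fraction(3, 10), Fraction(1, 10)) yields 1/3.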
Ok, we have ratios to try to get around the rounding errors of reals...is there any data type of this sort which represents irrational numbers exactly? Sometimes it might be nice to represent the square root of seven in an exact way, but it seems offhand that the calculations for operations might be very tricky...
Exactly. One approach is to store an irrational number as a function that computes and returns that number; then the ``sum'' of two such numbers, x and y, would be a function that invokes x and y and returns the sum of the values they yield.
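Here is a minimal sketch of the first approach, in Python. It makes one assumption beyond what is said above: each "number" is a function that takes the number of decimal digits wanted and returns a ratio that falls short of the true value by less than 10^(-digits). The names sqrt_real and add_reals are invented for the sketch.

```python
from fractions import Fraction
from math import isqrt

def sqrt_real(k):
    """The square root of a nonnegative integer k, as a function of precision."""
    def approx(digits):
        scale = 10 ** digits
        # isqrt truncates, so the result is below the true root by < 1/scale
        return Fraction(isqrt(k * scale * scale), scale)
    return approx

def add_reals(x, y):
    """The ``sum'' of two such numbers: a function that invokes x and y."""
    def approx(digits):
        # ask each operand for one extra digit so the combined error
        # stays below 10 ** -digits
        return x(digits + 1) + y(digits + 1)
    return approx

root7 = sqrt_real(7)
twice_root7 = add_reals(root7, root7)
```

Nothing is computed until someone asks for a particular precision; twice_root7(20), for example, returns a ratio whose square differs from 28 by far less than 10^(-18).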
Another idea is to develop a symbolic-algebra system in which the values reflect the syntactic structure of the expressions that mathematicians use to denote them, and to apply ``simplification rules'' from time to time to ensure that, for instance, the sum of the positive and negative square roots of 7 is reduced to the integer 0, while their product is reduced to -7. This is indeed very tricky, but the examples of Maple, Mathematica, MACSYMA, and similar programs show that it can be done.
Why do they call the binary search binary? It doesn't seem to have anything to do with binary numbers.
``Binary'' simply means ``made of or based on two things or parts.'' Since a binary search works by repeatedly bisecting the range to be searched, the name seems appropriate. You're right in observing that the binary search algorithm does not presuppose any particular system of numeration.
After reading the section on binary searching, I was wondering how efficient it would be if one were to implement a ``ternary'' search that divided a data list into thirds rather than halves.
Obviously, such a search would require another comparison to be made on every pass through, and comparisons would involve more processor time and another call to memory, thus adding some overhead. But a ternary search would also narrow down the data location more rapidly. So would the time required to make another comparison outweigh the time saved by narrowing down the location more quickly?
How would one compute mathematically the theoretical efficiency of such an algorithm? The worst-case scenario would be log(base 3) of n, correct? How about the average case?
How could I test this on the computer? I remember that in 151 we once tested various sorting algorithms by measuring elapsed processor time, but the function we used on the Sun computers doesn't work on the HP's.
In theory, a ternary search should be only slightly slower than a binary search in the average case, since it would require an average of 5/3 comparisons on each of about log3(n) iterations rather than one comparison on each of about log2(n) iterations; 5/3 is about 1.67, and log2(n)/log3(n) is log2(3), which is about 1.58, so the ternary search makes roughly 5% more comparisons in all. In practice I would expect the ternary search to be noticeably less efficient, because one iteration of a ternary search loop would probably actually require more than 5/3 as much processing as one iteration of a binary search loop.
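The comparison counts in that estimate can be worked out numerically. The sketch below, in Python, takes the figures at face value: one comparison per iteration for the binary search, and an average of 5/3 for the ternary search (one comparison if the target lies in the first third, two otherwise, assuming each third is equally likely). The function names are invented.

```python
import math

def binary_cost(n):
    """Estimated comparisons: one per iteration, about log2(n) iterations."""
    return math.log2(n)

def ternary_cost(n):
    """Estimated comparisons: 5/3 per iteration, about log3(n) iterations."""
    return (5 / 3) * math.log(n, 3)

# The ratio is the same for every n > 1: (5/3) / log2(3)
ratio = ternary_cost(1_000_000) / binary_cost(1_000_000)
```

The ratio comes out slightly above 1.05, so on this accounting the ternary search does about 5% more comparisons than the binary search.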
The easiest way to time an algorithm on the HPs is to run the program that embodies it under the control of the time utility. If the file containing the executable version of the program is called frogs, then the command
    time frogs
will first run that program and then issue a report that looks something like this:
    6.3u 0.8s 0:12 15%
meaning that the processor needed 6.3 seconds to execute the programmer's part of the code and 0.8 seconds to execute the ``system calls'' provided by the operating system, that the completion of the program took 12 seconds of real time (including time that the processor spent working on other jobs), and that the execution of the program required 15% of the processor's total capacity.
Is there a way to figure how long (in seconds) it will take a program to run on the HPs, apart from actually running it?
Not in general. Each HP workstation is a time-sharing system, so that it distributes access rights to the processor among dozens or even hundreds of processes, a fraction of a second at a time. The execution time for any one process often depends, therefore, on how many other processes are contending with it for processor time and what processes they are. Some of these antagonists may not even exist when the program you want to measure is started. Programs that perform input and output may also contend for access rights to the peripheral devices.
In theory, one can sometimes examine a compiled version of the program and figure out how long it will take the processor to perform each instruction and how many times each instruction will be performed; adding up all the instruction times should yield the total execution time. In practice, the computation is often quite difficult. Also, the running time usually turns out to depend on the specific input to the program, which cannot always be examined in advance.
Is there a way to prove the maximum efficiency of a searching algorithm? The binary search technique seems to be as fast as one could search, but is there a way to formalize this?
The fastest possible searching algorithm computes the location of the item sought within the array directly from the key in constant time; in many cases, this is actually faster than binary search. A constant-time search is obviously optimal, since any search technique must examine at least one array element in the case of a successful search (to confirm that it is successful).
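The simplest instance of computing the location from the key is a table in which the index is just the key minus a fixed offset. Here is a sketch in Python; the class and method names are invented, and it assumes that the keys are integers drawn from a known, reasonably small range.

```python
class DirectAddressTable:
    """Keys are integers in range(low, high); the slot index is key - low."""
    def __init__(self, low, high):
        self.low = low
        self.slots = [None] * (high - low)

    def insert(self, key, value):
        self.slots[key - self.low] = (key, value)

    def search(self, key):
        """Return the value stored under key, or None; no loop is needed."""
        index = key - self.low
        if 0 <= index < len(self.slots):
            entry = self.slots[index]
            # examine the one candidate element to confirm the match
            if entry is not None and entry[0] == key:
                return entry[1]
        return None
```

A search looks at exactly one slot no matter how many items the table holds, at the price of reserving a slot for every possible key.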
How old are the searching algorithms we discuss in class? I know some of them are pretty old, but are some of the less obvious ones relatively recent? Are new ones being developed?
The origins of linear search, not surprisingly, are lost in antiquity.
The first example of a large data set being ordered for the purpose of speeding up searches is a Babylonian clay tablet, made about 200 B.C., containing a table of about eight hundred numbers and their reciprocals, in the base-60 notation that the Babylonians used at that time.
Ancient users of many alphabetic scripts, including Greek and Hebrew, had a standard sequence for the letters of their alphabets, and there are some documents from as early as 300 B.C. in which words or names are grouped on the basis of their initial letters. These lists, however, are not completely sorted -- for instance, all of the names beginning with A precede all of the names beginning with B, but the names beginning with A might be in any order with respect to one another.
The algorithm for completely alphabetizing strings was first described in 1286, by Giovanni di Genoa, in a book entitled Catholicon.
The binary-search algorithm was first loosely described in 1946, by John Mauchly; his description was published in G. W. Patterson's collection Theory and techniques for the design of electronic digital computers. Mauchly's version of the algorithm presupposed that the size of the array to be searched was one less than a power of two. Herman Bottenbruch gave the first general and correct statement of the algorithm in a 1962 article in the Journal of the ACM.
These details come from pages 417-419 of Donald E. Knuth's Sorting and searching, volume 3 of The art of computer programming, which gives more of the history of searching, with additional bibliographical references.
Searching and sorting algorithms are still being actively studied and improved. For sorting, in particular, it turns out that there is no one algorithm that is ``best'' in all possible applications; the choice depends on the size of the data structure to be sorted, whether it is a random-access or sequential-access structure, how much processing time it takes to move one element of the structure to a new position, and so on. Since there are always new applications and new kinds of environments in which sorting is done, there is still a lot of room for innovation.
Why are we studying sort algorithms for arrays instead of linked lists? I understand that arrays are generally more useful if you know the amount of data you wish to store, but isn't this more often not the case? Wouldn't linked lists be more commonly used?
I wanted to deal with sorting in connection with arrays first because many of the algorithms can be expressed more simply and straightforwardly on arrays. Also, of course, the fact that an array is a random-access structure and a linked list is not affects the kinds of sorting algorithms that one can use -- basically, any linked-list sort can be applied to an array with no loss in speed, but the reverse is not true.
We shall, however, take up sorting again later in the semester, in connection with linked structures.
When we are trying to design a method to sort or search an array, we usually consider how to reach the target as soon as possible. But we can not go directly or more quickly to a target while searching a linked list, since we have to follow the pointers one by one. Why are linked structures more commonly preferred in programming?
Because in most applications one doesn't know in advance how many elements the structure will need. Arrays are comparatively inflexible; if a program runs out of space in an array, there is usually nothing it can do to remedy the problem.
How are these various sorting routines discovered? Are there honestly computer scientists sitting around in offices trying to find them? Or are they stumbled upon randomly?
Insertion sort, selection sort, merge sort, and radix sort were adapted from sorting rituals that were known before the computer era. Quicksort and heapsort were indeed found by computer scientists sitting in offices trying to figure out how to sort particular data sets efficiently. As for the binary search tree sort, the data structure itself was invented in order to make searches easier rather than to provide a sorting method, but once you have the structure the binary search tree sort is an obvious application of it.
The implementations given for binary search trees in Walker's text and in your handout use recursive procedures; would it be beneficial in terms of memory conservation to use iterative procedures instead? Or, since the number of procedure calls that build up is relatively small even with very large data sets (as compared to straight linked lists) in most cases, is it just plain easier to go with recursion and sacrifice a small amount of memory? It seems like the iterative solutions in this case might become rather complex, too.
The recursive procedures and functions that Walker and I proposed can be
divided into two groups -- those that issue at most one recursive call
before they are completed (Walker's Insert, my
InsertIntoBinarySearchTree,
SearchBinarySearchTree, and
DeleteFromBinarySearchTree) and those that sometimes issue
more than one (Walker's Print, my
PrintBinarySearchTreeData,
ApplyThroughoutBinarySearchTree, and
DeallocateBinarySearchTree). The routines in the first group
can be converted fairly easily into iterative versions; Walker gives an
iterative version of Insert, and here's what
SearchBinarySearchTree would look like as an iterative
function:
function SearchBinarySearchTree (Sought: KeyType; B: BinarySearchTree;
                                 var Found: Element): Boolean;
var
  Searching: Boolean;
    { indicates whether the search can and should continue }
begin
  Searching := True;
  while Searching do
    if EmptyBinarySearchTree (B) then begin
      Searching := False;
      SearchBinarySearchTree := False
    end
    else if Sought < B^.Datum.Key then
      B := B^.Left
    else if B^.Datum.Key < Sought then
      B := B^.Right
    else begin
      Searching := False;
      SearchBinarySearchTree := True; { because Sought = B^.Datum.Key }
      Found := B^.Datum
    end
end;
On the other hand, making the routines in the second group iterative is
more difficult, and the most straightforward way to do it would be to
construct a stack of ``unfinished jobs'' -- subtrees that have not yet been
fully traversed or processed. Here's how the
PrintBinarySearchTreeData procedure looks in iterative form:
procedure PrintBinarySearchTreeData (B: BinarySearchTree);
var
Postponed: TreeStack;
{ a stack of subtrees that have not yet been fully printed }
Current: BinarySearchTree;
{ a pointer to a subtree whose left subtree either is empty or has been
completely printed }
begin
Postponed := CreateTreeStack;
while B <> nil do begin
PushToTreeStack (B, Postponed);
B := B^.Left
end;
while not EmptyTreeStack (Postponed) do begin
Current := PopFromTreeStack (Postponed);
WriteLn (Current^.Datum);
Current := Current^.Right;
while Current <> nil do begin
PushToTreeStack (Current, Postponed);
Current := Current^.Left
end
end
end;
However, this doesn't save much space, since the Postponed
stack grows in exactly the same way that the run-time stack does in the
recursive version of the program. So if a procedure or function sometimes
makes more than one recursive call as it executes, it's usually better
(simpler, anyway) to use the recursive version.
I created a binary search tree module from the code you gave us in the
handout. One field in my Element record is a pointer from
which I build a linked list. My question is: if I delete a node from the
binary search tree, using the keytype of an integer which is another field
of the element record, do I need to amend the deletion procedure so as to
deallocate the linked list in the element record, or is the deletion
procedure OK as is? If I need to change it, then how?
It depends on whether you built a complete copy of the list in freshly
allocated storage when you inserted the node into the binary search tree to
begin with, or simply copied a pointer to that list into the relevant field
of the Element record. If you made a new copy of the list and
have no more use for the copy once the node has been deleted, you should
recycle it by calling DeallocateList and giving it the
relevant field of the Element record as argument. On the
other hand, if you added the list to the binary search tree by copying a
pointer, or if you are going to recover the list from the
Element record being deleted and use it for something, then it
would be a mistake to recycle the list.
Would it be logical to have multiple trees constructed out of the same data set, with the data sorted by two or three or more different keys using a type definition -- something like this:
type
Element = record
PhoneNum: ...
LeftPhone, RightPhone: BinarySearchTree;
LastName: ...
LeftLastName, RightLastName: BinarySearchTree;
StreetAddress: ...
LeftStreet, RightStreet: BinarySearchTree
end;
Would this be plausible? I guess what I'm wondering is, since when you
allocate a record of data, it never moves inside the computer memory, are
the pointers to and from it simply assigned different addresses when a data
set is sorted? If so, multiple pointers would maintain correct addresses
(assuming everything is coded correctly, of course) even when the order is
changed according to one or more keys.
This can work, but only if you never use the deletion operation shown in
the handout. Since DeleteFromBinarySearchTree actually moves
one surviving datum from one node to another without modifying the
pointers, it invalidates your assertion that the record never moves inside
the computer memory. But all the other operations will work with the
structure you propose.
Your ApplyThroughoutBinarySearchTree is a little confusing
to me. What is this procedure P?
It's a procedure parameter. When one invokes
ApplyThroughoutBinarySearchTree, one gives it both the tree
to be traversed and the name of the procedure to be applied to every
element of the tree. For instance, if the elements of the tree are
integers, one could define a procedure to replace any integer value with
its cube --
procedure Encube (var N: Element); begin N := N * N * N end;
-- and then use the call
ApplyThroughoutBinarySearchTree (CountTree,
Encube) to replace every element in the tree CountTree
with its cube.
Are binary trees really worthwhile? Considering a large data-set, wouldn't the amount of storage required to create the structure be far larger than it's worth? I know I'd far rather have my processor chug away doing a slower sort/search method than try to store that entire structure in memory...
You're storing two pointers -- eight bytes, on the HPs -- per element. This is the same amount of storage overhead as in a bidirectional list (slightly less, actually, since there's no header), four bytes per element more than in a singly-linked list, eight bytes per element more than in an array. In exchange, you get insertion and deletion in O(lg n) time in the average case, as opposed to O(n) time in a sorted array or list; you get binary search, which is unavailable with lists; and you have no upper bound on the size of the structure, which is unavailable with arrays. Often it's a good trade-off.
In many applications the records themselves are much larger than eight bytes, so that the extra storage used is comparatively insignificant.
Is it desirable to balance your binary trees -- that is, to make sure that they consist of the fewest possible number of subtrees? Is it more efficient to have each element have a full left and right subtree, with the ``bottom row'' being all nil pointers?
If so (or if not, I suppose), is there an algorithm for this?
The insertion, deletion, and search algorithms all run faster, on the average, if the tree being operated on is balanced than if it is not. There are several different mechanisms for keeping binary search trees balanced; the usual approach is to make local adjustments of the tree structure during insertion and deletion to make the heights of subtrees more nearly equal. One such method, originally due to G. M. Adel'son-Vel'skii and E. M. Landis, is described in section 8.5 of Walker's textbook (pages 333-342). A somewhat simpler approach, which however requires an extra one-bit field in each node, is described in chapter 14 (``Red-black trees'') of Introduction to algorithms, by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest (Cambridge, Massachusetts: The MIT Press, 1990).
Would it be useful to start a binary tree with a given pivot if one knows the approximate median of a set to be sorted, and then let the tree build from that pivot? Are there more effective methods of avoiding a linear type of tree?
Yes, if you know or can estimate the median of a data set, it would be advantageous to start the construction of the tree by inserting that median value at the root.
Since the pathological cases that seriously affect the efficiency of insertion, deletion, and search in a binary search tree are relatively rare and very seldom arise in randomly ordered data, one way to avoid them is to randomize the order of the data before starting to build the tree. It is usually possible to shuffle a data set in O(n) time.
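Here is a minimal sketch of one way to do the shuffling, using the technique usually called the Fisher-Yates shuffle. It assumes the Random and Randomize routines that Turbo Pascal and Free Pascal provide (they are not part of standard Pascal); the names ShuffleDemo, Shuffle, and IntArray and the array size are made up for the illustration. Each position is filled by a single exchange, so the whole shuffle takes O(n) time.

```pascal
program ShuffleDemo;

const
  ArraySize = 10;   { hypothetical size, chosen for the demonstration }

type
  IntArray = array [1 .. ArraySize] of Integer;

{ Rearranges Arr into a random order, one exchange per position. }
procedure Shuffle (var Arr: IntArray);
var
  Index: Integer;      { the position currently being filled }
  Chosen: Integer;     { a randomly selected position in 1 .. Index }
  Temporary: Integer;
begin
  for Index := ArraySize downto 2 do begin
    Chosen := Random (Index) + 1;   { Random (k) yields 0 .. k - 1 }
    Temporary := Arr[Index];
    Arr[Index] := Arr[Chosen];
    Arr[Chosen] := Temporary
  end
end;

var
  Data: IntArray;
  Index: Integer;
begin
  Randomize;   { assumed extension: seeds the random-number generator }
  for Index := 1 to ArraySize do
    Data[Index] := Index;
  Shuffle (Data);
  for Index := 1 to ArraySize do
    Write (Data[Index], ' ');
  WriteLn
end.
```

After the shuffle, the values can be inserted into the binary search tree in array order; already-sorted input then no longer leads to a degenerate tree except by very bad luck.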
Will we be learning methods for sorting trees other than binary search trees, or is it just more convenient to transform any tree into a binary search tree?
We'll shortly be studying the heapsort, which uses a binary-tree structure with a different ordering property. If you have a tree in which the values are arranged randomly, however, the easiest way to sort it may be to copy the values into another structure, such as a binary search tree, and recover them in sorted order from that structure.
In what types of situations should we use binary trees? When are they used in real-world programming?
Binary trees are best suited to applications in which an unpredictable number of elements have to be arranged fairly quickly in some structure that is efficiently searchable. Since deletion is somewhat less efficient than insertion, binary trees tend to be used in cases where deletion is never needed -- either the structure is constructed at the beginning of the run and never changed at all, or else it is initialized at the beginning of the run and changes only by insertion. In a typical use of binary trees, search is far more frequent than insertion.
For instance, many file-compression programs construct binary trees to store the correspondences between source-file characters and strings on one hand and the bit-patterns that represent them in the compressed file on the other. One common arrangement is to store a representation of such a binary tree at the beginning of the compressed version of the file; the program that recovers the original source file from the compressed version begins by reading in this tree and then repeatedly searches it, using the bit-patterns from the compressed file as keys, to obtain the original characters and strings.
In what situation should one choose sorting and searching with binary trees over other algorithms?
The binary search tree sort described in the handout can deal with any number of elements. It can easily be modified to make it a stable sort (by changing the insertion procedure, as described in class, so that an element containing the same key as the element at the root of a given tree is placed in its right subtree). It performs O(n) data movements and O(n lg n) comparisons, so it's comparable in efficiency to merge sorting. Like merge sorting, it requires extra storage for a copy of each element. You use the binary search tree sort in situations where these characteristics are desirable. For example, in exercise 5, it would have been a good idea to store the index entries in a binary search tree.
Why do you use a temporary variable (Result) in the
function MakeSingletonBinarySearchTree? Couldn't you just as
easily assign the various items to the function name itself -- writing, e.g.,
MakeSingletonBinarySearchTree^.Datum := Elm;?
This wouldn't work, because the compiler would try to treat the occurrence of the function identifier as a recursive call to the function rather than a name for the value to be returned.
What does the ApplyThroughoutBinarySearchTree procedure look like?
Like this:
procedure ApplyThroughoutBinarySearchTree (var B: BinarySearchTree;
procedure P (var Elm: Element));
begin
if not EmptyBinarySearchTree (B) then begin
ApplyThroughoutBinarySearchTree (B^.Left, P);
P (B^.Datum);
ApplyThroughoutBinarySearchTree (B^.Right, P)
end
end;
Why are trees for sorting ordered from left to right rather than top to
bottom? It seems to me that the latter would be more natural for a
tree.
Binary search trees are ordered from left to right so that one can easily bisect the structure, working down from the root, when performing a search. Binary trees can, however, be ordered in other ways, and in particular the kind of binary tree called a heap can be arranged in the way you describe. It all depends on what kinds of operations one wants to be able to perform efficiently; binary search trees are good for searching, but finding, say, the least element of a binary search tree is an O(lg n) operation even in the average case, whereas in a heap the least element can be identified in constant time.
Let's say that I have a record type like this:
Link = ^PlayerNodeType;
PlayerNodeType = record
Name: String; { some string type }
BAvg: Real;
Left, Right : Link;
end;
and I build up a binary search tree based on sorting names. Suppose I then
decide that I really wanted to build it based on sorting batting average.
Is there any way to rebuild the tree other than tearing down the old one
and inserting the new elements into a new tree using a new insertion
routine?
If you stored four pointers instead of two in each record, you could build two different binary search trees from the same nodes, with one ordered by name and the other by batting average. But you'd be doing a separate insertion on each binary tree, using closely similar but not identical insertion procedures. Deletion from such a structure would be more difficult, since the idea of overwriting data in a hard-to-delete node could not be used.
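A rough sketch of the four-pointer arrangement follows. The field names LeftByName, RightByName, LeftByBAvg, and RightByBAvg, the procedure names, and the sample data are all invented for the illustration; the point is only that each node is allocated once and then linked into two trees by two nearly identical insertion procedures.

```pascal
program TwoKeyDemo;

type
  Link = ^PlayerNodeType;
  PlayerNodeType = record
    Name: String;
    BAvg: Real;
    LeftByName, RightByName: Link;   { links for the tree ordered by Name }
    LeftByBAvg, RightByBAvg: Link    { links for the tree ordered by BAvg }
  end;

{ Links an existing node into the tree ordered by name. }
procedure InsertByName (Player: Link; var Root: Link);
begin
  if Root = nil then
    Root := Player
  else if Player^.Name < Root^.Name then
    InsertByName (Player, Root^.LeftByName)
  else
    InsertByName (Player, Root^.RightByName)
end;

{ Links the same node into the tree ordered by batting average. }
procedure InsertByBAvg (Player: Link; var Root: Link);
begin
  if Root = nil then
    Root := Player
  else if Player^.BAvg < Root^.BAvg then
    InsertByBAvg (Player, Root^.LeftByBAvg)
  else
    InsertByBAvg (Player, Root^.RightByBAvg)
end;

{ Allocates one node and inserts it into both trees. }
procedure AddPlayer (Name: String; BAvg: Real; var NameRoot, BAvgRoot: Link);
var
  Fresh: Link;
begin
  New (Fresh);
  Fresh^.Name := Name;
  Fresh^.BAvg := BAvg;
  Fresh^.LeftByName := nil;
  Fresh^.RightByName := nil;
  Fresh^.LeftByBAvg := nil;
  Fresh^.RightByBAvg := nil;
  InsertByName (Fresh, NameRoot);
  InsertByBAvg (Fresh, BAvgRoot)
end;

procedure PrintByName (Root: Link);
begin
  if Root <> nil then begin
    PrintByName (Root^.LeftByName);
    WriteLn (Root^.Name, ' ', Root^.BAvg : 0 : 3);
    PrintByName (Root^.RightByName)
  end
end;

procedure PrintByBAvg (Root: Link);
begin
  if Root <> nil then begin
    PrintByBAvg (Root^.LeftByBAvg);
    WriteLn (Root^.Name, ' ', Root^.BAvg : 0 : 3);
    PrintByBAvg (Root^.RightByBAvg)
  end
end;

var
  NameRoot, BAvgRoot: Link;
begin
  NameRoot := nil;
  BAvgRoot := nil;
  AddPlayer ('Ruth', 0.342, NameRoot, BAvgRoot);
  AddPlayer ('Aaron', 0.305, NameRoot, BAvgRoot);
  AddPlayer ('Cobb', 0.366, NameRoot, BAvgRoot);
  PrintByName (NameRoot);   { alphabetical order }
  PrintByBAvg (BAvgRoot)    { ascending batting average }
end.
```

The sketch never deallocates its nodes and deliberately omits deletion, which is the operation that becomes awkward in this structure.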
Is there any algorithm that can force a tree to be balanced?
Yes. It's easier if you don't require the tree to have the ordering property that would make it a binary search tree, but there are fairly efficient ways to build binary search trees without allowing the height to exceed some fixed multiple of the logarithm of the number of nodes. The idea is to make local adjustments of the tree structure during insertion and deletion to make the heights of subtrees more nearly equal. One such method, originally due to G. M. Adel'son-Vel'skii and E. M. Landis, is described in section 8.5 of Walker's textbook (pages 333-342). A somewhat simpler approach, which however requires an extra one-bit field in each node, is described in chapter 14 (``Red-black trees'') of Introduction to algorithms, by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest (Cambridge, Massachusetts: The MIT Press, 1990).
Would there ever be a case when you might take data out of an array, sort it by creating a binary tree, and then write the contents of the tree back into the array (or into a new array)? I know that this would be an inefficient use of memory, but I think it would be far easier to code than a mergesort (or something comparable), and I don't know whether it would take less processing time.
Sure. The handout on searching and sorting with binary trees contains a procedure that does exactly what you describe. If the elements of the array are randomly ordered to begin with, the binary-search-tree sort runs in O(n lg n) time on the average, and so is faster than insertion sort or selection sort if n is sufficiently large.
Besides searching and sorting, what else can one do with binary trees?
As we'll see in Friday's reading, binary trees can be used to represent hierarchies of all sorts, such as directory structures under Unix or algebraic expressions in a symbolic-algebra system. Binary trees can often be used to implement sets efficiently -- not limited Pascal-style sets, but sets of any desired base type.
It seems that by using a binary tree, it would be possible to store long sequences of mathematical operations. They would be represented by operators and operands, with all of the leaves being operators and the other elements being operands. How exactly would this be coded?
type
Operation = (Add, Subtract, Multiply, Divide {, ...});
Link = ^Node;
Node = record
case Leaf: Boolean of
True:
(Operand: Real);
False:
(Op: Operation;
Left: Link;
Right: Link)
end;
Expression = Link;
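In this representation the operands sit at the leaves and the operators at the interior nodes. As a sketch of how such a tree might be used, here is a small program that builds the tree for (2 + 3) * 4 and evaluates it by a postorder traversal. The helper names MakeOperand, MakeOperation, and Evaluate are invented for the illustration, and the sketch does not bother to deallocate the nodes or to guard against division by zero.

```pascal
program ExpressionDemo;

type
  Operation = (Add, Subtract, Multiply, Divide);
  Link = ^Node;
  Node = record
    case Leaf: Boolean of
      True:
        (Operand: Real);
      False:
        (Op: Operation;
         Left: Link;
         Right: Link)
  end;
  Expression = Link;

{ Builds a leaf node holding a number. }
function MakeOperand (Value: Real): Expression;
var
  Fresh: Expression;
begin
  New (Fresh);
  Fresh^.Leaf := True;
  Fresh^.Operand := Value;
  MakeOperand := Fresh
end;

{ Builds an interior node that applies Op to two subexpressions. }
function MakeOperation (Op: Operation; Left, Right: Expression): Expression;
var
  Fresh: Expression;
begin
  New (Fresh);
  Fresh^.Leaf := False;
  Fresh^.Op := Op;
  Fresh^.Left := Left;
  Fresh^.Right := Right;
  MakeOperation := Fresh
end;

{ Computes the value of an expression: both subtrees first, then the
  operator at the root. }
function Evaluate (E: Expression): Real;
var
  L, R: Real;
begin
  if E^.Leaf then
    Evaluate := E^.Operand
  else begin
    L := Evaluate (E^.Left);
    R := Evaluate (E^.Right);
    case E^.Op of
      Add:      Evaluate := L + R;
      Subtract: Evaluate := L - R;
      Multiply: Evaluate := L * R;
      Divide:   Evaluate := L / R
    end
  end
end;

var
  Sample: Expression;
begin
  { the tree for (2 + 3) * 4 }
  Sample := MakeOperation (Multiply,
                           MakeOperation (Add, MakeOperand (2), MakeOperand (3)),
                           MakeOperand (4));
  WriteLn (Evaluate (Sample) : 0 : 1)
end.
```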
I wonder how to insert a whole tree into another without inserting
elements of the first tree one by one.
Replace a nil pointer in one of the second tree's leaf or semileaf nodes with a pointer to the root of the first tree.
On page 265, Walker says that by using prefix and postfix notations we can avoid the use of parentheses. How exactly would you write the expression 2 + 3 + 4 in prefix or postfix notation without the use of parentheses?
I'll assume that, following the usual mathematical convention, you want first to add 2 and 3 and then to add 4 to the result.
Prefix: + + 2 3 4
Postfix: 2 3 + 4 +
Of course, since addition is associative, you get the same result if you add 3 and 4 first and then add the result to 2:
Prefix: + 2 + 3 4
Postfix: 2 3 4 + +
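The reason postfix notation needs no parentheses is that it can be evaluated from left to right with a stack: an operand is pushed, and an operator replaces the top two values with its result. The following sketch illustrates the idea under simplifying assumptions of my own: every operand is a single digit, only +, -, and * are supported, and malformed input is not detected. The names PostfixDemo, EvaluatePostfix, and MaximumDepth are hypothetical.

```pascal
program PostfixDemo;

const
  MaximumDepth = 100;   { hypothetical limit on the stack size }

{ Evaluates a postfix expression in which every operand is a single digit
  and every operator is one of +, -, and *.  Blanks are ignored. }
function EvaluatePostfix (Source: String): Integer;
var
  Stack: array [1 .. MaximumDepth] of Integer;
  Top: Integer;         { the number of values currently on the stack }
  Position: Integer;
  Symbol: Char;
  Right: Integer;       { the second operand of the current operator }
begin
  Top := 0;
  for Position := 1 to Length (Source) do begin
    Symbol := Source[Position];
    if (Symbol >= '0') and (Symbol <= '9') then begin
      Top := Top + 1;
      Stack[Top] := Ord (Symbol) - Ord ('0')
    end
    else if (Symbol = '+') or (Symbol = '-') or (Symbol = '*') then begin
      Right := Stack[Top];
      Top := Top - 1;
      case Symbol of
        '+': Stack[Top] := Stack[Top] + Right;
        '-': Stack[Top] := Stack[Top] - Right;
        '*': Stack[Top] := Stack[Top] * Right
      end
    end
  end;
  EvaluatePostfix := Stack[Top]
end;

begin
  WriteLn (EvaluatePostfix ('2 3 + 4 +'));   { (2 + 3) + 4 }
  WriteLn (EvaluatePostfix ('2 3 4 + +'))    { 2 + (3 + 4) }
end.
```

Both calls yield 9, as the associativity of addition leads one to expect.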
Walker also states that postfix notation is used by many calculators and we use infix notation every day. Can you give an example of when prefix notation would be desirable?
Some programming languages, notably LISP, use prefix notation exclusively. It tends to be handier when you have a lot of operators that take three or more operands.
In his description of heap sort, why does Walker refer to the original structure he is trying to order as an array?
Because the heap that's used in the sort -- the partially ordered binary tree -- is implemented as an array, with the root of the binary tree in position 1 of the array and the children of node n at positions 2n and 2n + 1.
I find it tremendously counterintuitive to pretend that an array is a
binary tree, and since there isn't a whole lot of code in Walker's
description of a heap sort, I can't quite imagine how one would write it.
Can you just recode the handout you gave us, so that instead of looking at
Tree^.Parent you look at Arr[Position div 2],
making corresponding changes for the two child nodes?
Okay. Here's the code from the handout, rewritten as described:
const
MaximumPriorityQueueSize = 1000; { or whatever }
type
BinaryTree = array [1 .. MaximumPriorityQueueSize] of Element;
PriorityQueue = record
Size: Integer;
Data: BinaryTree
end;
function CreatePriorityQueue: PriorityQueue;
var
Result: PriorityQueue;
begin
Result.Size := 0;
CreatePriorityQueue := Result
end;
procedure UpHeap (var BT: BinaryTree; Index: Integer; NewElement: Element);
var
ParentIndex: Integer;
{ the location in BT of the parent of the node with number Index }
begin
if Index = 1 then { at the root of BT }
BT[Index] := NewElement
else begin
ParentIndex := Index div 2;
if NewElement.Priority <= BT[ParentIndex].Priority then
BT[Index] := NewElement
else begin
BT[Index] := BT[ParentIndex];
UpHeap (BT, ParentIndex, NewElement)
end
end
end;
procedure InsertInPriorityQueue (Insertend: Element;
var Base: PriorityQueue);
begin
Base.Size := Base.Size + 1;
UpHeap (Base.Data, Base.Size, Insertend)
end;
procedure DownHeap (var BT: BinaryTree; Index: Integer; TreeSize: Integer;
NewElement: Element);
var
Advancer: Integer;
{ the location in BT of the larger child of the element at position
Index }
begin
if TreeSize < 2 * Index then
BT[Index] := NewElement
else begin
Advancer := 2 * Index;
if Advancer + 1 <= TreeSize then begin
if BT[Advancer].Priority <= BT[Advancer + 1].Priority then
Advancer := Advancer + 1
end;
if BT[Advancer].Priority <= NewElement.Priority then
BT[Index] := NewElement
else begin
BT[Index] := BT[Advancer];
DownHeap (BT, Advancer, TreeSize, NewElement)
end
end
end;
function ExtractForemostFromPriorityQueue (var Base: PriorityQueue):
Element;
begin
Assert (0 < Base.Size, ExtractForemostFromPriorityQueueException,
PriorityQueueExceptionHandler);
ExtractForemostFromPriorityQueue := Base.Data[1];
DownHeap (Base.Data, 1, Base.Size - 1, Base.Data[Base.Size]);
Base.Size := Base.Size - 1
end;
function EmptyPriorityQueue (Operand: PriorityQueue): Boolean;
begin
EmptyPriorityQueue := (Operand.Size = 0)
end;
procedure DeallocatePriorityQueue (var Operand: PriorityQueue);
begin
end;
For the heapsort, you don't need the full repertoire of priority-queue
operations. Here's how the heapsort looks if you just want to sort an
array of integers into ascending order:
const
ArraySize = 1000; { or whatever }
type
IntArray = array [1 .. ArraySize] of Integer;
procedure HeapSort (var Arr: IntArray);
var
Index: Integer;
{ counts off positions in the array }
Temporary: Integer;
{ temporary storage for a datum that is about to be moved into its
correct position }
procedure DownHeap (Index: Integer; UpperBound: Integer;
NewElement: Integer);
var
Advancer: Integer;
{ the location in Arr of the larger child of the element at position
Index }
begin
if UpperBound < 2 * Index then
Arr[Index] := NewElement
else begin
Advancer := 2 * Index;
if Advancer + 1 <= UpperBound then begin
if Arr[Advancer] <= Arr[Advancer + 1] then
Advancer := Advancer + 1
end;
if Arr[Advancer] <= NewElement then
Arr[Index] := NewElement
else begin
Arr[Index] := Arr[Advancer];
DownHeap (Advancer, UpperBound, NewElement)
end
end
end;
begin { procedure HeapSort }
for Index := ArraySize div 2 downto 1 do
DownHeap (Index, ArraySize, Arr[Index]);
for Index := ArraySize downto 2 do begin
Temporary := Arr[1];
DownHeap (1, Index - 1, Arr[Index]);
Arr[Index] := Temporary
end
end;
The book says that in order to begin a heap sort, we must first assume
that the tree is partially ordered. Isn't this a large assumption to make?
Wouldn't it be better to make a sorting algorithm that works in any
case?
If you're inserting elements one by one into a tree, it's easy to keep the
tree partially ordered at all times; one insertion requires O(lg n) time.
Alternatively, if you're given an unordered tree, it's fairly easy to turn
it into a partially ordered tree by repeated calls to DownHeap, altogether
requiring at most O(n lg n) time (a more careful analysis shows that O(n)
suffices). So the preprocessing step is not all that time-consuming.
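To make the preprocessing step concrete, here is a sketch that starts from an arbitrary array, applies DownHeap to every position that has at least one child (working from the last such position back to the root), and checks the result. DownHeap is adapted from the heapsort code so that it takes the array as a variable parameter; the function PartiallyOrdered, the program name, and the sample values are my own inventions for the demonstration.

```pascal
program BuildHeapDemo;

const
  ArraySize = 10;

type
  IntArray = array [1 .. ArraySize] of Integer;

const
  { arbitrary test data; typed constants are a Turbo/Free Pascal extension }
  Sample: IntArray = (3, 9, 2, 7, 10, 1, 8, 4, 6, 5);

{ Places NewElement somewhere in the subtree rooted at Index, promoting
  larger children as needed.  Only positions 1 .. UpperBound are used. }
procedure DownHeap (var Arr: IntArray; Index, UpperBound, NewElement: Integer);
var
  Advancer: Integer;   { the position of the larger child of position Index }
begin
  if UpperBound < 2 * Index then
    Arr[Index] := NewElement
  else begin
    Advancer := 2 * Index;
    if Advancer + 1 <= UpperBound then
      if Arr[Advancer] <= Arr[Advancer + 1] then
        Advancer := Advancer + 1;
    if Arr[Advancer] <= NewElement then
      Arr[Index] := NewElement
    else begin
      Arr[Index] := Arr[Advancer];
      DownHeap (Arr, Advancer, UpperBound, NewElement)
    end
  end
end;

{ Reports whether every element is no greater than its parent. }
function PartiallyOrdered (var Arr: IntArray): Boolean;
var
  Index: Integer;
  Okay: Boolean;
begin
  Okay := True;
  for Index := 2 to ArraySize do
    if Arr[Index div 2] < Arr[Index] then
      Okay := False;
  PartiallyOrdered := Okay
end;

var
  Arr: IntArray;
  Index: Integer;
begin
  Arr := Sample;
  WriteLn ('Before: ', PartiallyOrdered (Arr));
  for Index := ArraySize div 2 downto 1 do
    DownHeap (Arr, Index, ArraySize, Arr[Index]);
  WriteLn ('After:  ', PartiallyOrdered (Arr))
end.
```

With the sample data shown, the array should fail the check before the loop and pass it afterwards. Nothing is assumed about the initial order of the values, which is the point: the sort works in any case, because it establishes the partial ordering itself.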
In Walker's description of how to obtain the first partially ordered array (pages 409-12), he fails to address an issue which confuses me. He says that you must start the ordering process with the last node with an offspring, sort it, and proceed until you get to the root of the tree. In ordering each subtree you may have to change the root with an offspring (if sorting is needed). However, Walker doesn't address what has to be done if the node to which you switched the old root now has offspring which are greater than it. In Walker's example, this occurs in the final step in which he says to interchange nodes 10 and 16 to get the partially ordered tree. Exchanging these nodes doesn't result in the partially ordered tree, however, as the sibling of 10 (12) is greater than 10. Thus it would appear to me that 10 and 12 would also have to be interchanged.
Would you thus have to check all offspring of every node to which you made a change? Why doesn't Walker address this?
You're right in saying that Walker's description of what happens in the
particular example he considers is incomplete. The general situation, when
you're working on a particular node, is that the left and right subtrees
have already been converted into heaps, so that only the element at the
root has to be repositioned. Positioning it correctly sometimes requires
more than one exchange, as you observe -- you have to move far enough down
one of the branches to pass by and promote any elements with higher
priorities than the one you are trying to position. In short, you need to
perform the operation that I called ``downheaping'' in the handout on priority queues; Walker uses the
identifier SearchDown for the same procedure. (Note that
SearchDown is invoked in the loop that sets up the partially
ordered array, as shown on page 412. So Walker does address this issue in
the code -- he just doesn't quite follow through on the particular example
in Figure 10.3.)
Why would one use HeapSort instead of sorting by a normally structured binary tree? It seems that it takes a lot more comparisons to retain the heap structure than it does to just insert a datum in a binary tree.
On the average, it does take a few more comparisons, but on the other hand heapsort isn't sensitive to the initial order of the data. The binary search tree sort is O(n^2) if the values to be sorted are already in order or in reverse order, because the search tree that is constructed is, in effect, a linked list. (If the data are inserted into the binary search tree in order, every node's left subtree is empty; if in reverse order, every node's right subtree is empty.) Heapsort, on the other hand, is O(n lg n) under all conditions.
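The degenerate case is easy to observe. The sketch below inserts the same fifteen keys into two plain (unbalanced) binary search trees, once in ascending order and once in an order chosen to fill the tree level by level, and then measures the heights. The type and routine names are invented for the illustration and are simpler than the ones in the handout; the nodes are never deallocated.

```pascal
program HeightDemo;

const
  KeyCount = 15;
  { the keys 1 .. 15, arranged so that each level is filled in turn;
    typed constants are a Turbo/Free Pascal extension }
  LevelOrder: array [1 .. KeyCount] of Integer =
    (8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15);

type
  Tree = ^TreeNode;
  TreeNode = record
    Key: Integer;
    Left, Right: Tree
  end;

{ Inserts Key into the binary search tree T, without any rebalancing. }
procedure InsertKey (Key: Integer; var T: Tree);
begin
  if T = nil then begin
    New (T);
    T^.Key := Key;
    T^.Left := nil;
    T^.Right := nil
  end
  else if Key < T^.Key then
    InsertKey (Key, T^.Left)
  else
    InsertKey (Key, T^.Right)
end;

{ Counts the nodes on the longest path from the root down to a leaf. }
function Height (T: Tree): Integer;
var
  L, R: Integer;
begin
  if T = nil then
    Height := 0
  else begin
    L := Height (T^.Left);
    R := Height (T^.Right);
    if L < R then
      Height := R + 1
    else
      Height := L + 1
  end
end;

var
  Ordered, Mixed: Tree;
  Index: Integer;
begin
  Ordered := nil;
  for Index := 1 to KeyCount do
    InsertKey (Index, Ordered);
  WriteLn ('Ascending order: height ', Height (Ordered));

  Mixed := nil;
  for Index := 1 to KeyCount do
    InsertKey (LevelOrder[Index], Mixed);
  WriteLn ('Level-by-level order: height ', Height (Mixed))
end.
```

The first tree has height 15 (one node per level, in effect a linked list), while the second has height 4, the minimum possible for fifteen nodes.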
If two items have the same priority, in what order are they processed? Are they treated in a first-come first-serve basis as in a regular queue or in some other way?
You can't count on any particular treatment of elements of equal priority. In the implementation using singly-linked lists, the treatment of such elements is last-in, first-out; in the implementation using heaps, it is usually first-in, first-out, but occasionally an element that is added later can be extracted earlier.
What fields are required in the record of a priority queue type?
It depends on how you are implementing the type. You're probably thinking
of the handout's implementation using binary trees, in which case you need
a Size field and a field for the pointer to the root of the
binary tree.
You've mentioned a couple of times that creating a print queue using priority queues would be a good idea. Why don't the printers on campus do this?
On MathLAN, the problem is that the print spooler supplied with the
operating system can't be configured to do this automatically, so you'd
have to write a front end for the lp command and get people
(and software!) to use it; I haven't been able to muster enough enthusiasm
for this project to overcome inertia. I imagine that the story is similar
for other print spoolers around campus.
In what order does one access a priority queue? Obviously the root is accessed first, but which then? Does one do the roots of the first subtrees, or access an entire subtree at a time?
If you study the limited repertoire of operations on priority queues,
you'll notice that the only way to recover an element is to take it from
the root -- that's the only ``accessible'' element, just as the
top element of a stack is the only accessible element. When you extract an
element from the root of a binary tree, the
ExtractForemostFromPriorityQueue operation reconfigures the
rest of the binary tree so that it has a new root, which is always the
element of highest priority among those that remain. The issue you're
worrying about does not arise, because the priority queue abstract data
type does not permit traversals.
I was a bit confused when I was reading the priority queues handout because I was thinking that things which take ``1st priority'' (i.e., a priority of 1) should be at the base of the binary tree. Is there any trick that would allow one to have the lowest numbers represent the highest priorities, or would that be a tedious process of adjusting all the greater than and less than signs in all the procedures and functions?
People's intuitions vary widely about how priorities should work. If your intuition tells you that ``top priorities'' should have numbers of small magnitude, close to zero, you may want to consider multiplying such magnitudes by -1 when storing them in priority queues, so that the ``top priorities'' will be recognized by the procedures and functions as numerically greater than the ``low priorities'' and heaped towards the root of the binary tree.
Some people find this idea even more confusing than upside-down priorities are to begin with. If you're one of them, you'll just have to think things through carefully every time you use or implement priority queues. Reversing greater and less in the priority-queue package will work, if you do it correctly, but since you'll have to think about every single occurrence of an inequality symbol it usually isn't any easier than just learning to cope with the idea that top priorities are numbers of large magnitude.
In your handout on priority queues, you say that the VAX print queues are prioritized according to the size of the job. From experience this is not done on the MathLAN printers. (My experience is that people try and print many-megabyte Netscape pictures and documents and tie up the queue, not allowing other smaller jobs to get through.) Is this indeed the case? I know that there is some prioritizing that goes on as the print manager has a column for priority. What are the print jobs prioritized by?
As each print job is created, it receives a number from 0 (lowest priority) to 7 (highest priority). It is this number that is used to arrange the print jobs that have arrived but have not yet been sent to the printer. On MathLAN, the default priority is always 0, so print jobs are processed in first-come, first-served order. It is possible for a user to specify a higher priority for his own print jobs, but I don't regard this as a useful feature on MathLAN.
How is it possible to determine where in a heap a particular element should be? For example, if the element at the root of the heap has priority 100, is there any simple rule as to whether an element of priority 99 should go on the left branch or the right branch? What about an element of priority 98? Is the only determining factor insertion order, and if so, how does one go about finding the element with the next highest priority?
An element that does not belong at the root of a heap can be placed either in the heap's left subtree or in its right subtree, at the convenience of the programmer. The only thing that the ordering condition on a heap prohibits is placing an element of higher priority in a subtree beneath a node of lower priority.
In a heap, the element of highest priority is at the root. The element of second-highest priority might be either in the left subtree or in the right subtree, but it will be at the root of that subtree (since otherwise the ordering condition would be violated within that subtree). The element of third-highest priority might be in any of three positions -- it might be at the root of either subtree of the subtree headed by the element of second-highest priority, or it might be at the root of the other subtree of the main tree:
     -----------              -----------              -----------
     | highest |              | highest |              | highest |
     -----------              -----------              -----------
       /                        /                        /     \
  ----------               ----------               ----------  ---------
  | second |       or      | second |       or      | second |  | third |
  ----------               ----------               ----------  ---------
    /                             \
---------                      ---------
| third |                      | third |
---------                      ---------
After deleting the highest-priority element, you can always find the
element that should take its place by comparing the elements at the roots
of its subtrees and taking the one that has the higher priority.
Is the insertion routine stable, in the sense that if there are two jobs of equal priority, the first in will be the first out?
No, it is not. It is possible for the earlier job to be inserted into the right subtree of the heap and the later one to be inserted into the left subtree, in which case the later one will be extracted first.
In the FindByNumber function, does the fact that
Parent is a local variable keep the recursive call of the
procedure from setting the return value for FindByNumber
prematurely?
Well, kind of. Getting a pointer to the node that has a specified node
number is really a two-step process: First find a pointer to the parent,
then look left or right depending on whether the node number is even or
odd. The Parent variable stores the information recovered at
the first step so that it can be used in the second step, that's all. It
would be possible to write an equivalent procedure containing a
variable-parameter instead:
procedure FindByNumber (BT: BinaryTree; NodeNumber: Integer;
var NodePointer: BinaryTree);
begin
if NodeNumber = 1 then
NodePointer := BT
else begin
FindByNumber (BT, NodeNumber div 2, NodePointer);
if Odd (NodeNumber) then
NodePointer := NodePointer^.Right
else
NodePointer := NodePointer^.Left
end
end;
But this seems no clearer to me.
The handout on priority queues says that the heap is a binary tree. Is this just a simplification for our implementation? Could you have a heap that was a ternary or greater tree?
You could have a ternary or even a general tree that had the analogous ordering property -- that each node contains an element of higher priority than any of its children -- but the analogue of the downheaping operation would be slower, since it would take longer to determine which child should be promoted. Probably people would be willing to call such a data structure a heap, although I've only seen the term used in connection with binary trees.
Will we have a heap module or priority-queue module available to us?
All the source code you need can be found in the handout on priority queues.
When heap sorting, it doesn't matter if a left subtree's value is greater than a right subtree's, does it?
You're right, it doesn't.
Walker's brief reference to priority queues is somewhat confusing to me. It seems to imply that all jobs of higher priority have absolute priority over all jobs of low priority. But this doesn't quite fit with the Unix process scheduling behavior I see. I can set a long process which does lots of number crunching for the lowest priority the system allows, and it seems to always get CPU time. Is it just that it gets CPU cycles that would otherwise be idle?
As a time-sharing system, Unix tries to give each interactive user the illusion that it is paying constant attention to the user's demands; to assist this illusion, Unix sometimes gives a slice of the processor's time to a user process that has a low nominal priority but has been waiting for a turn for a ``long'' time. If a priority queue is used, the effective priority according to which it is organized is not necessarily the same as the nominal priority. Not all Unix systems use priority queues for scheduling at all. In the MINIX system, for example, the user-imposed priorities are simply ignored; system processes unconditionally run ahead of user processes, and all user processes have the same effective priority -- each one in turn gets the processor for a tenth of a second and then goes to the end of the line.
Walker's description of a job-scheduling algorithm doesn't mention Unix specifically and applies to management of various kinds of system resources, not just the processor. Priority queues are probably more often used to implement print queues (the priority being the negative of the estimated number of pages in the job) than process schedules.
I used a C compiler once that provided a sort routine; the user provided the array, the length of the array, and a ``less than'' method. The documentation said that the sort was a quick sort.
I wonder how appropriate this is. It seems to me that heap sort would be a better general-purpose sort. It doesn't require auxiliary storage, and from the table on p. 423 in Walker, it has a very stable running time no matter how the input data is ordered. Is this merely a preference of mine that's different from the Borland developers, or did I miss something?
The function you're referring to is part of the standard C library, so every C compiler is supposed to provide it. The algorithm it uses doesn't have to be quicksort, though it often is.
Proponents of quicksort say that although both quicksort and heapsort are O(n lg n) algorithms in most cases, the coefficient on the leading term of quicksort's running-time function is smaller. In other words, the ratio of the running times of quicksort and heapsort is fairly constant, regardless of the size of the array, but it is less than 1.
Although quicksort is O(n^2) in the worst case, it is straightforward to arrange for the worst case to be extremely improbable, by selecting the pivot from the middle of the array segment or from a randomly selected position in the segment or as the median of three or five values chosen from various positions in the segment. None of these pivot-selection processes slows the algorithm down significantly.
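To make the median-of-three idea concrete, here is a minimal sketch, assuming Free Pascal or a similar dialect. The names Exchange, MedianOfThree, and QuickSort are mine, not Walker's or the C library's, and the partitioning scheme is just one common choice.

```pascal
program MedianOfThreeDemo;
{ Sketch only: quicksort with a median-of-three pivot. }
const
  N = 10;
type
  IntArray = array [1 .. N] of Integer;
var
  A: IntArray;
  I: Integer;

procedure Exchange (var X, Y: Integer);
var
  T: Integer;
begin
  T := X; X := Y; Y := T
end;

{ Orders the first, middle, and last values of the segment and
  returns the middle one, which then serves as the pivot. }
function MedianOfThree (var A: IntArray; Lo, Hi: Integer): Integer;
var
  Mid: Integer;
begin
  Mid := (Lo + Hi) div 2;
  if A[Mid] < A[Lo] then Exchange (A[Mid], A[Lo]);
  if A[Hi] < A[Lo] then Exchange (A[Hi], A[Lo]);
  if A[Hi] < A[Mid] then Exchange (A[Hi], A[Mid]);
  MedianOfThree := A[Mid]
end;

procedure QuickSort (var A: IntArray; Lo, Hi: Integer);
var
  I, J, Pivot: Integer;
begin
  if Lo < Hi then begin
    Pivot := MedianOfThree (A, Lo, Hi);
    I := Lo; J := Hi;
    repeat
      while A[I] < Pivot do I := I + 1;
      while Pivot < A[J] do J := J - 1;
      if I <= J then begin
        Exchange (A[I], A[J]);
        I := I + 1; J := J - 1
      end
    until I > J;
    QuickSort (A, Lo, J);
    QuickSort (A, I, Hi)
  end
end;

begin
  { Already-sorted input: the bad case for a first-element pivot. }
  for I := 1 to N do A[I] := I;
  QuickSort (A, 1, N);
  for I := 1 to N do Write (A[I], ' ');
  WriteLn
end.
```

On sorted input the median of the three sampled values is the true median of the segment, so each partition splits the segment nearly in half instead of peeling off one element at a time.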
In which situations would it be better to use heap sort than other sorting algorithms, and in which cases (if any) should heap sort not be used?
Heapsort is preferable when you have a large data set (otherwise you'd use selection or insertion sort), memory is scarce (otherwise you'd use merge sort), and you're more concerned about worst-case speed than about average-case speed (otherwise you'd use quicksort). You shouldn't use heapsort when you need a stable sort (one that does not change the relative order of elements with equal keys), either.
How does heap sort compare with other sorting algorithms?
Heapsort is an O(n lg n) algorithm, so it is faster than the O(n^2) insertion and selection sorts if n is sufficiently large. It is slower than most other O(n lg n) algorithms, primarily because it performs more data movements. A table showing typical statistics for a number of sorting methods can be found on page 423 of Walker's textbook.
What is the point of the set data structure? I've never seen it before and can't think of any situations where you couldn't get by without it. In any case it seems like a pretty obscure feature of the language.
Certainly it's one of the less influential features -- the only programming languages I know of that have sets like Pascal's are its direct descendants.
It's basically a mechanism for improving the efficiency of some kinds of tests. If, for instance, you want to know whether a given character is a vowel, the test
Ch in ['a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U']
is likely to run much faster than
(Ch = 'a') or (Ch = 'e') or (Ch = 'i') or (Ch = 'o') or (Ch = 'u') or (Ch = 'A') or (Ch = 'E') or (Ch = 'I') or (Ch = 'O') or (Ch = 'U')
I found out how sets are stored under HP Pascal: Each possible member of the set has its own bit, and the bit is a 1 if the set includes that possible member and a 0 if it does not. Is this correct?
That's right. Of course, depending on the base type of the set and whether
it is defined to be packed, the representation of a set value
may also include many unused bits, which are never examined and so may be
either 0 or 1.
Why is Ch in ['A', 'E', 'I', 'O', 'U', 'a', 'e', 'i', 'o',
'u'] so much more efficient than (Ch = 'A') or (Ch = 'E') or
(Ch = 'I') or (Ch = 'O') or (Ch = 'U') or (Ch = 'a') or (Ch = 'e') or (Ch =
'i') or (Ch = 'o') or (Ch = 'u')?
Because of the way sets are implemented in HP Pascal. A value of the type
set of Char would be a sequence of 128 bits, one for each
possible member of the set, with the bit turned on to indicate that the
corresponding Char value is a member of the set and turned off
to indicate that it is not. The ten-element set referred to in the example
would have bits 65, 69, 73, and so on turned on (because Ord
('A') is 65, Ord ('E') 69, and so on), and the other
118 bits turned off. Testing whether a character Ch is a
member of such a set is simply a matter of finding the bit that corresponds
to Ch and seeing whether it is on; it's one test instead of
ten.
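The representation just described can be imitated by hand. The following is only a model, assuming Free Pascal: it uses one Boolean flag per character code where a real compiler would use one bit, and the names BitVector, AddMember, and IsMember are made up for the illustration.

```pascal
program BitVectorSets;
{ Sketch only: one flag per character code, as a stand-in for one bit. }
type
  BitVector = packed array [0 .. 127] of Boolean;
var
  Vowels: BitVector;
  I: Integer;

procedure AddMember (Ch: Char; var S: BitVector);
begin
  S[Ord (Ch)] := True
end;

{ A single array reference, however many members the set has. }
function IsMember (Ch: Char; var S: BitVector): Boolean;
begin
  IsMember := S[Ord (Ch)]
end;

begin
  for I := 0 to 127 do Vowels[I] := False;
  AddMember ('A', Vowels); AddMember ('E', Vowels); AddMember ('I', Vowels);
  AddMember ('O', Vowels); AddMember ('U', Vowels);
  AddMember ('a', Vowels); AddMember ('e', Vowels); AddMember ('i', Vowels);
  AddMember ('o', Vowels); AddMember ('u', Vowels);
  if IsMember ('E', Vowels) then WriteLn ('E is a vowel');
  if not IsMember ('x', Vowels) then WriteLn ('x is not a vowel')
end.
```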
I seem to remember reading once that Pascal wasn't 100% reliable with
regards to the in-expression. The expression x in [1
.. 500] was likely to fail to correctly detect if x was
between 1 and 500. Is this in fact the case? If so, what good is the
in-expression?
Some implementations of Pascal impose an upper bound on the number of
elements that can be included in the base type of a set; in HP Pascal, for
instance, no set can have a base type containing more than 256 values. The
problem with the expression x in [1 .. 500], therefore, is not
so much with the operator in as with the set-expression that
follows it -- there is no set type that it could belong to in HP Pascal.
The operator in is most useful when you want to test whether
the value of some expression is one of a small collection of values that
happen not to be adjacent. The classical example is
function IsVowel (Ch: Char): Boolean;
begin
  IsVowel := (Ch in ['A', 'E', 'I', 'O', 'U', 'a', 'e', 'i', 'o', 'u'])
end;
which is both simpler and more efficient than performing ten single-character comparisons.
If one wants to see if an integer stored in a character variable is in a
set of integers, does one write charInt in [1..10], or would
one need quotes -- charInt in ['1'..'10']?
It's not possible to store an integer in a variable of type
Char in Pascal. Perhaps you're thinking about storing a
character that we ten-fingered types use as a digit -- one of the ASCII
characters in the range from digit-zero to digit-nine -- in
such a variable. In that case, one would write Ch in ['0'
.. '9']. But now the set is a set of characters rather than a set
of integers. The value of the expression to the left of the
in operator must be of the base type of the value of the set
expression to the right of in.
The expression charInt in ['1'..'10'] is syntactically
incorrect in any context, because a structured value, such as a
two-character string, cannot be an element of a set in Pascal.
Are there any major drawbacks to using the standard Pascal set types, aside from the size requirement and requirement for ordinality?
Yes: It's impossible to ``traverse'' the set, performing some operation once on each element, without running through all of the elements of the base type and testing each one for membership in the set. Frequently the cardinality of the set is much smaller than that of the base type, so this operation has a lot of overhead.
Also, the standard Pascal set types permit fewer operations than one might like. Adjoin, disjoin, and empty are easy to write, but things like cardinality and sunder are a little more difficult.
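Here is a small sketch of what that traversal looks like in practice, assuming Free Pascal (which permits set-valued parameters of this size). Small, SmallSet, and Cardinality are illustrative names. Notice that the loop makes 256 membership tests to count four members.

```pascal
program TraverseSet;
{ Sketch only: counting members by testing every value of the base type. }
type
  Small = 0 .. 255;
  SmallSet = set of Small;
var
  S: SmallSet;

function Cardinality (S: SmallSet): Integer;
var
  Candidate: Small;
  Count: Integer;
begin
  Count := 0;
  for Candidate := 0 to 255 do     { the whole base type, not just S }
    if Candidate in S then Count := Count + 1;
  Cardinality := Count
end;

begin
  S := [2, 3, 5, 7];
  WriteLn ('Cardinality: ', Cardinality (S))
end.
```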
What do the create-singleton, create-doubleton and sunder procedures do?
The create-singleton operation generates and returns a one-member set, given the value that is to become its member. The create-doubleton operation usually generates and returns a two-member set, given the values that are to become its members; if its inputs are identical, it instead returns the one-member set that contains that one value.
The sunder operation generates and returns a subset S' of a given set S, comprising exactly the members of S that satisfy a given condition. For example, if S is a set of integers, one might apply the sunder operation to S and the even operation to obtain the set of all even members of S. I apologize if the name is unfamiliar or unintuitive; it's a translation of the traditional name for the corresponding axiom of set theory, which is the German word `Aussonderung'.
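As an illustration only, here is how the three operations might look on top of the built-in Pascal set types. I am assuming a dialect, such as Free Pascal, that allows functions to return sets. To stay clear of dialect differences in procedural parameters, the condition is fixed as the hypothetical function IsEven; a fuller version would take the condition as a parameter.

```pascal
program SunderDemo;
{ Sketch only: create-singleton, create-doubleton, and sunder. }
type
  Small = 0 .. 255;
  SmallSet = set of Small;
var
  Evens: SmallSet;
  X: Small;

function CreateSingleton (A: Small): SmallSet;
begin
  CreateSingleton := [A]
end;

function CreateDoubleton (A, B: Small): SmallSet;
begin
  CreateDoubleton := [A, B]    { collapses to [A] when A = B }
end;

function IsEven (X: Small): Boolean;
begin
  IsEven := not Odd (X)
end;

{ Keeps exactly the members of S that satisfy IsEven. }
function SunderEven (S: SmallSet): SmallSet;
var
  Candidate: Small;
  Kept: SmallSet;
begin
  Kept := [];
  for Candidate := 0 to 255 do
    if (Candidate in S) and IsEven (Candidate) then
      Kept := Kept + [Candidate];
  SunderEven := Kept
end;

begin
  Evens := SunderEven ([1, 2, 3, 4, 5, 6]);
  for X := 0 to 255 do
    if X in Evens then Write (X, ' ');
  WriteLn
end.
```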
I can understand why the empty set is a special case and requires its own procedure to create, but why is the set with two elements considered a special case? Is create-doubleton really a necessary procedure?
No. In fact, since we have adjoin, neither create-singleton nor create-doubleton is really needed. I put them in just because I was trying to be consistent with the most usual axiomatizations of set theory in mathematics, which include axioms asserting the existence of singleton and doubleton sets.
I noticed that the EveryElementOfBinarySearchTree function
returns True if the tree is nil. That means that the
EveryElementOfSet function will return True if
given the empty set to work with. Why is this desirable?
Because this decision, essentially a convention, removes a special case from the definition of many recursive functions and simplifies code that operates on potentially empty sets. For instance, it seems plausible to hold that if a set S' is formed by adjoining a new element e to a given set S, then every element of S' meets a given condition if, and only if, e meets that condition and so does every element of S. But you have to make an exception to this principle for the case in which S is the empty set, unless you adopt the proposed convention.
This issue actually antedates the computer era; it was a bone of contention in the late nineteenth century between logicians who, following Aristotle, held that the quantifier `every' has existential import -- that a proposition of the form `Every A is B' is false if there aren't any As -- and logicians who, following the practice of mathematicians, held that such a proposition is vacuously true, true precisely because there are no exceptions to falsify it. It is still arguable that in ordinary, non-mathematical English, `every' does have existential import, but the formal systems of logic are so much simpler and more elegant without existential import that it is very seldom recognized.
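The convention is easy to see in code. In the following sketch, assuming Free Pascal, EveryElement is a stripped-down stand-in for the handout's function, and Holds is a made-up condition (``is positive''). The nil case returns True, so the recursive case needs no exception for empty subtrees.

```pascal
program VacuousTruth;
{ Sketch only: "every element satisfies Holds" on a binary tree. }
type
  Tree = ^Node;
  Node = record
    Datum: Integer;
    Left, Right: Tree
  end;
var
  T: Tree;

function Holds (X: Integer): Boolean;
begin
  Holds := X > 0
end;

function EveryElement (T: Tree): Boolean;
begin
  if T = nil then
    EveryElement := True       { no element, hence no counterexample }
  else
    EveryElement := Holds (T^.Datum) and EveryElement (T^.Left)
                    and EveryElement (T^.Right)
end;

function Leaf (X: Integer): Tree;
var
  P: Tree;
begin
  New (P);
  P^.Datum := X; P^.Left := nil; P^.Right := nil;
  Leaf := P
end;

begin
  T := nil;
  if EveryElement (T) then WriteLn ('empty tree: vacuously true');
  T := Leaf (5);
  if EveryElement (T) then WriteLn ('tree holding 5: true');
  T^.Left := Leaf (-2);
  if not EveryElement (T) then WriteLn ('tree holding 5 and -2: false')
end.
```

If the nil case returned False instead, the one-node tree holding 5 would also come out False, because both of its subtrees are empty.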
I'm not clear on what the powerset is. As a result, I'm not sure why it can't be implemented in your module using binary trees. Could it be implemented in Walker's linked-list version? Why did you choose the tree version if it has these drawbacks?
The power set of a given set S is the set of all the subsets of S. For instance, the power set of {1, 2, 3} is { {}, {1}, {2}, {1, 2}, {3}, {1, 3}, {2, 3}, {1, 2, 3} } -- a set of eight members, each of which is itself a set.
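To see where the eight members come from, here is a small sketch, assuming Free Pascal, that lists the subsets of [1, 2, 3]. It uses built-in Pascal sets and only prints each subset; it does not build a set of sets. Each number from 0 to 7, read in binary, selects one subset.

```pascal
program PowerSetDemo;
{ Sketch only: prints the 2^3 = 8 subsets of [1, 2, 3]. }
const
  N = 3;
type
  Member = 1 .. N;
  Subset = set of Member;
var
  Mask, Bits, Count: Integer;
  M: Member;
  S: Subset;
begin
  Count := 0;
  for Mask := 0 to 7 do begin
    S := [];
    Bits := Mask;
    for M := 1 to N do begin
      if Odd (Bits) then S := S + [M];   { lowest bit chooses member M }
      Bits := Bits div 2
    end;
    Write ('{');
    for M := 1 to N do
      if M in S then Write (' ', M);
    WriteLn (' }');
    Count := Count + 1
  end;
  WriteLn (Count, ' subsets')
end.
```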
The reason that power-set isn't implemented in the Sets
module is that the module defines sets of only one type -- the members of
any set that it deals with must be of type Element. The
output of the power-set operation is not of this type; its members
would have to be of type Congeries.
Of course, it would be possible to duplicate the entire collection of set
operations, so as to be able to deal with sets of Congeries,
which might be described as values of type Metaset; but then
there would once more be a problem with the power-set operation,
which if given a value of type Metaset would have to produce a
set containing members of that type -- a Metametaset, and so
on. I thought it better to avoid the infinite regress.
To implement sets correctly, what one really needs is a language that provides polymorphism -- the possibility of allowing the type of, say, a parameter or return value of a function to be determined by the context in which that function is invoked. Standard Pascal provides only a very limited kind of polymorphism, in the form of conformant array parameters, but a few other programming languages are more flexible.
It would not be any easier to add the power-set operation to the linked-list implementation of sets.
Sets by definition are not ordered, but to make other operations faster it is best to order sets, right?
Right. I know that it seems paradoxical. Think of it this way: When you're building a model of anything, say a bridge, there are inevitably going to be some properties of the model that don't correspond to anything in the thing modelled, but are nevertheless essential to making the model work well. Using Elmer's Glue-All instead of tiny wooden rivets to hold the struts together in the model is not necessarily cheating, even though nothing analogous would suffice for the real-world struts. Similarly, the type used to implement an abstract data type may have properties that the abstract data type itself, by definition, lacks -- such as ordered elements.
The reading about sets today made it clear that a set is considered as orderless, but it was suggested to keep the linked-list structure in order anyway, to make it easier to perform operations on the set. However, if we are going to go to the trouble of keeping it ordered, wouldn't it make more sense to allow the user other manipulations of the data which use the ordering properties?
This would be very useful in some applications. However, you'd be
implementing a different abstract data type. I have no objection to that,
of course. This is like asking whether the Stacks module
wouldn't be more generally useful if you could insert and delete elements
anywhere in the structure; the answer is yes, but then you have a list
rather than a stack.
Why have you chosen not to employ the heap property in creating the binary tree structures to make sets? It seems that the whole purpose for implementing the set data type with binary trees is to make the searches as efficient as possible, and the heap property would ensure that.
Only if you're looking for the element with the largest priority, which is known to be at the root of a heap. If one is looking for an arbitrary value in a heap, and it's not at the root, one has no idea whether to look in the left or the right subtree for the item sought. One has to explore every branch of the tree far enough to encounter either the end of the branch or a datum smaller than the value for which one is searching -- on the average, an O(n) operation.
Imposing the binary search tree ordering instead, as I have done in the implementation in the handout on sets, makes it possible to carry out a search by exploring only one branch -- on the average, an O(lg n) operation.
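A minimal sketch of the one-branch search, assuming Free Pascal. InsertValue and Contains are illustrative names, not the routines from the handout. At each node the comparison says which subtree could hold the value, so the other subtree is never visited.

```pascal
program SearchOneBranch;
{ Sketch only: searching a binary search tree along a single branch. }
type
  Tree = ^Node;
  Node = record
    Datum: Integer;
    Left, Right: Tree
  end;
var
  Root: Tree;

procedure InsertValue (X: Integer; var T: Tree);
begin
  if T = nil then begin
    New (T);
    T^.Datum := X; T^.Left := nil; T^.Right := nil
  end
  else if X < T^.Datum then InsertValue (X, T^.Left)
  else if X > T^.Datum then InsertValue (X, T^.Right)
end;

function Contains (Sought: Integer; T: Tree): Boolean;
var
  Found: Boolean;
begin
  Found := False;
  while (T <> nil) and not Found do
    if Sought = T^.Datum then Found := True
    else if Sought < T^.Datum then T := T^.Left   { smaller: go left only }
    else T := T^.Right;                           { larger: go right only }
  Contains := Found
end;

begin
  Root := nil;
  InsertValue (50, Root); InsertValue (30, Root); InsertValue (70, Root);
  InsertValue (20, Root); InsertValue (40, Root);
  if Contains (40, Root) then WriteLn ('40 is present');
  if not Contains (60, Root) then WriteLn ('60 is absent')
end.
```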
Is the binary search tree used in the Sets module only
partially ordered?
No, the elements of the binary search tree are totally ordered. Exchanging the positions of any two of them would cause the ordering property to be violated.
Why did you choose to implement the Sets module with binary
trees instead of singly-linked lists? Since we just learned about binary
trees, were you just showing us an example of how to use them, or is there
some sort of benefit to using them in this situation?
If the sets grow very large, a number of the common set operations (such as adjoin) are faster, on the average, when sets are implemented as binary search trees.
In your implementation of the set abstract data type, how exactly would
one define the type of Element?
Here's how it looks when the set members are real numbers:
{ This module defines an Element data type so that data structures and
  containers can import it.
  Programmer: John Stone, Grinnell College.
  Original version: August 2, 1996.
  Last revised: December 2, 1996.
}
module Elements;
export
  type
    Element = Real;
  function EqualElement (LeftOperand, RightOperand: Element): Boolean;
  function PrecedesElement (LeftOperand, RightOperand: Element): Boolean;
implement
  function EqualElement (LeftOperand, RightOperand: Element): Boolean;
  begin
    EqualElement := LeftOperand = RightOperand
  end;
  function PrecedesElement (LeftOperand, RightOperand: Element): Boolean;
  begin
    PrecedesElement := LeftOperand < RightOperand
  end;
end.
AdjoinToSet returns a newly augmented set, but in order to
keep track of the storage that our original pointer was pointing at we have
to use temporary pointers. What would be the complications in making
AdjoinToSet a procedure that returned the original pointer
pointing to the newly augmented set after having properly disposed of the
old storage?
Sets would then be mutable from the point of view of the application
programmer, who would invoke the new AdjoinToSet in order to
add a member to an existing set rather than to construct a new set, like
the old one except with one additional member. But if sets are going to be
mutable, there is no sense in deallocating the old storage; the implementer
of a MutableSets module would surely just insert the new
member in the existing binary tree instead of copying it and inserting the
new member in the copy.
Of course, if the application programmer really wants the operation you describe, she can define her own procedure to carry it out:
procedure NewAdjoinToSet (var Sett: Congeries; Adjunct: Element);
var
  Temporary: Congeries;
begin
  Temporary := Sett;
  Sett := AdjoinToSet (Temporary, Adjunct);
  DeallocateSet (Temporary)
end;
How would one go about changing the sets module so that AdjoinToSet would not make a new copy of the set, but simply insert the new element?
The AdjoinToMutableSet procedure would look like this:
procedure AdjoinToMutableSet (var Sett: Congeries; Adjunct: Element);
begin
  InsertIntoBinarySearchTree (Adjunct, Sett)
end;
The use of the word `digraph' confused me. Is it just interchangeable with `graph'?
No. The edges of a graph don't have any direction or orientation; an edge connecting vertex u to vertex v is the same thing as an edge connecting vertex v to vertex u. In a digraph, or directed graph, the ends of each arc (or ``directed edge'') are distinguished; an arc from u to v is different from an arc from v to u. The digraph is the more general notion; a graph can be regarded simply as a digraph that happens to be symmetric, in the sense that whenever there is an arc from u to v, there is also an arc from v to u.
On page 361, Walker refers to ``spanning trees,'' which seem to have something to do with whether a graph is connected. However, his definition is very difficult to understand, and I am still confused as to what exactly a ``spanning tree'' actually is.
Suppose you're given a graph. If it's not connected, it doesn't have any spanning trees. If it is connected, then you get a spanning tree by removing edges from the graph, one by one, without disconnecting it, until no more edges can be so removed. A spanning tree is a subgraph that has as few edges as possible while still providing a path from any vertex to any vertex. (It always turns out that the number of edges in the spanning tree is one less than the number of vertices in the graph.)
Usually there is more than one way to form a spanning tree for a connected graph, and in this case it sometimes makes sense to ask which way is best. If there is a reward or a penalty associated with each edge, one might want an algorithm to find out which of the spanning trees contains edges with the greatest total reward or least total penalty. That's the ``minimum spanning tree'' problem that Walker mentions.
Does a Hamiltonian cycle always exist?
No. The existence of at least one Hamiltonian cycle is a presupposition of the traveling-salesperson problem, but it's easy to devise graphs in which there is no such cycle. A graph with two vertices and no edges would be the simplest example.
How would one go about figuring out the order of complexity of the operations on graphs? It seems that the operations depend on the shape of the graph, but there's no ``average'' graph behavior like there is for binary trees. If this calculation has been done, could you quote results?
Besides the difficulty you describe, there's also the problem that the complexity of a graph algorithm may depend on the number of vertices, the number of edges, or both. The complexity of the algorithm for constructing the minimum spanning tree of a graph that Walker presents on page 381 depends on the particular data structures used to implement it; the best implementation I know of is O(E + V lg V), where V is the number of vertices of the graph and E is the number of edges. For the details, consult chapter 24 of Introduction to algorithms, by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest (Cambridge, Massachusetts: The MIT Press, 1990). The best known algorithms yielding exact solutions to the traveling-salesperson problem unfortunately take exponential time in the worst case: exhaustive enumeration of tours is O(V!), and the dynamic-programming algorithm of Held and Karp improves this only to O(V^2 2^V). It is not known whether any algorithm for the traveling-salesperson problem could have a complexity of any polynomial order.
How would a programmer decide between depth-first processing and breadth-first processing? Would this depend on having some knowledge about the shape of the graph? (For example, I noticed that Walker's development of the depth-first processing algorithm can be done recursively; the breadth-first routine with a queue would likely be preferable to the recursive depth-first processing routine in the case of a tree that was ``ringy.'')
Occasionally the shape of the graph is relevant, but more often the nature of the problem directs the programmer to one search mode or the other.
I would propose two additions to Walker's list of operations on graphs.
The first I would call a ``maximal spanning tree''; with Walker's example of airfares, it seems ridiculous, but if you consider a computer game where you are awarded more points for choosing certain paths over others in order to get from start to finish, it makes sense. The object then would be to choose a path which will allow you to accrue the largest possible value while not visiting any vertex more than once.
The second is similar, but I am not sure what I would call it. The operation would be such that it would compute a path which visits every vertex at least once before reaching the ``end,'' and also visits each vertex as few times as possible. Again, this might prove useful in a computer game where you are required to visit, say, every chamber in a castle before being allowed to exit.
These both seem like plausible operations in the game context that you describe. The first one doesn't really involve constructing a spanning tree, though -- just a maximum-length path without cycles, given the endpoints.
Each implementation of the graph type in the book seems to have a limit on the number of edges or on the number of vertices. Is there an implementation that doesn't limit either one?
Yes. You'll find one in the handout on directed graphs.
Are lists and trees the only examples of ordered (or capable of being ordered) graphs? Are there other special graphs which have the ability to be arranged in an order?
The graph structure is extremely general and can model structures with many different kinds of order. Apart from lists and trees, the type most often encountered is the one that mathematicians call a partial ordering -- a digraph that is reflexive (there is an arc from each vertex to itself), transitive (if there is an arc from u to v and an arc from v to w, then there is an arc from u to w), and anti-symmetric (if there is an arc from u to v and an arc from v to u, then u is the same vertex as v).
Lattices are partial orderings in which any two vertices u and v have both a greatest common predecessor (some vertex w such that there are arcs from w to both u and v, and an arc to w from every other common predecessor of u and v) and a least common successor (some vertex w' such that there are arcs from both u and v to w', and an arc from w' to every other common successor of u and v); they too are often treated as a distinct kind of ordering.
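The three defining properties of a partial ordering can be checked mechanically on an adjacency matrix. Here is a minimal sketch, assuming Free Pascal; Digraph and IsPartialOrdering are names invented for the example, and the test graph (an arc from u to v exactly when u divides v) is my own choice.

```pascal
program PartialOrderCheck;
{ Sketch only: tests a small digraph for the partial-ordering properties. }
const
  N = 4;
type
  Vertex = 1 .. N;
  Digraph = array [Vertex, Vertex] of Boolean;
var
  G: Digraph;
  U, V: Vertex;

function IsPartialOrdering (var G: Digraph): Boolean;
var
  U, V, W: Vertex;
  Ok: Boolean;
begin
  Ok := True;
  for U := 1 to N do begin
    if not G[U, U] then Ok := False;                  { reflexive }
    for V := 1 to N do begin
      if G[U, V] and G[V, U] and (U <> V) then
        Ok := False;                                  { anti-symmetric }
      for W := 1 to N do
        if G[U, V] and G[V, W] and not G[U, W] then
          Ok := False                                 { transitive }
    end
  end;
  IsPartialOrdering := Ok
end;

begin
  for U := 1 to N do
    for V := 1 to N do
      G[U, V] := (V mod U = 0);       { arc from U to V when U divides V }
  if IsPartialOrdering (G) then
    WriteLn ('divisibility on 1 .. 4 is a partial ordering');
  G[4, 1] := True;                    { now arcs run both ways between 1 and 4 }
  if not IsPartialOrdering (G) then
    WriteLn ('adding an arc from 4 to 1 breaks anti-symmetry')
end.
```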
Why teach lists and trees first? Why not just discuss them as special cases of graphs as pointer structures?
Because lists and trees are conceptually easier and have simpler and more elegant implementations.
I was hoping to find somewhere in the reading an explanation for why graphs are called graphs, but I didn't. Can you explain it for me? How do they relate to mathematical graphs of functions and the like?
The etymology that follows is speculative. The closest thing to a justification that I have for it is an account of various ways of representing functions, given in section 5.1 of The VNR concise encyclopedia of mathematics, edited by W. Gellert et al. (New York: Van Nostrand Reinhold Company, 1975), from which the diagram below is adapted.
When mathematicians started taking a systematic interest in functions with finite domains, domains that aren't always made up of numbers and may not even be ordered in any obvious way, they wanted some kind of a picture to supplement their conceptual understanding, analogous to the pictures of functions of real numbers that analytic geometry provides -- graphs in the older sense of curves plotted on coordinate systems. Instead of trying to invent coordinate systems for such functions, they represented a function by writing out each member of the domain -- each argument for which the function was defined -- and each member of the codomain, and then drew arrows from argument to value. For example, consider a function f that takes the letters from A to G as arguments and yields geometrical shapes -- a circle, a triangle, a rhombus, a square -- as values. Specifically, let's say that f(A) is the triangle, f(B), f(C), and f(D) are the circle, f(E) and f(F) are the square, and f(G) is the rhombus. Here's the corresponding picture:

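[Diagram not reproduced: two ovals, one holding the letters A through G and the other holding the circle, triangle, rhombus, and square, with an arrow from each letter to its value under f.]
Such diagrams were also called ``graphs,'' because of their operational similarity to graphs of functions of real numbers.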
Sometimes the ovals surrounding the domain and codomain disappeared, especially in cases where the domain and codomain were the same. Then people started drawing similar diagrams to represent relationships other than functions on finite sets, with arrows connecting the things that stood in the relationship. Such a diagram was ``the graph of the relation,'' by analogy with the ``graph of a function'' such as f, and by a kind of metonymy the word `graph' was gradually taken to denote the relational structure itself rather than its written representation.
Let's say we had to buy a given number of farm animals, say pigs, ducks, and cows, and each type of animal had a fixed price. Let's also say that we had to buy a number of each animal, that number falling in a given range, and that the number of each animal purchased is dependent on the number of the other animals purchased -- for example, it might be that for every duck I buy I also have to buy four cows, or something like that. Would some implementation of graphs be able to tell us the combination of farm animals that would cost the least?
I think you're thinking of a mathematical method called ``linear programming.'' I don't see any straightforward connection with graphs as a data type.
For the identity graph, are none of its vertices connected? How does the traversal algorithm handle a graph where the vertices aren't connected?
In an identity graph, each vertex is connected only to itself. If the vertex set has two or more members, the identity graph as a whole is not connected. The depth-first and breadth-first traversal algorithms given in the handout don't guarantee that they will traverse the entire graph, but only the part of it that is accessible from the given starting point. In the case of an identity graph, this part consists of the starting point and nothing else.
Note, however, that an identity graph is not the same thing as a totally unconnected graph. In an identity graph there is an arc from each vertex to itself; in a totally unconnected graph even these arcs are missing.
What would the algorithm involve to find the path through a map that never stops at the same vertex twice?
One approach would be a depth-first search from the designated starting point, with an extra parameter to keep track of which vertices are on the current branch. The search succeeds when this parameter equals the entire vertex set.
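Here is a rough sketch of that search, assuming Free Pascal. The graph, the adjacency-matrix representation, and the names PathFrom and OnBranch are all made up for the illustration; the handout's graph type is not used. The extra parameter is a Pascal set, passed by value so that each branch of the search has its own copy.

```pascal
program HamiltonianPath;
{ Sketch only: depth-first search for a path visiting every vertex once. }
const
  N = 4;
type
  Vertex = 1 .. N;
  VertexSet = set of Vertex;
var
  Adjacent: array [Vertex, Vertex] of Boolean;
  U, V: Vertex;

procedure Connect (A, B: Vertex);
begin
  Adjacent[A, B] := True; Adjacent[B, A] := True
end;

function PathFrom (Current: Vertex; OnBranch: VertexSet): Boolean;
var
  Next: Vertex;
  Found: Boolean;
begin
  OnBranch := OnBranch + [Current];
  if OnBranch = [1 .. N] then
    Found := True                     { every vertex is on this branch }
  else begin
    Found := False;
    for Next := 1 to N do
      if not Found and Adjacent[Current, Next]
         and not (Next in OnBranch) then
        Found := PathFrom (Next, OnBranch)
  end;
  PathFrom := Found
end;

begin
  for U := 1 to N do
    for V := 1 to N do
      Adjacent[U, V] := False;
  Connect (1, 2); Connect (2, 3); Connect (3, 4);   { a simple chain }
  if PathFrom (1, []) then WriteLn ('starting at 1: path found');
  if not PathFrom (2, []) then WriteLn ('starting at 2: no such path')
end.
```

Starting in the middle of the chain fails because whichever way the search goes first, it cannot get back to the other end without revisiting vertex 2.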
Since edges in non-directed graphs are always bidirectional, is there any way one can make the structure for representing those edges smaller than the one for representing edges of directed graphs? Having an entire two-dimensional array for this purpose seems wasteful, since it's just a reflection of itself.
You can cut the space requirement of an adjacency-matrix representation for edges in an undirected graph almost in half, if the vertices belong to an ordinal type or can be given serial numbers. With an ordinary adjacency matrix, you'd use a doubly-subscripted array reference to determine whether there is an edge between vertex u and vertex v:
if G.Edges[U, V] then { ... }
But if the graph is known to be undirected, the adjacency matrix has
n^2 entries, of which (n^2 - n)/2 are redundant.
Suppose, then, that we instead allocate a one-dimensional array
of size (n^2 + n)/2 and use it to hold just the non-redundant
entries. The only difficulty is how to find the entry that tells us
whether there is an edge between vertex u and vertex v. We
need some easily computed function that takes the ordinal values of the
vertices and produces the correct subscript into the array. If the ordinal
values of the vertices are successive natural numbers beginning with 0, the
following function will work. It always returns a value in the range from
0 to (n^2 + n)/2 - 1, so the Edges array should
have this range as its index type.
type
  Natural = 0 .. MaxInt;

function Index (U, V: Vertex): Natural;
var
  OrdU, OrdV: Natural;
begin
  OrdU := Ord (U); { or U.SerialNumber or whatever }
  OrdV := Ord (V);
  { The larger ordinal selects the row; the smaller is the offset in it. }
  if OrdU <= OrdV then
    Index := (Sqr (OrdV) + OrdV) div 2 + OrdU
  else
    Index := (Sqr (OrdU) + OrdU) div 2 + OrdV
end;
The if-statement shown above now takes the form
if G.Edges[Index(U, V)] then { ... }
This is, of course, a little slower than a direct reference to an element
of a two-dimensional array. You're trading time for space if you use this
technique.
Maybe I have missed it somewhere, but what are some practical applications of these directed graphs? I suppose a street map is a natural possibility, but that seems awfully limited.
Suppose an oil company owns several wells and also maintains several processing plants, and wants to build a network of pipelines that connect all of its wells and processing plants, so that oil can flow from any of the wells to any of the processing plants. This is a variation of the spanning-tree problem. If the company also wants to minimize the total length or the total cost of the pipeline, we have a minimum spanning-tree problem.
The traveling-salesperson problem is a real-world problem; in fact, the same structure appears in several different real-world problems. For instance, suppose that you have a robotic tool that uses a laser to punch a tiny hole at any specified position in a metal sheet. You want to be able to program this tool to punch five thousand holes at specified positions in the sheet, as rapidly as possible. It takes longer to move the laser from one position to another if the positions are far apart, so you'd like to make a minimum-length tour of the five thousand positions of the holes ...
Compilers use graphs in several ways. For instance, a compiler might build a graph in which the vertices are tokens of the programming language being compiled and there is an arc from token u to token v if v can follow u in some legal syntactic construction of the language. The edges can be labelled with indications of what the compiler should do when it finds v after u in the source code.
For more examples, see Donald E. Knuth's book The Stanford GraphBase (New York: ACM Press, 1994), which uses graph algorithms for an incredible variety of applications: game-playing, image processing, literary analysis (the characters in Anna Karenina are vertices, and there is an edge between any two characters who appear in the same chapter of the book), economics (sectors of the economy are the vertices of a complete directed graph, and each arc indicates how many millions of dollars' worth of goods and services flowed from one sector to another in 1985), circuit design, and various mathematical problems.
The book also contains a fanciful program that Knuth uses to prove that if Stanford's 1990 football team had played Harvard's, they would have won by 781 points: See, in 1990 Stanford beat Oregon State by 34, and Oregon State beat Arizona by 14, and Arizona beat New Mexico by 15, and New Mexico beat Texas-El Paso by 20, and ... (etc., etc.) ... and Cornell beat Yale by 10, and Yale beat Harvard by 15. Add up all the differences along the path to find out what would have happened if the two teams at the endpoints had met.
Why should a hash table be 20% larger than the number of values one expects to store in it?
Because the mean time required for a search in a hash table, particularly one that uses linear or secondary probing, increases rapidly after the table is eighty percent full; for instance, it would not be surprising to see the mean search time double as the table goes from eighty percent full to ninety-five percent full. So it's best not to let the table get that full if you have a choice.
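To put rough numbers on this, here is a small sketch in C that evaluates the classical approximations for the mean number of probes under linear probing, as a function of the load factor (the fraction of the table that is full). The function names are mine, and the formulas are idealized models from the standard analysis of open addressing, not measurements of any particular table.

```c
/* Approximate mean number of probes for linear probing, as a function of
   the load factor a (0 < a < 1).  These are textbook approximations that
   assume a well-behaved hash function. */

/* Successful search: the key is in the table. */
double linear_hit(double a)
{
    return 0.5 * (1.0 + 1.0 / (1.0 - a));
}

/* Unsuccessful search (or insertion): the key is not in the table. */
double linear_miss(double a)
{
    double gap = 1.0 - a;
    return 0.5 * (1.0 + 1.0 / (gap * gap));
}
```

Under this model a successful search averages about 3 probes at eighty percent full and about 10.5 at ninety-five percent, so ``double'' is, if anything, a cautious estimate for linear probing.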
In hashing, how may collisions be avoided? If they occur, how does the software recover?
Collisions usually cannot be avoided entirely. One can reduce the frequency of collisions by choosing hash functions that effectively scramble or disguise any patterns in the keys submitted to it.
When collisions occur, an implementation of hash tables may accommodate them by linear probing, by probing using a secondary hash function, or by using lists of records (``buckets'') rather than single records as the elements of the underlying array. These methods are discussed in the handout on hash tables.
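As an illustration of the third method, here is a minimal sketch in C of a table whose array elements are buckets (linked lists of records). It is not the code from the handout: the names, the table size of 101, and the simple multiply-by-31 string hash are all assumptions chosen to keep the example short.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 101              /* hypothetical size; a prime */

struct node {                       /* one record in a bucket */
    const char *key;                /* the caller keeps the string alive */
    int value;
    struct node *next;
};

struct node *table[TABLE_SIZE];     /* every bucket starts out empty (NULL) */

/* A deliberately simple string hash, used only to make the sketch runnable. */
unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Insert or update.  A collision just lengthens the bucket's list.
   Returns 1 on success, 0 if storage could not be allocated. */
int put(const char *key, int value)
{
    unsigned i = hash(key);
    struct node *p;
    for (p = table[i]; p != NULL; p = p->next)
        if (strcmp(p->key, key) == 0) {
            p->value = value;
            return 1;
        }
    p = malloc(sizeof *p);
    if (p == NULL)
        return 0;
    p->key = key;
    p->value = value;
    p->next = table[i];
    table[i] = p;
    return 1;
}

/* Search.  Returns a pointer to the stored value, or NULL if the key is absent. */
int *get(const char *key)
{
    struct node *p;
    for (p = table[hash(key)]; p != NULL; p = p->next)
        if (strcmp(p->key, key) == 0)
            return &p->value;
    return NULL;
}
```

Notice that no probing function is needed: two keys with the same hash code simply share a list.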
On page 400, what was Walker talking about when he said to weight with prime numbers?
He's proposing a design for a hash function suitable for keys that are character strings. Experience has shown that if you simply add the ordinal values of the characters together, or even if you multiply each one by its position in the string and then add the results, the values returned by the hash function tend to collide more often than they should. The ideal hash function is one that gives results that are as evenly distributed as successive values from a random-number generator.
Here's how Walker's hash function would be implemented:
type
  Key = String; { from the Strings module }

function HashKey (Opener: Key; ArraySize: Integer): Integer;
var
  Total: Integer;
  Weight: Integer;
  Position: Integer;
  Ch: Char;
begin
  Total := 0;
  Weight := 2; { the least prime number }
  for Position := 1 to LengthOfString (Opener) do begin
    Ch := RecoverByPositionFromString (Position, Opener);
    Total := Total + Weight * Ord (Ch);
    repeat
      Weight := Weight + 1
    until Prime (Weight) and (ArraySize mod Weight <> 0)
  end;
  HashKey := Total mod ArraySize + 1
end;
The Prime function takes any integer and determines whether it
is prime. (By convention, 0, 1, and -1 are not counted as prime;
otherwise, an integer is prime if its only positive divisors are 1 and its
absolute value.) The following implementation does a reasonably fast job
of testing primality provided that Operand is not too large:
function Prime (Operand: Integer): Boolean;
var
  Trial: Integer;
    { a possible divisor of Operand }
  PrimeSoFar: Boolean;
    { indicates whether all possible divisors so far tested have failed to
      divide Operand }
begin
  if Abs (Operand) <= 1 then
    Prime := False
  else if not Odd (Operand) then
    Prime := (Abs (Operand) = 2)
  else begin
    Trial := 3;
    PrimeSoFar := True;
    while PrimeSoFar and (Trial * Trial <= Abs (Operand)) do
      if Operand mod Trial = 0 then
        PrimeSoFar := False
      else
        Trial := Trial + 2;
    Prime := PrimeSoFar
  end
end;
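To see why the weights matter, here is a sketch in C that compares a plain sum of character codes with a prime-weighted sum in the manner of HashKey. The function names and the array size of 101 used in the note below are my own choices, and the character codes are assumed to be ASCII.

```c
/* Trial-division primality test, adequate for the small weights used here. */
int is_prime(int n)
{
    int d;
    if (n < 2)
        return 0;
    for (d = 2; d * d <= n; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Plain sum of character codes: every anagram of a key gets the same code. */
int sum_hash(const char *s, int size)
{
    int total = 0;
    for (; *s; s++)
        total += (unsigned char)*s;
    return total % size + 1;
}

/* Prime-weighted sum: the weights are 2, 3, 5, 7, ..., skipping any
   later prime that divides the array size. */
int weighted_hash(const char *s, int size)
{
    int total = 0, weight = 2;
    for (; *s; s++) {
        total += weight * (unsigned char)*s;
        do
            weight++;
        while (!(is_prime(weight) && size % weight != 0));
    }
    return total % size + 1;
}
```

With an array of size 101, the plain sum sends both `stop` and `pots` to position 51, while the weighted sum sends them to positions 100 and 24 respectively.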
What are some of the more interesting hash functions you have seen? On
average, how easy is it to find a situationally appropriate hash function?
Is it usually best just to use a ``standard'' function and not waste time
trying to find one more suited for the particular problem?
Here are the ones that I mention but do not explicitly implement in the handout on hash tables:
function HashKey1 (Opener: Key { Real }; ArraySize: Integer): Integer;
const
  Phi = 1.618033988749895; { = (1 + Sqrt (5)) / 2 }
var
  Multiple: Real;
begin
  Multiple := Abs (Opener * Phi);
  HashKey1 := 1 + Trunc (ArraySize * (Multiple - Trunc (Multiple)))
end;

function HashKey2 (Opener: Key { String }): Integer;
{ ArraySize must be 128 for this function. }
var
  Total: Integer;
  Position: Integer;

  function Xor (LeftOperand, RightOperand: Integer): Integer;
  var
    Result: Integer;
    Weight: Integer;
    BitNumber: Integer;
  begin
    Result := 0;
    Weight := 1;
    for BitNumber := 0 to 6 do begin
      if Odd (LeftOperand) <> Odd (RightOperand) then
        Result := Result + Weight;
      Weight := Weight * 2;
      LeftOperand := LeftOperand div 2;
      RightOperand := RightOperand div 2
    end;
    Xor := Result
  end;

begin { function HashKey2 }
  Total := 0;
  for Position := 1 to LengthOfString (Opener) do begin
    Total :=
      Xor (Total, Ord (RecoverByPositionFromString (Position, Opener)));
    if Total < 64 then
      Total := Total * 2
    else
      Total := (Total - 64) * 2 + 1
  end;
  HashKey2 := Total
end;
function HashKey3 (Opener: Key { String }; ArraySize: Integer): Integer;
const
  Phi = 1.618033988749895; { = (1 + Sqrt (5)) / 2 }
var
  Total: Real;
  Position: Integer;
begin
  Total := 0.0;
  for Position := 1 to LengthOfString (Opener) do begin
    Total :=
      Phi * (Total + Ord (RecoverByPositionFromString (Position, Opener)));
    Total := Total - Trunc (Total)
  end;
  HashKey3 := 1 + Trunc (ArraySize * Total)
end;
According to Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman
(Compilers: principles, techniques, and tools, Reading,
Massachusetts: Addison-Wesley Publishing Company, 1986), pages 435 and 436,
the following hash function for string keys, written in C and executable on
machines that use thirty-two-bit representations for unsigned integers, has
performed well in experiments with tables of many different sizes:
unsigned hashpjw(char *s) {
  char *p;
  unsigned h = 0, g;
  for (p = s; *p; p++) {
    h = (h << 4) + (*p);
    if (g = (h & 0xf0000000)) {
      h ^= (g >> 24);
      h ^= g;
    }
  }
  return h % ARRAY_SIZE;
}
Aho et al. ascribe this algorithm to P. J. Weinberger.
If you have a fixed set of keys and are going to be doing very large numbers of searches using them, it may be worthwhile to try to find a ``perfect'' hash function that takes each key into a different array subscript, thus avoiding collisions entirely. Be warned, however, that it's usually almost impossible to find such functions by brute-force search, and that the algorithms for constructing them are difficult and slow.
Why is it best for the number of values that can be stored in a hash table to be a prime number, and why, for secondary hashing, should it be two greater than a prime number?
When a proposed hash function turns out not to work very well in practice, the reason is usually that it does not adequately conceal patterns in the keys, so that collisions occur too frequently. In class, I gave an extreme example of this defect: A hash function that took telephone numbers as keys and returned the first three digits of the number as its hash code would give terrible results -- in Grinnell, almost all the telephone numbers would hash to 236 and 269!
In many applications, it turns out that keys are assigned in such a way
that there is a pattern in their factors, often one that is not anticipated
by the programmer. If the hash function is HashKey, from the
handout on hash tables, and
ArraySize happens to be divisible by some small integer, such
patterns may inflate the collision rate, since the resulting hash codes
will show the same bias as the keys. If, however, ArraySize
is a prime, dividing the key by ArraySize is likely to produce
hash codes that are fairly evenly distributed despite the bias in the keys.
Using a large prime as the value of ArraySize eliminates
biases arising from any factors except multiples of ArraySize
itself.
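Here is a small experiment in C that makes the point concrete. It assumes that the hash function simply reduces an integer key modulo the array size (which may or may not match the handout's HashKey exactly), and it feeds that function keys that are all multiples of some step. The function name and the limit of 1024 on the array size are mine.

```c
/* Count how many distinct slots are hit when the keys 0, step, 2*step, ...
   (n of them) are reduced modulo size.  Assumes 0 < size <= 1024. */
int slots_used(int step, int n, int size)
{
    char seen[1024] = {0};
    int i, count = 0;
    for (i = 0; i < n; i++) {
        int slot = (int)(((long)i * step) % size);
        if (!seen[slot]) {
            seen[slot] = 1;
            count++;
        }
    }
    return count;
}
```

One hundred keys that are all multiples of 10 land in only 10 of the slots when the array size is 100, but in 100 different slots when the array size is the prime 101.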
In a hash table that uses a secondary hash function for probing, such as
ProbeWithKey in the handout, one wants both to hash any
information content in the key and to arrange that successive probes with
the same key will yield different positions in the array. The critical
loop is in the FindPosition function and looks like this:
Looking := True;
Position := HashKey (Opener, ArraySize);
while Looking do
  if EqualKeys (Arr[Position].K, Opener) then begin
    Found := True;
    Looking := False
  end
  else if InUse (Store, Position) then
    Position := ProbeWithKey (Position, Opener, ArraySize)
  else begin
    Found := False;
    Looking := False
  end
If the while-loop is executed repeatedly, what happens is that
the array position initially returned by HashKey is occupied
by a different value, so ProbeWithKey is invoked to give a
different position; but that turns out also to be occupied by a different
value, so ProbeWithKey is invoked again to give a
different position; and so on and so on. One wants to ensure that
ProbeWithKey not repeat any position in this sequence until
all or almost all of the other positions in the array have been tried, so
that if there is a vacancy anywhere in the array, repeated calls to
ProbeWithKey will find it. Choosing ArraySize to
be a prime that is also two greater than a prime ensures that the
ProbeWithKey function meets both requirements.
Unbucketed hash tables -- I think I understand the concept of assigning weights to each position in an array of characters to determine where a word or other data item should be stored within an array, but I can't figure out what a good way to determine where to leave blank spaces would be. Walker says, on page 400, that spaces can be left at various positions as needed ... how does one figure out where those spaces need to be? I realize this might vary with what you're trying to store -- let's say glossary of terms entered one by one by the user ... how does the programmer decide where to leave spaces, and how many spaces to leave?
When it is created, every position in the hash table is an empty space; marking each position in the underlying array with some conventional ``no data here'' indicator is part of the initialization of the structure. Subsequently, values are added to the table one by one; applying the hash function to the key with which the value is associated gives you the position at which the value should be stored. The ``blank spaces'' are simply the slots that never happen to get filled during the insertion phase; the programmer does not have to decide where to leave them.
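A minimal sketch in C may make this clearer. It assumes non-negative integer keys, an array of size 11, the value -1 as the ``no data here'' indicator, and linear probing; none of these choices comes from Walker's text.

```c
#define SIZE 11                 /* hypothetical array size */
#define EMPTY (-1)              /* conventional "no data here" marker */

int slots[SIZE];

/* Initialization: every position starts out as a blank space. */
void init_table(void)
{
    int i;
    for (i = 0; i < SIZE; i++)
        slots[i] = EMPTY;
}

/* Insert a non-negative key by linear probing.  Returns the position
   used, or -1 if the table is full. */
int insert_key(int key)
{
    int start = key % SIZE, i;
    for (i = 0; i < SIZE; i++) {
        int pos = (start + i) % SIZE;
        if (slots[pos] == EMPTY || slots[pos] == key) {
            slots[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

After inserting 5, 16, and 27 (which all hash to position 5), positions 5, 6, and 7 are occupied and the other eight are still blank. Nobody decided in advance which ones those would be.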
Why would an unbucketed hash table be preferable to a bucketed one? It seems that there is a greater potential to invoke the rehash function, especially if there is not enough extra room in the data structure.
You're quite right -- indeed, a hash table that uses buckets doesn't need a secondary hash function at all. The only good reason for using an unbucketed hash table is that you may not be able to use dynamically allocated storage for your table implementation, either because you're working in a language that doesn't support it, or because the application doesn't permit it. For example, in maintaining a file-allocation table on a floppy disk, you have a fixed amount of storage -- one track, perhaps -- that can be used for storing the names, sizes, and track and sector numbers of files held on the disk, and you'd like to be able to do constant-time lookup of the information about a file, given its name. A hash table using linear or secondary probing rather than bucketing is just what you need.
I'm not sure why one would want to use a hashing algorithm when there seem to be so many better, faster ones out there. How do these (I realize that there may be any number of different algorithms) in general compare with others?
Both insertion and search in a hash table are O(1) algorithms. You've got something faster than this?!
Admittedly, there's some overhead in the call to the hash function -- not every hash function is as fast as one would like it to be. But the hash function runs in the same amount of time regardless of the size of the table, whereas in a binary search tree the searches slow down as the binary tree grows.
Do hash functions change substantially when they are designed for different numbers of elements? In The Java Programming Language, I noticed that the authors point out that delays may result when hash tables are rebuilt. Does this result from a new hash function, or merely having to rehash all the elements in a larger table?
(Context for students unfamiliar with Java: In Java, one of the standard
libraries provides a Hashtable data type which has the nice
feature that when the number of values stored in the hash table exceeds
three-fourths of the size of the underlying array, the insert
operation automatically allocates storage for a larger array and transfers
all the previously inserted elements from the old array to the new one; the
old array is then eligible for recycling.)
The hash function used is likely to be identical; the delay that results is
a consequence of having to rehash and insert every value stored in the old
array so that it will be in the correct position in the new one. (The
position number delivered by the hash function for a given key will change
when the size of the array changes, since ArraySize is a
parameter used in the hash function.)
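The rebuilding step can be sketched in C as follows. This is not how the Java library does it; it is a stripped-down illustration that assumes non-negative integer keys, -1 as the empty marker, and linear probing, with function names of my own.

```c
#define EMPTY (-1)   /* hypothetical "no data here" marker; keys are >= 0 */

/* Place key in a table of the given size by linear probing.
   Assumes the table is not full. */
void place(int *table, int size, int key)
{
    int pos = key % size;
    while (table[pos] != EMPTY)
        pos = (pos + 1) % size;
    table[pos] = key;
}

/* Rebuild: the hash function is the same, but every key must be hashed
   again because the size it is reduced by has changed. */
void rehash(const int *old_table, int old_size, int *new_table, int new_size)
{
    int i;
    for (i = 0; i < new_size; i++)
        new_table[i] = EMPTY;
    for (i = 0; i < old_size; i++)
        if (old_table[i] != EMPTY)
            place(new_table, new_size, old_table[i]);
}
```

For instance, in a table of size 5 the keys 12 and 7 both hash to position 2, so they occupy positions 2 and 3. After rehashing into a table of size 11 they sit at positions 1 and 7.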
I've heard the term `hash table' before and think I've seen it used, but I'm not sure. Is this method used somewhere in the Unix (or ULTRIX) operating system? If so, can you give an example as to where it is used? I might understand it better if I can see a concrete example (outside of the text).
There are lots of places where tables are used in Unix. For instance, the Unix kernel maintains a process table, keyed by process ID; the password file is a table of account records, keyed by username; a directory is a table of files, keyed by filename, and so on. A hash-table implementation could be used for any of these, though I imagine that most versions of Unix use arrays.
A cache memory is a sort of hash table in hardware in which the hash function involves selecting some sequence of bits from the address at which the cached value is actually supposed to be stored, and collisions are ``handled'' by copying the previously cached value out to memory if necessary and then overwriting it with the new value.
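As a rough illustration, here is how the ``hash function'' of a simple direct-mapped cache might be written in C. The geometry (64-byte lines, 256 lines) is a made-up example, not a description of any particular machine.

```c
#include <stdint.h>

#define LINE_BITS  6     /* hypothetical: 64-byte cache lines */
#define INDEX_BITS 8     /* hypothetical: 256 lines in the cache */

/* The "hash function": a run of bits taken from the middle of the address. */
unsigned cache_index(uint32_t address)
{
    return (address >> LINE_BITS) & ((1u << INDEX_BITS) - 1);
}

/* The remaining high-order bits (the tag) tell colliding addresses apart. */
uint32_t cache_tag(uint32_t address)
{
    return address >> (LINE_BITS + INDEX_BITS);
}
```

Two addresses that differ by a multiple of 16384 (64 times 256) get the same index and therefore compete for the same cache line; the tag stored with the line records which of them is currently there.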
As I mentioned in class, it is quite common for the symbol table in a compiler to be implemented as a hash table. Chapter 3 (``Symbol management'') of A retargetable C compiler: design and implementation, by Christopher Fraser and David Hanson (Menlo Park, California: Addison-Wesley Publishing Company, 1995), gives a detailed description of symbol table management in a C compiler, complete with source code (in C).
Is hashing mostly just usable for sorting (by copying an unsorted list into a hashed list)? Are there any other uses?
Hashing is not used for sorting, since the order in which the values appear in the hash table does not reflect any natural ordering either of the values themselves or of the keys.
Although the primary use of hashing is the construction of data structures for fast searching, I know of two programming languages (Common Lisp and Java) in which there is a predefined function that returns a hash code for any given object; applications might use such hash codes to speed up equality tests for objects with complex internal structures.
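The equality-test trick looks something like this in C. The structure and function names are invented for the illustration, and the hash is the same simple multiply-by-31 scheme one might use for a quick sketch, not the one that Common Lisp or Java actually provides.

```c
#include <string.h>

struct name {
    unsigned hash;       /* hash code computed once, when the object is built */
    const char *text;
};

unsigned hash_text(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

struct name make_name(const char *text)
{
    struct name n;
    n.hash = hash_text(text);
    n.text = text;
    return n;
}

/* Different hash codes prove that the objects differ; equal codes prove
   nothing, so the full comparison is still needed in that case. */
int same_name(const struct name *a, const struct name *b)
{
    if (a->hash != b->hash)
        return 0;
    return strcmp(a->text, b->text) == 0;
}
```

The saving comes from the first test: most unequal objects are rejected after comparing two integers, without ever looking inside them.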
I've heard the Unix password encryption algorithm (crypt)
described as a ``one-way hash function.'' Does this have any relation to
hash tables or hashing in general?
The term `hashing' comes from the English verb `to hash', in the sense of
confusing or muddling something. The crypt function takes a
user's password and a two-character ``salt'' and returns an encrypted
version. The information content of each bit of the password is distributed
over all the characters of the result; changing one bit of the password
will change as many as sixty-six bits of the resulting string. So it's
very difficult to determine anything about the password by inspecting the
encrypted version, even if you know how the crypt function
works; that's the sense in which it's a ``one-way'' function.
Hash functions, as used in implementations of tables, are similarly supposed to distribute all of the information content of a key over the entire range of possible array subscripts. Even if there is a pattern common to many of the keys submitted, this pattern should not be discernible in the results produced by a hash function -- because if it is, there are going to be too many collisions. So the theory of hash tables and the theory of one-way encryption functions are closely connected by the idea of using some algorithm to disguise or suppress the information content of values of some data type.
Would it be possible to make a bucketed hash table with a rehash function? For example, could you set a limit to the size of the small lists and call the rehash function if this limit is reached?
Yes. One could either add such an operation to the abstract data type or, better, incorporate it into the insert operation. If adding a new datum would make the hash table more than eighty percent full, then the insert operation should allocate a larger array, insert each of the items in the existing hash table into the larger array, and deallocate the old array, replacing it with the new one. Languages that support a table data type directly (SNOBOL, Icon, Common Lisp, Java) generally provide such a facility.
The problem with implementing a rehash operation in Pascal is that unless one anticipates the size of every array that might be needed and declares an appropriate data type for each size, there's no way to allocate and deallocate the arrays dynamically. HP Pascal provides a way around this, but it's not easy to use. In case you're interested, here are the details:
What HP Pascal allows you to do is allocate and deallocate a block of
storage of any size (as measured in bytes), with any desired alignment, and
then to treat that block as if it were carved up into array elements. The
allocation procedure is called P_GetHeap, and the deallocation
procedure is P_RtnHeap. They are predefined procedures that
have, in effect, the following headers:
procedure P_GetHeap (var RegionPointer: LocalAnyPtr;
                     RegionSize: Integer;
                     Alignment: Integer;
                     var OK: Boolean);
procedure P_RtnHeap (var RegionPointer: LocalAnyPtr;
                     RegionSize: Integer;
                     Alignment: Integer;
                     var OK: Boolean);
The first argument to either procedure can be a variable of any pointer type. The
P_GetHeap procedure stores the address of the first
byte of the allocated storage into that variable. The
RegionSize parameter indicates how many bytes of storage
should be allocated; the Alignment parameter, which must have
one of the values 1, 2, 4, 8, 16, 32, 64, or 2048, indicates whether the
chunk of storage should be byte-aligned, halfword-aligned, word-aligned,
..., 64-byte-aligned, or page-aligned. P_GetHeap sets its
OK parameter to True if it succeeds in finding a
free block of storage with the required properties, to False
if the allocation fails.
To deallocate that block of storage, invoke P_RtnHeap with its
address as the first parameter and the same values for
RegionSize and Alignment that were used to
allocate it. P_RtnHeap sets OK to
True if it succeeds in deallocating the block, to
False if it does not.
For example, suppose we want to allocate space for a hash table of size
1609 in which the keys are Social Security numbers (of type array [1
.. 9] of Char) and the values are 196-byte records requiring word
alignment. Here's how the creation procedure might look:
const
  InitialArraySize = 1609;
  SizeOfKeyAndValue = 208; { one key-and-value record occupies 208 bytes }
type
  KeyAndValue = record
    K: Key;
    V: Value
  end;
  TablePointer = ^KeyAndValue;
  Table = LocalAnyPtr;
    { A table is a block of dynamically allocated storage in which the
      first four bytes will contain the current size of the table and the
      next four its current load (the number of values currently stored
      in it); the remaining bytes will be occupied by a sequence of
      208-byte KeyAndValue records -- as many of them as the current size
      field says there are. }

function CreateTable: Table;
var
  Result: Table;
    { the hash table under construction }
  Success: Boolean;
    { indicates whether the allocation succeeded }
  Header: ^Integer;
    { points to one of the integer fields at the beginning of the block }
  Index: Integer;
    { counts off the components of the hash table }
  Cursor: TablePointer;
    { points to successive components of the hash table proper }
begin
  P_GetHeap (Result, 4 + 4 + InitialArraySize * SizeOfKeyAndValue,
             4, Success);
  { Assert (Success); }
  Header := Result;
  Header^ := InitialArraySize;
  Header := AddToPointer (Header, 4);
  Header^ := 0;
  Cursor := AddToPointer (Result, 8);
  for Index := 1 to InitialArraySize do begin
    AssignAbsentValue (Cursor^.V);
    Cursor := AddToPointer (Cursor, SizeOfKeyAndValue)
  end;
  CreateTable := Result
end;
AddToPointer is a predefined HP Pascal function that takes a
pointer of any type, reinterprets it as an unsigned integer, adds an
integer increment to it, reinterprets the result as a pointer of the same
type, and returns that pointer result. Here I've used it to work my way
through the block of storage that P_GetHeap allocates, placing
the initial table size in the first four bytes, the initial load (0) in the
next four, and assigning the conventional ``absent'' value to each of the
components of the ``array'' making up the table.
Subsequently, instead of referring to, say, Arr[Position] to
pick out one element of the hash table, one would have to compute its
address and access it through that address:
Pointer := AddToPointer (Result, 4 + 4 + (Position - 1) * SizeOfKeyAndValue);
if EqualKeys (Pointer^.K, Opener) then { ... }
All this is pretty cumbersome, but it would make it possible to rehash the
table, increasing its size as needed:
procedure RehashTable (var Store: Table);
var
  Header: ^Integer;
  OldSize: Integer;
  NewSize: Integer;
  NewStore: Table;
  Success: Boolean;
  Cursor: TablePointer;
  Index: Integer;
begin
  Header := Store;
  OldSize := Header^;
  NewSize := NextPrimeAfter (2 * OldSize + 1);
  P_GetHeap (NewStore, 4 + 4 + NewSize * SizeOfKeyAndValue, 4, Success);
  { Assert (Success); }

  { Store the new header information. }
  Header := NewStore;
  Header^ := NewSize;
  Header := AddToPointer (Header, 4);
  Header^ := 0;

  { Initialize the new ``array.'' }
  Cursor := AddToPointer (NewStore, 8);
  for Index := 1 to NewSize do begin
    AssignAbsentValue (Cursor^.V);
    Cursor := AddToPointer (Cursor, SizeOfKeyAndValue)
  end;

  { Copy all the values from the old hash table into the new one. }
  Cursor := AddToPointer (Store, 8);
  for Index := 1 to OldSize do begin
    if not AbsentValue (Cursor^.V) then
      InsertInTable (NewStore, Cursor^.K, Cursor^.V);
    Cursor := AddToPointer (Cursor, SizeOfKeyAndValue);
  end;

  { Discard the old hash table and replace it with the new one. }
  DeallocateTable (Store);
  Store := NewStore
end;
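For comparison, the same layout (a small header followed directly by the records, all in one dynamically allocated block) is easier to write down in C, where the size of an array need not be known at compile time. The sketch below is only an analogy to the HP Pascal code above. The type and function names are mine, and treating an all-zero record as the ``absent'' value is an assumption of the sketch.

```c
#include <stdlib.h>

struct entry {               /* hypothetical record layout */
    char key[9];             /* e.g. a nine-character key */
    char value[196];
};

struct table {               /* header followed directly by the records */
    int size;                /* current number of slots */
    int load;                /* number of values currently stored */
    struct entry slots[];    /* C99 flexible array member */
};

/* Allocate a table with the given number of slots.  calloc zeroes the
   whole block, so load is 0 and every slot is marked absent.
   Returns NULL if the allocation fails. */
struct table *create_table(int size)
{
    struct table *t = calloc(1, sizeof *t + (size_t)size * sizeof t->slots[0]);
    if (t != NULL)
        t->size = size;
    return t;
}
```

Here the compiler does the address arithmetic: `t->slots[i]` picks out a record without any explicit call corresponding to AddToPointer, and `free(t)` releases the whole block.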
How would one come up with, say, the value of phi in the hash function
you displayed on the questions page?
post% scheme
Chez Scheme Version 5.0c
Copyright (c) 1994 Cadence Research Systems

> (/ (+ 1 (sqrt 5)) 2)
1.618033988749895

Perhaps you're asking why one would choose phi as a multiplier in the first place. The answer is that the fractional parts of successive multiples of phi are distributed over the interval [0, 1) in the most uniform way possible: If one plots the fractional parts of these successive multiples on a number line representing that interval, so that the interval can be seen as separated into segments by the previously plotted points, each new point is placed in one of the largest remaining segments, and no segment is ever more than phi-squared times as large as any other, which is optimal. Proving this is part of exercise 9 in section 6.4 in Donald E. Knuth, Sorting and searching, volume 3 of The art of computer programming (Reading, Massachusetts: Addison-Wesley Publishing Company, 1973); Knuth's answer to this exercise is found on page 688.
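You can watch this happen with a few lines of C. The sketch below follows the pattern of HashKey1 for small non-negative integer keys; the function names are mine.

```c
#define PHI 1.618033988749895    /* (1 + sqrt(5)) / 2 */

/* Fractional part of k * PHI, for a non-negative integer k. */
double frac_multiple(int k)
{
    double m = k * PHI;
    return m - (double)(long)m;
}

/* Slot in 1..size for integer key k: scale the fractional part up to
   the size of the array. */
int phi_slot(int k, int size)
{
    return 1 + (int)(size * frac_multiple(k));
}
```

With an array of size 10, the keys 1 through 10 land in slots 7, 3, 9, 5, 1, 8, 4, 10, 6, and 2, one key per slot, even though consecutive integers are about as patterned as a set of keys can be.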
In hashing the elements are stored in a location that some formula appears to make random. This makes the number easy to look for. But if you wanted to print the entire table in order, would hashing be comparable to a binary tree?
No. If you have a lot of values stored in a hash table, the only good way to print them out in order is to copy them into some other data structure and sort the result. The order in which they would be encountered during a traversal of the hash table does not correspond either to a natural order of the values or to a natural order of the keys.
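The copy-and-sort approach might look like this in C, using the standard library's qsort. The sketch assumes a table of non-negative integers with -1 as the empty marker; the function names are mine.

```c
#include <stdlib.h>

#define EMPTY (-1)   /* hypothetical "no data here" marker */

/* Comparison function in the form that qsort requires. */
int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Copy the occupied slots of table into out (which must have room for
   size elements), sort the copy, and return the number of values copied. */
int sorted_contents(const int *table, int size, int *out)
{
    int i, n = 0;
    for (i = 0; i < size; i++)
        if (table[i] != EMPTY)
            out[n++] = table[i];
    qsort(out, (size_t)n, sizeof out[0], compare_ints);
    return n;
}
```

The traversal itself takes time proportional to the size of the array, not to the number of values stored, and the sort is an extra cost on top of that; an inorder traversal of a binary search tree delivers the values in order with no sorting at all.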
How important is the use of statistics to the design of hash functions?
In theory, it's extremely important. In practice, programmers tend not to look very hard at the statistical justification for the hash functions they use; if it's in the library, it's good enough. The people who write the libraries therefore have an immense responsibility to study the statistical justifications carefully when selecting their algorithms.
The C shell man page says that the shell maintains a hash table of commands so that it is easier to search for frequently used commands. Do you know what is stored in the table?
The keys are the command names and the values are the pathnames of the corresponding executable files.
I noticed that hash tables use keys. Can they be used to do encryption?
A hash function of the kind that is used in a hash table is useless for encryption, because it's not one-to-one -- many different ``plaintexts'' hash to the same ``ciphertext.'' (That's why there are collisions.) So there would be no algorithm for decrypting the ciphertext -- you could never determine which of several possible plaintexts had been enciphered to produce the given ciphertext.
Functions that are used for encryption are sometimes also called ``hash functions,'' because they ``hash'' the information content of the plaintext (i.e., they confuse or muddle it, spreading out the content of each bit of the plaintext over the entire ciphertext), but because encryption functions have to be one-to-one they are not useful in hash tables. What leads one to use a hash table in the first place is that the range of possible keys is too large; a one-to-one hash function would produce hash codes in a range that is just as large as the original range of keys!
Please give some practical examples where hash tables have been implemented.
As I mentioned in class, it is quite common for the symbol table in a compiler to be implemented as a hash table. Chapter 3 (``Symbol management'') of A retargetable C compiler: design and implementation, by Christopher Fraser and David Hanson (Menlo Park, California: Addison-Wesley Publishing Company, 1995), gives a detailed description of symbol table management in a C compiler, complete with source code (in C). Fraser and Hanson's hash tables use buckets.
In the questions a few days ago, I mentioned that Donald E. Knuth's book The Stanford GraphBase (New York: ACM Press, 1994) contains a whimsical program that constructs tendentious analyses of football scores. This program uses a hash table of size 1009, with buckets, to store the full names of the teams involved (e.g., ``Wake Forest Demon Deacons''), using a short abbreviation of the team name (e.g., WAKE) as the key. See pages 228 and 229 of the book for details.
Knuth is also the author of the TeX document compiler and typesetter. The complete source code for TeX, published as TeX: the program (Reading, Massachusetts: Addison-Wesley Publishing Company, 1986), includes the code for a hash table (of size 1777) in which the definitions of TeX's ``control sequences'' -- the procedures of the markup language -- are stored. In this case, Knuth used a technique called ``coalescing lists,'' which essentially means manually allocating storage for the buckets -- the linked lists -- from within an overflow area in a statically allocated array. See part 18 of TeX: the program (sections 256 through 267, pages 107-113) for details.
How does one invoke procedures and functions containing procedural and functional parameters?
In the argument position of the call, one writes the name of a programmer-defined procedure or function -- just the name by itself, not a call to it. The procedure or function that is used as an argument must have been defined at a point in the program that precedes its use as an argument.
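The same rule holds in C, which may be more familiar to some of you. In this sketch (the names are invented), apply_twice has a functional parameter, and the call passes the bare name successor rather than a call to it.

```c
/* Defined before it is used as an argument. */
int successor(int n)
{
    return n + 1;
}

/* apply_twice has a functional parameter f. */
int apply_twice(int (*f)(int), int x)
{
    return f(f(x));
}

/* The argument is just the name of the function, not a call to it. */
int seven(void)
{
    return apply_twice(successor, 5);
}
```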
When invoking a procedure or function, imported from a module, that has a functional parameter, is it necessary to define, in my own program, the function that is used as the argument corresponding to the functional parameter?
It's necessary for that function to be defined somewhere. If it's not defined in the module containing the procedure or function that has the functional parameter, or in any other module that you import, then you have to define it yourself.
What does ``procedure alignment'' refer to? The address where the code for the procedure starts?
Not quite. It is possible, in Pascal, to pass a procedure or function as a parameter to another procedure or function; when this is done, however, the Pascal implementation actually has to build a data structure known as a closure containing the address at which the code for the procedure or function starts, the values of any non-local identifiers that the procedure or function refers to, and possibly other useful information. ``Procedure alignment'' is the alignment used for this data structure.
If we can pass functions to procedures as part of the parameter list, why
stop there? Why shouldn't we be able to declare function-variables? Why not
function-types? Why not this: for f := ln to square do
f(x)?
In some languages, such as Pascal's near-twin Modula-2, you can define procedure and function types and declare variables of such types. That it is not permitted in Pascal is nothing more than historical accident -- people didn't fully realize how useful the facility would be until after Pascal was standardized.
It is unlikely that a Pascal-like language would allow a procedure or function variable to be a loop control variable, since only ordinal types are permitted in such a context, and there is no natural ordering of functions or procedures. But in Modula-2 one could declare an array of functions, store the various desired functions in the array, and loop through the indices of the array, selecting each function by its index.
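I'll illustrate the array-of-functions idea in C rather than Modula-2, since C also has function types and the pattern is the same. The three functions here are arbitrary stand-ins.

```c
double square(double x) { return x * x; }
double halve(double x)  { return x / 2.0; }
double negate(double x) { return -x; }

/* A function type, and an array whose elements are functions of that type. */
typedef double (*real_function)(double);

real_function functions[] = { square, halve, negate };

/* Loop through the indices of the array, selecting each function by its
   index and applying it to x; the results are summed. */
double sum_of_applications(double x)
{
    double total = 0.0;
    int i;
    for (i = 0; i < 3; i++)
        total += functions[i](x);
    return total;
}
```

The loop control variable is an ordinary integer; the ordering comes from the array, not from the functions themselves.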
I thought of a possible problem with Pascal internal security. In order
to prevent one from destructively-updating a variable within a procedure,
one can remove the var part in the procedure declaration, thus
making the parameter pass-by-value. However, this procedure could then
call another with the same variable, where that second procedure's
declaration marks the parameter as var, thus making it
pass-by-reference. Am I just being paranoid or is this a danger?
Let's consider a program that implements the arrangement you describe. The
main program calls Trusted, which has a value parameter and
hence promises not to make any change in the corresponding argument; but
Trusted calls Saboteur, which has a variable
parameter and sneakily replaces the value stored in the corresponding
argument.
program Foo (Output);
const
  BadValue = -1;
  GoodValue = 42;
var
  Bar: Integer;

  procedure Saboteur (var Bar: Integer);
  begin
    Bar := BadValue;
    WriteLn ('In Saboteur: Bar = ', Bar : 1)
  end;

  procedure Trusted (Bar: Integer);
  begin
    WriteLn ('In Trusted (before call to Saboteur): Bar = ', Bar : 1);
    Saboteur (Bar);
    WriteLn ('In Trusted (after call to Saboteur): Bar = ', Bar : 1)
  end;

begin
  Bar := GoodValue;
  WriteLn ('In main program (before call to Trusted): Bar = ', Bar : 1);
  Trusted (Bar);
  WriteLn ('In main program (after call to Trusted): Bar = ', Bar : 1);
end.
Here's the output from this program:
In main program (before call to Trusted): Bar = 42
In Trusted (before call to Saboteur): Bar = 42
In Saboteur: Bar = -1
In Trusted (after call to Saboteur): Bar = -1
In main program (after call to Trusted): Bar = 42
Saboteur has managed to store the bad value into the parameter
Bar within Trusted, which foolishly invoked it,
but this has no effect on the main program's variable Bar,
because the main program provides to the Trusted procedure
only the value of this variable and not the actual location in which it is
stored.
When a procedure that has a value parameter is invoked, a separate, otherwise unused storage location is allocated and the value of the argument is copied into that new storage location. If the contents of this new storage location are changed, either by an assignment to the value parameter or by passing it by reference to another procedure or function, that still has no effect on the original argument. On the other hand, when a procedure that has a variable parameter is invoked, the parameter becomes an alias for the corresponding argument -- an alternative name for exactly the same storage location. So assignments to the parameter affect the corresponding argument as well.
Thus the answer to your question is that so long as the caller provides information to other procedures and functions only through value parameters, the values of its own variables cannot be modified. The designers of Pascal were acutely aware of the possibility you envision, but the design they came up with does effectively prevent the breach of security that you were concerned about.
Last year I wrote a Pascal program in which I had a Boolean variable to which I assigned the result of applying a logical operator -- something along the lines of
B {the Boolean} := num > 36;
Then I just had an if-statement which began if B then
... Although my program worked with this feature, the teacher told
me to be very careful with that type of construction. What is so dangerous
about this use of Booleans and is it even useful to use Booleans in this
manner in the first place?
It's really a stylistic point. Some people avoid assignments to Boolean variables because they're supposedly hard to read. I generally enclose the right-hand side of such assignments in parentheses, thus:
B := (num > 36);
Even though the parentheses are theoretically superfluous, their psychological effect is to force the reader to notice that the Boolean expression will be completely evaluated before any part of the assignment is attempted.
In the particular example you've cited, you don't really need the Boolean
assignment unless you're going to use the value of B again at
some later point; you could simply write if num > 36 then, and
so on. But Boolean assignments are often extremely useful, especially in
controlling the order in which tests are performed.
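As one illustration of that last point, here is a minimal sketch of a search loop in which a Boolean variable records the outcome of a test. It is not taken from any of the handouts: the program name, the array Data, its contents, and the limit 36 are all made up for the example. Because the array is examined only inside the loop body, it is never indexed out of range, whichever order the operands of and are evaluated in.

```pascal
program BooleanFlagDemo (Output);
{ Hypothetical example; none of these names come from the handouts. }
const
  Size = 5;
var
  Data: array [1 .. Size] of Integer;
  Index: Integer;
  Found: Boolean;
begin
  Data[1] := 12;
  Data[2] := 40;
  Data[3] := 7;
  Data[4] := 36;
  Data[5] := 3;

  Index := 1;
  Found := False;
  { Both operands of "and" are safe to evaluate in either order. }
  while (Index <= Size) and not Found do begin
    { Data[Index] is examined only when Index is known to be in range. }
    Found := (Data[Index] > 36);
    if not Found then
      Index := Index + 1
  end;

  if Found then
    WriteLn ('First value above 36 is at position ', Index : 1)
  else
    WriteLn ('No value above 36')
end.
```

With the sample data shown, the loop stops at the second element, so the program reports position 2.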
When is a recursive procedure quicker than an iterative one?
When the language implementation carefully optimizes function calls and returns, and when the body of the recursive procedure is much simpler than the body of the iterative one. Specifically, iterative procedures sometimes spend a lot of time shuffling data values around so that values that are no longer needed are overwritten with current ones; recursion just leaves them in place and starts up a new function call instead. Compare recursive and iterative functions for computing the greatest common divisor of two natural numbers:
function RecursiveGCD (First, Second: Integer): Integer;
begin
  if Second = 0 then
    RecursiveGCD := First
  else
    RecursiveGCD := RecursiveGCD (Second, First mod Second)
end;

function IterativeGCD (First, Second: Integer): Integer;
var
  Remainder: Integer;
begin
  while Second <> 0 do begin
    Remainder := First mod Second;
    First := Second;
    Second := Remainder
  end;
  IterativeGCD := First
end;
The two functions generate exactly the same sequence of pairs of values
for the variables First and Second. In the
recursive version, they show up as arguments to successive function calls;
in the iterative version, as variable values on successive iterations of
the while-loop. The recursive version makes extra function
calls and returns; the iterative version performs extra assignments. In
most Pascal implementations the iterative version is faster in this case,
because function calls are more expensive than assignments. But in Scheme,
where a tail call such as the one in RecursiveGCD must be carried out
without using additional stack space, the reverse would be true.
I've been told that `hard-coding' paths into programs can result in agony later on, and I've seen some of the results. I assume that if Pascal is not given an absolute path it looks in the current working directory, but can it be told to search in certain paths for files?
Not in HP Pascal. Actually, I don't think I've ever used a version of Pascal that provides this facility.
Since you expressed such a bitter dislike for labels, under what circumstances would you consider it preferable to use them?
When the alternatives are no better. For instance, here's a little stretch of code from one of the handouts later in the semester:
while True do
  if Operand^.Cursor = Nil then
    goto 99
  else if Test (Operand^.Cursor^.Datum) then
    goto 99
  else
    Operand^.Cursor := Operand^.Cursor^.Next;
99:
The idea is that I want to execute the statement Operand^.Cursor :=
Operand^.Cursor^.Next repeatedly until either it becomes
Nil or the call to the Test function succeeds;
but I can't write
while (Operand^.Cursor <> Nil) and not Test (Operand^.Cursor^.Datum) do
  Operand^.Cursor := Operand^.Cursor^.Next
because some Pascal processors will try to evaluate the not Test
(Operand^.Cursor^.Datum) condition even after discovering that
Operand^.Cursor <> Nil is false, and the program will crash
when the Nil pointer is dereferenced.
One could avoid both the problem and the goto-statement by
declaring a Boolean variable Continue and writing
Continue := True;
while Continue do
  if Operand^.Cursor = Nil then
    Continue := False
  else if Test (Operand^.Cursor^.Datum) then
    Continue := False
  else
    Operand^.Cursor := Operand^.Cursor^.Next
but this seems just as unnatural and difficult as using the
goto-statement.
I noticed you used labels in the Ratios module. Why was
this the best option? How much work would it have been to not use a label
-- or would it not have been possible to write the procedure without a
label?
I used one label, in the ReadRatio procedure. It could have
been done without the label, by using the value of the Success
parameter to direct the flow of control around the parts of the procedure
to be skipped. Here's what the body of the procedure would look like
without a label; you can judge for yourself whether it's better or worse:
begin { procedure ReadRatio }
  SkipWhiteSpace (Source);
  if EOF (Source) then
    Success := False
  else begin
    { Recover the sign of the ratio. }
    if Source^ = '-' then begin
      S := Negative;
      Get (Source);
      Success := not EOF (Source)
    end
    else if Source^ = '+' then begin
      S := Nonnegative;
      Get (Source);
      Success := not EOF (Source)
    end
    else begin
      S := Nonnegative;
      Success := True
    end;
    if Success then begin
      { Read in the numerator. }
      ReadNatural (Source, N, Success);
      if Success then begin
        { Deal with the slash, if it is present. }
        if EOF (Source) then
          Slash := False
        else
          Slash := (Source^ = '/');
        if Slash then begin
          { Read in the denominator. }
          Get (Source);
          ReadNatural (Source, D, Success);
          if ZeroNatural (D) then begin
            Success := False;
            DeallocateNatural (D)
          end;
          if not Success then
            DeallocateNatural (N)
        end
        else
          D := PascalIntegerToNatural (1);
        if Success then begin
          if ZeroNatural (N) then
            S := Nonnegative;
          Legend := BuildAndReduce (S, N, D, True);
          if Debug then
            Assert (ValidRatio (Legend), InvalidRatioException,
                    RatioExceptionHandler)
        end
      end
    end
  end
end;
The stylistic problem with this code is that the if-statements
are nested so deeply that it's hard to keep track of which assumptions are
in effect at any given point.
Every time I allocate storage for a variable, the program has to, somewhere, create another piece of information that records what type of variable it is. So an integer doesn't take up just one byte, but more than that, including the description of the variable's type, correct?
Under HP Pascal, an integer is normally stored in four bytes, not one. But this has nothing to do with a description of its type.
While the program is being compiled, the pc compiler keeps track of the type of each variable in a data structure, usually an array of records, called a symbol table. This enables the compiler to make sure that the programmer has applied to that variable only operations that are appropriate for its type; the compiler is supposed to stop and report an error if it finds that a variable declared to be of one type is actually used as if it were of a different type.
The symbol table exists during compilation, but the storage associated with the actual variable does not. The compiler, after all, doesn't execute the program; it only tells how to execute it. It keeps track of where the variable will be placed in memory when the program is eventually executed, but it does not actually put anything in that storage location during compilation.
On the other hand, once the compilation is over, the symbol table is no longer needed. The type checking has all been completed. During program execution, the storage in which an integer value is placed does not include any indication that the value stored there is an integer.
There are some programming languages in which each value carries with it an indication of the type it belongs to -- Scheme is one example. The reason is that in Scheme it is possible to postpone most type checking until the program is actually executing; the compiler need not (and in some cases cannot) determine whether the data types match up correctly, because the type of value that is stored in a Scheme variable is sometimes not known until the program is running.
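To make the contrast concrete, here is a rough sketch, written in Pascal itself, of what it means for a value to carry an indication of its type. It uses a variant record whose tag field plays the role of the run-time type indication. The names Kind, TaggedValue, and Show are invented for this illustration, and the sketch is not meant to describe how any particular Scheme system actually lays out its values.

```pascal
program TaggedValueDemo (Output);
{ Hypothetical illustration; all names here are made up. }
type
  Kind = (IntegerKind, CharKind, RealKind);
  TaggedValue = record
    { The Tag field is stored alongside the value and records its type. }
    case Tag: Kind of
      IntegerKind: (IntegerPart: Integer);
      CharKind: (CharPart: Char);
      RealKind: (RealPart: Real)
  end;
var
  V: TaggedValue;

  { Decides at run time, by inspecting the tag, how to treat the value. }
  procedure Show (Item: TaggedValue);
  begin
    case Item.Tag of
      IntegerKind: WriteLn ('integer: ', Item.IntegerPart : 1);
      CharKind: WriteLn ('character: ', Item.CharPart);
      RealKind: WriteLn ('real: ', Item.RealPart : 1 : 2)
    end
  end;

begin
  V.Tag := IntegerKind;
  V.IntegerPart := 42;
  Show (V);

  V.Tag := CharKind;
  V.CharPart := 'x';
  Show (V)
end.
```

An ordinary Pascal Integer variable has no such tag: the extra field, and the extra storage it occupies, exist only because the record declares them.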
Could you give some stats on the MathLAN workstations (e.g., RAM, swap size, hard-disk size, processor speed and type)? If Newton differs, how so?
All but one of the MathLAN workstations are Hewlett-Packard Model 712/60, 9000 Series. The central processor is a PA7100LC, developed by HP, operating at a clock frequency of 60 megahertz. The rating of this processor on the SPECint92 benchmark is 58.1, on the SPECfp92 benchmark, 79. It has a separate memory-management unit that allows virtual memory addressing up to forty-eight bits.
The main memory on each machine consists of two sixteen-megabyte single in-line memory modules (SIMMs), capable of correcting any single-bit error in a byte automatically and detecting simultaneous errors in two bits of a byte. The main memory bus is seventy-two bits wide (sixty-four data bits and eight ``check bits'' for error detection) and can transfer data at a peak rate of 160 megabytes per second.
In addition, each workstation maintains a sixty-four-kilobyte direct-mapped external cache memory, connected to the processor by a sixty-four-bit bus operating at a peak rate of 400 megabytes per second.
Each workstation has a 525-megabyte internal hard disk, two hundred megabytes of which are used for swap space (that is, virtual memory for the workstation); a three-and-a-half-inch drive for 1.44-megabyte floppy disks; a 1280-by-1024-pixel color monitor (with a diagonal measurement of either seventeen or twenty inches); a PC-101 keyboard; and a three-button mouse.
The file server, newton, is a Hewlett-Packard Model 715/75, 9000 Series. Its central processor is a PA7100 operating at a clock frequency of 75 megahertz. The rating of this processor on the SPECint92 benchmark is 61, on the SPECfp92 benchmark, 113. It has 128 megabytes of main memory and a total of eight gigabytes of hard-disk storage, divided over five physical disks. A 600-megabyte CD-ROM drive and a two-gigabyte digital audio tape drive are also attached.
Additional questions (for the compulsively curious)