Characters

Character codes: ASCII and Unicode

A character is a small, repeatable unit within some system of writing -- a letter or a punctuation mark, if the system is alphabetic, or an ideogram in a writing system like Han (Chinese).

When a character is stored in a computer, it must be represented as a sequence of bits -- ``binary digits,'' that is, zeroes and ones. However, the choice of a particular bit sequence to represent a particular character is more or less arbitrary. In the early days of computing, each equipment manufacturer developed one or more ``character codes'' of its own, so that, for example, the capital letter A was represented by the sequence 110001 on an IBM 1401 computer, by 000001 on a Control Data 6600, by 11000001 on an IBM 360, and so on. This made it troublesome to transfer character data from one computer to another, since it was necessary to convert each character from the source machine's encoding to the target machine's encoding. The difficulty was compounded by the fact that different manufacturers supported different characters; all provided the twenty-six capital letters used in writing English and the ten digits used in writing Arabic numerals, but there was much variation in the selection of mathematical symbols, punctuation marks, etc.

In 1963, a number of manufacturers agreed to use the American Standard Code for Information Interchange (ASCII), which is now by far the most widely used character code. It includes representations for ninety-four characters selected from American and Western European text, commercial, and technical scripts: the twenty-six English letters in both upper and lower case, the ten digits, and a miscellaneous selection of punctuation marks, mathematical symbols, commercial symbols, and diacritical marks. (These ninety-four characters are the ones that can be generated by using the forty-seven lighter-colored keys in the typewriter-like part of a MathLAN workstation's keyboard, with or without the simultaneous use of the <Shift> key.) ASCII also reserves a bit sequence for a ``space'' character, and thirty-three bit sequences for so-called control characters, which have various implementation-dependent effects on printing and display devices -- the ``newline'' character that drops the cursor or printing head to the next line, the ``bell'' or ``alert'' character that causes the workstation to beep briefly, and the like.

In ASCII, each character or control character is represented by a sequence of exactly seven bits, and every sequence of seven bits represents a different character or control character. There are therefore 2^7 (that is, 128) ASCII characters altogether.
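The count follows from the fact that each additional bit doubles the number of distinct bit sequences. As a minimal sketch, the arithmetic can be checked in Scheme; code-count is a hypothetical name chosen for this illustration, not a built-in procedure.

```scheme
;; code-count is a hypothetical helper: the number of distinct
;; bit sequences of a given length is 2 raised to that length.
(define (code-count bits)
  (expt 2 bits))

(code-count 7)     ; 128, the size of the seven-bit ASCII code
(code-count 16)    ; 65536, the size of a sixteen-bit code
```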

Over the last quarter-century, non-English-speaking computer users have grown increasingly impatient with the fact that ASCII does not provide many of the characters that are essential in writing other languages. A more recently devised character code, the Unicode Worldwide Character Standard, currently defines bit sequences for 34168 characters for the Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Latin, Lao, Malayalam, Oriya, Tamil, Telugu, and Thai writing systems, as well as a large number of miscellaneous numerical, mathematical, musical, astronomical, religious, technical, and printers' symbols, components of diagrams, and geometric shapes.

Unicode uses a sequence of sixteen bits for each character, allowing for 2^16 (that is, 65536) codes altogether. Many bit sequences are still unassigned and may, in future versions of Unicode, be allocated for some of the numerous writing systems that are not yet supported. The designers are actively working on Burmese, Cherokee, Cree, Ethiopic, Khmer, Maldivian, Mongolian, Moso, Pahawh Hmong, Rong, Sinhalese, Tai Lu, Tai Mau, Tibetan, Tifinagh, and Yi.

Although our local Scheme implementations use and presuppose the ASCII character set, the Scheme language does not require this, and Scheme programmers should try to write their programs in such a way that they could easily be adapted for use with other character sets (particularly Unicode).

Characters in Scheme

In Scheme, a name for any of the characters can be formed by writing #\ before that character. For instance, the expression #\A denotes the capital A, and the expression #\? denotes the question mark. (Control characters, however, usually cannot be named in this way.) In addition, the expression #\space denotes the space character, and #\newline denotes the newline character. So, for instance, (display #\newline), which writes out the newline character, is exactly equivalent to (newline).
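The notation described above might be tried out as follows. This is only a small sketch; the names capital-a, question, and blank are hypothetical, introduced here for illustration.

```scheme
;; Character literals: #\ followed by the character itself,
;; or by one of the special names space and newline.
(define capital-a #\A)
(define question #\?)
(define blank #\space)

(display capital-a)      ; writes A
(display question)       ; writes ?
(display blank)          ; writes a single space
(display #\newline)      ; has the same effect as (newline)

;; write, unlike display, shows the name of the character.
(write capital-a)        ; writes #\A
(newline)
```

Using display shows the character itself, while write shows the #\ form that one would type in a program.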

In any implementation of Scheme, it is assumed that the available characters can be arranged in a linear order (the ``collating sequence'' for the character set), and each character is associated with an integer that specifies its position in that order. In ASCII, the numbers that are associated with characters run from 0 to 127; in Unicode, they lie within the range from 0 to 65535. (Fortunately, Unicode includes all of the ASCII characters and associates with each one the same collating-sequence number that ASCII uses.) Applying the built-in char->integer procedure to a character gives you the collating-sequence number for that character; applying the converse procedure, integer->char, to an integer in the appropriate range gives you the character that has that collating-sequence number.
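A brief sketch of the two conversion procedures follows. The specific numbers in the comments assume an ASCII-based implementation, and round-trip and next-char are hypothetical helper names, not built-in procedures.

```scheme
;; round-trip is a hypothetical helper: converting a character to
;; its collating-sequence number and back should return the
;; original character in any implementation.
(define (round-trip ch)
  (integer->char (char->integer ch)))

;; next-char is a hypothetical helper: the character that follows
;; ch in the collating sequence (assumes ch is not the last one).
(define (next-char ch)
  (integer->char (+ 1 (char->integer ch))))

(char->integer #\?)      ; 63, assuming ASCII
(integer->char 63)       ; #\?, assuming ASCII
(round-trip #\q)         ; #\q
(next-char #\?)          ; #\@, assuming ASCII
```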


Exercise 1

Determine the ASCII collating-sequence numbers for the capital letter A and for the lower-case letter a.


Exercise 2

Find out what ASCII character is in position 38 in the collating sequence.


The importance of the collating-sequence numbers is that they extend the notion of alphabetical order to all the characters. Scheme provides five built-in procedures for comparing characters (char<?, char<=?, char=?, char>=?, and char>?), and they all work by determining which of the two characters comes first in the collating sequence (that is, which one has the lower collating-sequence number).
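One way to picture the relationship between the comparison procedures and the collating-sequence numbers is the sketch below. The name my-char<? is hypothetical; it is not how any particular implementation defines char<?, only a procedure that should agree with it.

```scheme
;; my-char<? is a hypothetical stand-in for the built-in char<?:
;; it compares the collating-sequence numbers directly.
(define (my-char<? left right)
  (< (char->integer left) (char->integer right)))

(my-char<? #\a #\b)    ; #t
(char<? #\a #\b)       ; #t
(char>=? #\b #\b)      ; #t
(char=? #\a #\b)       ; #f
```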

Scheme requires that if you compare two capital letters or two lower-case letters, you'll get standard alphabetical order: (char<? #\A #\Z) must be true, for instance. If you compare a capital letter with a lower-case letter, though, the result depends on the design of the character set. (In ASCII, every capital letter -- even #\Z -- precedes every lower-case letter -- even #\a.) Similarly, if you compare two digit characters, Scheme guarantees that the results will be consistent with numerical order: #\0 precedes #\1, which precedes #\2, and so on. But if you compare a digit with a letter, or anything with a punctuation mark, the results depend on the character set.
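The guarantees can be separated from the character-set-dependent results as in the sketch below. The helper digit-char? is a hypothetical name; it relies on the additional assumption, true in ASCII, that no other characters fall between #\0 and #\9 in the collating sequence.

```scheme
;; Guaranteed by Scheme, whatever the character set:
(char<? #\A #\Z)    ; #t
(char<? #\a #\z)    ; #t
(char<? #\0 #\9)    ; #t

;; Depends on the character set:
(char<? #\Z #\a)    ; #t in ASCII, but not guaranteed elsewhere

;; digit-char? is a hypothetical helper. It assumes that only
;; digits lie between #\0 and #\9, as is the case in ASCII.
(define (digit-char? ch)
  (and (char<=? #\0 ch) (char<=? ch #\9)))

(digit-char? #\5)   ; #t
(digit-char? #\x)   ; #f in ASCII
```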


Exercise 3

Do the digit characters precede or follow the capital letters in the ASCII collating sequence?


Exercise 4

If you were designing a character set, where in the collating sequence would you place the space character? Why? What position does it occupy in ASCII?


Because there are many applications in which it is helpful to ignore the distinction between a capital letter and its lower-case equivalent in comparisons, Scheme also provides case-insensitive versions of the comparison procedures (char-ci<?, char-ci<=?, char-ci=?, char-ci>=?, and char-ci>?), which essentially convert all letters to the same case (in Chez Scheme, lower case) before comparing them.
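The difference between the two families of comparison procedures might be seen in a sketch like the following. The name my-char-ci=? is hypothetical; it illustrates the convert-then-compare idea and is not the actual definition used by any implementation.

```scheme
;; my-char-ci=? is a hypothetical stand-in for char-ci=?:
;; convert both characters to lower case, then compare.
(define (my-char-ci=? left right)
  (char=? (char-downcase left) (char-downcase right)))

(char=? #\a #\A)          ; #f
(char-ci=? #\a #\A)       ; #t
(my-char-ci=? #\a #\A)    ; #t
(char-ci<? #\a #\B)       ; #t
(char<? #\a #\B)          ; #f in ASCII
```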

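Scheme provides several predicates that apply to characters: char-alphabetic?, char-numeric?, char-whitespace?, char-upper-case?, and char-lower-case?. Each takes a single character as its argument and returns #t or #f.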
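A few sample applications of the standard character predicates may make their behavior concrete. The helper char-alphanumeric? is a hypothetical name, shown only to suggest how the built-in predicates can be combined.

```scheme
(char-alphabetic? #\q)        ; #t
(char-alphabetic? #\7)        ; #f
(char-numeric? #\7)           ; #t
(char-whitespace? #\space)    ; #t
(char-upper-case? #\Q)        ; #t
(char-lower-case? #\Q)        ; #f

;; char-alphanumeric? is a hypothetical helper that accepts
;; letters and digits and rejects everything else.
(define (char-alphanumeric? ch)
  (or (char-alphabetic? ch) (char-numeric? ch)))

(char-alphanumeric? #\7)      ; #t
(char-alphanumeric? #\?)      ; #f
```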


Exercise 5

In ASCII, the collating-sequence numbers of the control characters are 0 through 31 and 127. Define a predicate char-control? that returns #t if its argument is a control character, #f otherwise.


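Finally, there are two procedures for converting letters automatically from one case to the other: char-upcase, which returns the capital-letter equivalent of a lower-case letter, and char-downcase, which returns the lower-case equivalent of a capital letter. Either procedure returns any other character unchanged.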
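The standard case-conversion procedures can be tried as in the sketch below. The helper toggle-case is a hypothetical name, included only to show the two procedures working together.

```scheme
(char-upcase #\q)      ; #\Q
(char-downcase #\Q)    ; #\q
(char-upcase #\Q)      ; #\Q, already a capital letter
(char-upcase #\7)      ; #\7, not a letter, so unchanged

;; toggle-case is a hypothetical helper: it switches a letter to
;; the opposite case and leaves every other character alone.
(define (toggle-case ch)
  (cond ((char-upper-case? ch) (char-downcase ch))
        ((char-lower-case? ch) (char-upcase ch))
        (else ch)))

(toggle-case #\q)      ; #\Q
(toggle-case #\Q)      ; #\q
(toggle-case #\?)      ; #\?
```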


This document is available on the World Wide Web as

http://www.math.grin.edu/courses/Scheme/spring-1998/characters.html

created October 8, 1997
last revised June 21, 1998

John David Stone (stone@math.grin.edu)