Hash tables and hash functions

A hash table is a data structure that allows elements to be stored in such a way that they can be retrieved, by key, in a constant amount of time, essentially independent of the number of elements in the table.

The basic idea is a variant of the notion of an array. Given an index into an array, one can recover the element stored at that index in constant time, because one can compute the address of the storage location directly from the origin of the array and the index. Similarly, in a hash table, one uses the key to compute the location in the table at which the desired element should be found.

If the range of possible keys is small, the computation is trivial; one simply uses the keys themselves as array subscripts. But there is a problem if the range of possible keys is vastly larger than the number of elements to be stored. For instance, at Grinnell College, student IDs are nine-digit numbers (that is, in the range from 000000000 to 999999999), while there are only about thirteen hundred students. It would not make sense to allocate an array of one billion storage locations just to permit constant-time access to thirteen hundred records.

So one must interpose some computation between the key and the array subscript -- a computation that is typically encapsulated in a hash function that takes keys as arguments and returns subscripts into some more appropriately sized array as values. To find out where in a hash table the element with a given key is stored, one applies the hash function to the key and uses the result as an index into the array.

Of course, since the hash function maps a gigantic range of possible keys into a much smaller range of array subscripts, it can't be one-to-one. On the contrary, it is inevitable that there will be cases in which the hash function assigns the same array subscript to different keys. When the distinct keys of two elements of the hash table are mapped to the same array subscript, a collision occurs. The implementor of a hash table must provide some mechanism for resolving collisions, that is, for finding an alternative storage location for an element that cannot be stored in the (already occupied) position proposed by the hash function.
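
To make the idea of a collision concrete, here is a minimal sketch in Python (used for this illustration only; the handout's own code is Pascal). It assumes the hash function is simply the remainder on division by the table size, and borrows the size 1021 from the TABLE_SIZE constant that appears later in this handout. The two ID values are made up.

```python
TABLE_SIZE = 1021  # a small array, far smaller than the billion possible IDs

def subscript(student_id):
    """Map a nine-digit ID to an array subscript in 0 .. TABLE_SIZE - 1."""
    return student_id % TABLE_SIZE

# Two made-up IDs that differ by exactly TABLE_SIZE must share a subscript.
first_id = 123456789
second_id = first_id + TABLE_SIZE

print(subscript(first_id), subscript(second_id))
```

Both calls print the same subscript, so the second record cannot go where the hash function proposes if the first is already there.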

There are various mechanisms for resolving collisions. The earliest proposal was to use the array subscript returned by the hash function as the starting point for a linear search for an unused location within the table; as soon as the linear search encounters a position that is not already occupied, the incoming element can be inserted. If the end of the table is encountered before an unused location is reached, the search ``wraps around'' to the beginning of the array and continues from there.
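
The linear search with wrap-around can be sketched as follows. This is an illustrative Python fragment, not the implementation given later in this handout; the name insert_linear, the use of None for an unused slot, and the tiny table size of 7 are all assumptions made for the example.

```python
EMPTY = None  # marks an unused slot

def insert_linear(table, key):
    """Insert key using linear probing; return the slot where it landed.

    Assumes at least one slot of the table is unused.
    """
    size = len(table)
    slot = key % size                 # starting point proposed by the hash function
    while table[slot] is not EMPTY:   # occupied: try the next slot
        slot = (slot + 1) % size      # wrap around at the end of the array
    table[slot] = key
    return slot

table = [EMPTY] * 7
slots = [insert_linear(table, key) for key in (6, 13, 20)]
print(slots)   # [6, 0, 1]: all three keys hash to slot 6; the later ones wrap around
```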

This linear probing strategy, however, does not work well, because the data tend to clump together as the table fills up, leading to long stretches of occupied slots separated by sparsely occupied stretches. A better idea, called secondary hashing, applies another hash function to the key to figure out how many positions in the array to jump over, after finding an occupied position, before trying to insert a new element again.
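
The following Python sketch shows the sequence of slots that secondary hashing would examine for a key. The particular second hash function used here (one plus the remainder on division by the table size minus two) is an assumption chosen to resemble the Pascal probe function later in this handout; other choices are possible.

```python
def probe_sequence(key, size):
    """Yield the slots that secondary hashing would examine for key, in order."""
    slot = key % size                 # first hash: where to start
    step = 1 + key % (size - 2)       # second hash: how far to jump (never 0)
    for _ in range(size):
        yield slot
        slot = (slot + step) % size

# Keys 3 and 14 collide at slot 3 in a table of size 11, but then part ways.
print(list(probe_sequence(3, 11))[:4])    # [3, 7, 0, 4]
print(list(probe_sequence(14, 11))[:4])   # [3, 9, 4, 10]
```

Because the two keys jump by different amounts after their initial collision, they do not keep competing for the same run of slots, which is what reduces clumping.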

Still another idea -- perhaps the one most frequently used today -- is to implement the hash table, not as an array of elements, but as an array of lists of elements. The hash function is applied to the key to determine which of these lists the new element should be added to; in the event of a collision, one simply puts all of the elements that hash to the same array subscript into the same list, or bucket, as it is sometimes called. As compared with linear probing and secondary hashing, this method has the advantage that it can if necessary accommodate more elements than there are positions in the array, though with a progressive degradation of performance as the average list grows longer and the linear search down such a list comes to occupy a larger fraction of the running time.
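
A bucket-based table might be sketched like this in Python. The class name BucketTable and its methods are hypothetical, and the hash function is assumed to be the remainder on division by the number of buckets.

```python
class BucketTable:
    """Hash table implemented as an array of lists (buckets)."""

    def __init__(self, size):
        self.buckets = [[] for _ in range(size)]

    def insert(self, key, value):
        # Colliding keys simply share a bucket. Assumes key is not already present.
        self.buckets[key % len(self.buckets)].append((key, value))

    def search(self, sought):
        # Linear search, but only through the one bucket the key hashes to.
        for key, value in self.buckets[sought % len(self.buckets)]:
            if key == sought:
                return value
        return None

table = BucketTable(5)
for key in range(12):              # twelve elements in a five-slot array
    table.insert(key, "record %d" % key)
print(table.search(11), [len(b) for b in table.buckets])
```

The table holds twelve elements although the array has only five positions; the price is that each search now walks a list of two or three entries.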

Also, it is far easier to delete an element from a hash table that uses buckets than from one that uses linear probing or secondary hashing as its collision-resolution mechanism. In many applications, however, deletions are never needed, or can be saved up and performed at a time when the hash table must be completely rebuilt anyway.
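
One way to see why deletion is awkward under linear probing is to try the obvious approach of simply marking the slot unused. The Python sketch below is an illustration under assumed names (insert, search, and a table of size 7); it is not taken from the implementation in this handout.

```python
EMPTY = None

def insert(table, key):
    slot = key % len(table)
    while table[slot] is not EMPTY:
        slot = (slot + 1) % len(table)
    table[slot] = key

def search(table, sought):
    """Return True if sought is found before an unused slot ends the search."""
    slot = sought % len(table)
    while table[slot] is not EMPTY:
        if table[slot] == sought:
            return True
        slot = (slot + 1) % len(table)
    return False

table = [EMPTY] * 7
insert(table, 3)    # lands in slot 3
insert(table, 10)   # also hashes to 3, so it is pushed to slot 4
table[3] = EMPTY    # naive deletion of key 3: just mark its slot unused
print(search(table, 10))   # False: 10 is still stored, but the search stops at slot 3

# With buckets, deletion is an ordinary list removal and disturbs nothing else.
bucket = [3, 10]
bucket.remove(3)
print(10 in bucket)        # True
```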

Here is what the interface for a hash-table data type would look like. In this implementation, the keys are non-negative integers; later on in this handout we'll look at how one might have to modify the hash-table package to accommodate keys of other types.

type
  key_type = integer;
  element = record
              key: key_type;
              { and presumably other fields as well }
            end;
  hash_table = { implementation-dependent };

{ The empty_hash_table function creates and returns a hash table with
  nothing stored in it, ready for insertions. }

function empty_hash_table: hash_table;

{ The is_full_hash_table function determines whether the hash table is
  completely full.  It is important to make this test before attempting
  to do an insertion. }

function is_full_hash_table (t: hash_table): Boolean;

{ The insert_in_hash_table procedure adds a given element to a given
  hash table.  It is an error to invoke this procedure when the hash
  table is full or when another element with the same key is already
  in the hash table. }

procedure insert_in_hash_table (var t: hash_table; entry: element);

{ The search_in_hash_table function looks in a given hash table for
  an element with a specified key, returning TRUE if it finds such an
  element and FALSE if it does not.  In addition, if the search is
  successful, the entire element is returned through the variable
  parameter entry. }

function search_in_hash_table (t: hash_table; sought: key_type;
  var entry: element): Boolean;

{ The deallocate_hash_table procedure disposes of all the storage
  associated with the hash table, leaving its argument undefined. }

procedure deallocate_hash_table (var t: hash_table);

If a bucket implementation is used, of course, the is_full_hash_table function will always return FALSE. In all other cases, the hash table can fill up, and it would be appropriate for the programmer to export a constant that tells how many slots are available:

const
  TABLE_SIZE = 1021;
    { the number of slots in the hash table; the table is full when it
      contains TABLE_SIZE - 1 elements }

For linear probing and secondary hashing, it is important that there be at least one unused storage location in the table at all times; otherwise, a search that is supposed to terminate when it finds such a storage location will run forever.

Also, there must be some way to distinguish an empty storage location from an occupied one. If bucketing is used, an empty storage location will simply be one in which the list of elements is null. Otherwise, we'll store in each empty location a dummy element with a negative number as its key, so that it can't be mistaken for a valid element.

const
  NULL_KEY = -1;
    { a conventional indication of an unoccupied slot in the hash table;
      the real keys should be non-negative }

When storing records into a hash table, one needs a function that takes the record's key as an argument and returns a value that is an index of the hash table. (When the entries in a hash table are simple values rather than records, each value serves as its own key.) There are two constraints on this function: (1) Since it will be invoked very frequently, it should be simple and fast. (2) Since hash tables work best when the collision rate is low, the hash function should ``randomize'' the keys; in other words, the values it produces should not conform to any pattern that may characterize the keys.

When the keys are integers and their range is many times larger than the range of hash table indices, a simple and very commonly used hash function divides each key by the size of the hash table and returns the remainder (adding 1 if the array subscripts start at 1 rather than at 0). In many applications, this method has too high a collision rate if the hash table size has any small divisors, so it is customary to choose a prime number as the size of the hash table. This also simplifies secondary hashing, if it is used, by ensuring that an iterated linear transformation will generate all the hash table indices before repeating.
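
The effect of small divisors is easy to demonstrate. In the Python sketch below, the keys are assumed to follow a simple pattern (every one is a multiple of 10), and the helper name slots_used is hypothetical.

```python
def slots_used(keys, table_size):
    """Count how many distinct subscripts the division method produces."""
    return len({key % table_size for key in keys})

keys = range(0, 1000, 10)          # 100 keys that are all multiples of 10

print(slots_used(keys, 100))       # 10: the size 100 shares the divisors 2 and 5 with the keys
print(slots_used(keys, 101))       # 100: with a prime size, every key gets its own slot
```

With a table of size 100, the hundred keys crowd into only ten slots; with the prime size 101, no two of them collide.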

{ The hash function maps any non-negative integer key into some position
  within the hash table, in a pseudo-random way. }

function hash (key: key_type): position_number;
begin
  assert (0 <= key, NEGATIVE_KEY_EXCEPTION, hash_table_handler);
  hash := key mod TABLE_SIZE + 1
end;

Secondary hashing is usually implemented in such a way that each of the values produced by the second hash function is relatively prime to the size of the table, so that if the second hash function is invoked repeatedly it will produce a succession of array subscripts that does not begin to repeat itself until all of the possible array subscripts have appeared in the sequence. To ensure this property, the implementation presented here requires that both TABLE_SIZE and TABLE_SIZE - 2 be prime numbers. When this is done, the same basic idea can be used to define the secondary hash function:

{ The probe function implements a pseudo-random permutation of the
  positions in the hash table; given any position, it yields another
  position, using a formula that makes the result dependent on the value
  of the key (assumed to be a non-negative integer). }

function probe (position: position_number; sought: key_type):
  position_number;
var
  hashedkey: integer;
    { an independent pseudo-random number derived from the search key }
begin
  assert (0 <= sought, NEGATIVE_KEY_EXCEPTION, hash_table_handler);
  hashedkey := sought mod (TABLE_SIZE - 2) + 1;
  probe := (position + hashedkey) mod TABLE_SIZE + 1
end;
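
The claim that the probe sequence reaches every position before repeating can be checked by direct simulation. The Python sketch below transliterates the two Pascal functions above (the name hash_position stands in for hash, and positions_visited is a hypothetical helper); the key used is an arbitrary made-up value.

```python
TABLE_SIZE = 1021   # prime, and TABLE_SIZE - 2 = 1019 is prime as well

def hash_position(key):
    return key % TABLE_SIZE + 1       # positions run from 1 to TABLE_SIZE

def probe(position, sought):
    hashedkey = sought % (TABLE_SIZE - 2) + 1
    return (position + hashedkey) % TABLE_SIZE + 1

def positions_visited(sought):
    """Follow the probe sequence for one key until it returns to its start."""
    start = hash_position(sought)
    seen = [start]
    position = probe(start, sought)
    while position != start:
        seen.append(position)
        position = probe(position, sought)
    return seen

visited = positions_visited(123456789)
print(len(visited), min(visited), max(visited))   # 1021 1 1021
```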

Here is how the functions and procedures described in the interface can be implemented with the help of these hash functions:

type
  position_number = 1 .. TABLE_SIZE;
    { range of position numbers in the hash table }
  load_range = 0 .. TABLE_SIZE - 1;
    { A hash table's ``load'' is the number of elements stored in it;
      this is the range of possible loads in this implementation.  The
      maximum load is TABLE_SIZE - 1 rather than TABLE_SIZE so that there
      is always one empty slot to terminate unsuccessful searches. }
  table = record
            arr: array [position_number] of element;
            load: load_range
          end;
    { The 'load' field keeps track of the number of positions actually in
      use; load = 0 for an empty table, load = TABLE_SIZE - 1 for a full
      one. }
  hash_table = ^table;

{ The found function determines whether an element with a given key is
  present in a given hash table; the variable parameter position is set
  to the position within the table occupied by the element sought, if it
  is present, or to an empty position appropriate for inserting a new
  element with the specified key, if none is present. }

function found (t: hash_table; sought: key_type;
  var position: position_number): Boolean;
var
  looking: Boolean;
    { indicates whether the search is to continue beyond the present
      position } 
begin
  assert (0 <= sought, NEGATIVE_KEY_EXCEPTION, hash_table_handler);
  with t^ do begin
    looking := TRUE;
    position := hash (sought);
    while looking do
      if arr[position].key = sought then begin
        found := TRUE;
        looking := FALSE
      end
      else if arr[position].key = NULL_KEY then begin
        found := FALSE;
        looking := FALSE
      end
      else
        position := probe (position, sought)
  end
end;

function empty_hash_table: hash_table;
var
  result: hash_table;
    { the hash table under construction }
  index: position_number;
    { counts off the positions in the hash table }
begin
  new (result);
  with result^ do begin
    load := 0;
    for index := 1 to TABLE_SIZE do
      arr[index].key := NULL_KEY
  end;
  empty_hash_table := result
end;

function is_full_hash_table (t: hash_table): Boolean;
begin
  is_full_hash_table := (t^.load = TABLE_SIZE - 1)
end;

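procedure insert_in_hash_table (var t: hash_table; entry: element);
var
  position: position_number;
    { the position at which the new entry is to be inserted }
  present: Boolean;
    { indicates whether the key of the new entry is already in the table }
begin
  assert (t^.load < TABLE_SIZE - 1, FULL_TABLE_EXCEPTION,
          hash_table_handler);
  { The found function is called outside the assertion because it is
    needed for its side effect of setting position, not just for the
    Boolean value that it returns. }
  present := found (t, entry.key, position);
  assert (not present, DUPLICATE_KEY_EXCEPTION, hash_table_handler);
  with t^ do begin
    arr[position] := entry;
    load := load + 1
  end
end;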

function search_in_hash_table (t: hash_table; sought: key_type;
  var entry: element): Boolean;
var
  position: position_number;
    { the position at which the item is found }
begin
  assert (0 <= sought, NEGATIVE_KEY_EXCEPTION, hash_table_handler);
  if found (t, sought, position) then begin
    entry := t^.arr[position];
    search_in_hash_table := TRUE
  end
  else
    search_in_hash_table := FALSE
end;

procedure deallocate_hash_table (var t: hash_table);
begin
  dispose (t);
  t := NIL
end;

When the keys are real numbers, or when they are integers in a relatively narrow range, another common hash function involves multiplying by some irrational number, discarding the integer part of the result, and then multiplying by the size of the hash table and discarding the fractional part. One frequently chosen irrational is phi, the limit of the ratios between successive Fibonacci numbers, (1 + sqrt(5))/2.
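
A minimal Python sketch of this multiplicative method follows. The function name multiplicative_hash is hypothetical, subscripts are assumed to start at 0, and floating-point arithmetic stands in for exact multiplication by an irrational number.

```python
import math

PHI = (1 + math.sqrt(5)) / 2      # the golden ratio, about 1.618

def multiplicative_hash(key, table_size):
    """Multiply by PHI, keep the fractional part, scale it to 0 .. table_size - 1."""
    fraction = (key * PHI) % 1.0              # discard the integer part
    slot = int(fraction * table_size)         # discard the fractional part
    return min(slot, table_size - 1)          # guard against floating-point rounding

print([multiplicative_hash(key, 10) for key in range(1, 6)])   # [6, 2, 8, 4, 0]
```

Consecutive keys, which the division method would place in adjacent slots, are scattered across the table.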

The only other common case is that the key is a character string. A hash function for strings usually works by converting the key into an integer or real value and then applying one of the preceding techniques. Summing the ordinal values of the characters in the string often fails to disperse the keys sufficiently, since strings that contain the same characters in a different order all receive the same sum. Here are two better methods: (1) If the number of entries is typically small, use a hash table of size 128, and compute the hash function by treating each character as a string of seven bits, performing a bitwise exclusive-or operation to combine it with a ``running total'' and doing a one-bit circular shift on the result (moving each bit one position leftwards and then removing the leftmost bit and placing it at the right end) after each such operation. (2) Add each character's ordinal value to a running total, multiplying by some constant (perhaps phi) after each addition; then discard the integer part of the result, multiply by the size of the hash table, and discard the fractional part.
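
Both string methods can be sketched in Python as follows. The function names are hypothetical, subscripts are assumed to start at 0, and the multiplier in the second function is a decimal approximation of phi.

```python
def xor_rotate_hash(text):
    """Method (1): 7-bit exclusive-or with a one-bit circular left shift.

    Returns a subscript in 0 .. 127, for a table of size 128.
    """
    total = 0
    for ch in text:
        total ^= ord(ch) & 0x7F                       # treat the character as seven bits
        total = ((total << 1) | (total >> 6)) & 0x7F  # rotate left by one bit
    return total

def scaled_sum_hash(text, table_size, multiplier=1.6180339887):
    """Method (2): add and multiply, then scale the fractional part."""
    total = 0.0
    for ch in text:
        total = (total + ord(ch)) * multiplier
    slot = int((total % 1.0) * table_size)
    return min(slot, table_size - 1)                  # guard against rounding

# "begin" and "being" have the same character sum, yet method (1)
# assigns them different subscripts.
print(xor_rotate_hash("begin"), xor_rotate_hash("being"))   # 64 62
print(scaled_sum_hash("begin", 1021), scaled_sum_hash("being", 1021))
```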

Recently, one active area of research in computer science has been devising algorithms to find hash functions that are tailored to specific values, in the sense that among those particular values no collisions whatever will take place. For instance, one might design such a function for Pascal's reserved words and predefined identifiers, so that, when a compiler's symbol table is implemented as a hash table, these frequently occurring strings will not cause unnecessary collisions.
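
As a rough illustration of the goal (not of the research algorithms themselves), the Python sketch below finds by brute force a table size for which the division method produces no collisions among a fixed set of words. The word list is a subset of Pascal's reserved words; the function names and the base-128 encoding are assumptions made for the example.

```python
RESERVED = ["and", "array", "begin", "case", "const", "div", "do", "else",
            "end", "for", "function", "if", "mod", "not", "of", "or",
            "procedure", "program", "record", "repeat", "then", "to",
            "type", "until", "var", "while", "with"]

def as_integer(word):
    """Read the word as a base-128 numeral, so distinct ASCII words differ."""
    value = 0
    for ch in word:
        value = value * 128 + ord(ch)
    return value

def collision_free_size(words):
    """Return the smallest table size for which no two words share a subscript."""
    size = len(words)
    while len({as_integer(w) % size for w in words}) < len(words):
        size += 1
    return size

size = collision_free_size(RESERVED)
print(size, len(RESERVED))
```

The search is guaranteed to stop, since any size larger than every encoded word leaves the words distinct, but the table it finds may be considerably larger than the number of words. Producing compact collision-free functions efficiently is what makes the problem a research topic.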


This document is available on the World Wide Web as

http://www.math.grin.edu/~stone/courses/fundamentals/hash-tables.html

created May 5, 1996
last revised May 5, 1996