CSC153 2004S, Class 52: Hash Tables Admin: * Questions on the homework? Overview: * Reminder: Philosophy of hash tables * Designing hash functions * Methods, revisited * Dealing with conflicts /Philosophy/ * Hash tables implement the dictionary/Map ADT * Key operations are set(key,value), get(key) * "Arrays indexed by objects" * Hash tables are implemented by converting objects into integers and using them as array indices * Potential problem: Conflicts in which two different objects have the same integer /Hash Functions/ * Designing good hash functions. * As an example: Strings * Criteria for a good hash function? * Efficient. O(1) to compute. * When two objects are different, it should be "unlikely" that the two objects have the same hash value. (Hmmm ... how do we evaluate this characteristic?) * Given the same object, always returns the same value. * Two equal objects have the same hash values (e.g., a.equals(b) implies a.hashCode() == b.hashCode()) * Unlikely to "miss" any hash values. /Hashing Exercise/ Hash function for strings: Convert each letter to a number, sum the numbers * (mod by the table size to figure out where it goes) A: 1 F: 6 K: 11 P: 16 U: 21 Z: 26 B: 2 G: 7 L: 12 Q: 17 V: 22 C: 3 H: 8 M: 13 R: 18 W: 23 D: 4 I: 9 N: 14 S: 19 X: 24 E: 5 J: 10 O: 15 T: 20 Y: 25 Compute: * The hash value of your first name. * That value mod 16 (big hash table for 8 people) Samuel: 19 + 1 + 13 + 21 + 5 + 12 = 71 ; 7 Erik: 43; 11 Saul: 53; 5 Andrew: 44; 12 Joe: 30; 14 Brian: 44; 12 Ilan: 36; 4 Chase: 36; 4 Observations/Comments: * This worked out particularly badly for our class. * Erik didn't like it to start with. * Any time you just "add up" things, it's more likely you'll get a "middling" sum (e.g., ten-letter names should be close to 135) * Does not cover many of the integers, except for very long strings. * For a really long string, you might overload the integers. (Such strings are unlikely.) (Good computation can help avoid this problem.) * The cost of computing the hash code depends on the length of the string. How do we improve this function? (Or come up with a better one?) * Make each character a digit (or a pair of digits) and string 'em together. * 1 * (char 1) + 32 * (char 2) + 32*32 * char (3) * Use primes to grow quickly * prime^(char 1) + prime^(char 2) + ... * Use only the first N characters/pieces * Rely on other hash functions How would you test a hash function? * Do it a whole lot and look at the distribution. /Methods related to hash tables/ * See java.whatever * What is a load factor? * Observation: Hash tables perform less well as they get more full; When a hash table is sufficiently full, you expand the array and rehash everything; The load factor determines what "sufficiently full" means. * Note: Expansion can slow down the running time of add/set. How do we deal with the possibility that add/set are no longer constant time operations? Claim: The "amortized" running time of set/add is still O(1) provided you don't have too many collisions. /Conflicts/ * Looks up in array by hash value, but only returns if the key matches * Problem (conflict) if two keys have the same "moded" hash value. Two standard techniques for dealing with conflicts between keys * Add another search structure at each entry in the array (list or vector is probably the best bet, since the stuff will be relatively small) * Move to the next "open" position in the hash table and put it there * For put: If the current space is full, keep looking at the next space until nothing is there * For get: If the first space does not match, keep looking at the enxt space until you find the desired key (success) or null (failure) Observe: If the load factor is 1/2, most of the time you'll only look one or two spaces * Removal is a pain The first technique is usually better.