CSC302 2007S, Class 35: SQL (1): The Relational Model Admin: * EC for tomorrow's convo * EC for tomorrow's CS extra * I really don't like having to record zeros for reading responses. + Even when the reading is hard to grok. Overview: * Context. * Codd's Contributions. * Why Read This Paper? * Thinking About Relations. * Operations on Relations. /Context: What is a Database?/ * A huge collection of data, organized in some way * Sometimes data is fairly structured: * Every student has an ID number, last name, first name, ... * Sometimes data is unstructured * Database community in 1970 grappling with these issues * Needs to be implemented somehow * And there is a close tie between the way we represent the data and the way we implement the "queries" /Detour: A Problem: Rebuilding Data-l/ * What information do we record per student? * ID number * Name (Last, First, Middle, Suffix, ...) * Primary Major (or "Undeclared") * Secondary Major (or Concentration) * Tertiary Concentration * Graduating year * Campus address * Campus phone * Home address * Home phone * List of classes (including semester, grade received, ...) * Implementation * Build an object (record) for each student * Pick an order (e.g., ID), and store them sequentially by that number. SORTED ARRAY * Problem: If the record sizes are different, we lose the benefit of using arrays (in that we cannot naturally figure out how to jump to a particular item, since we cannot compute the offset). * One solution: Make the record sizes uniform by storing pointers rather than than lists for the fields that require lists. * O(1) operations * Change found record * Find a record by ID number (maybe) * O(logn) operations * Find a record by ID number (maybe) * O(n) operations * Insertion * Remove [If we actually remove students] * Find a record by anything other than ID number * List all CS majors (generally, list students who meet some predicate) * O(m), where m is the number of classes * Print info about a found record * Hash table indexed by some useful value * Expected O(1) operations * Find by that useful value * Remove * Insert * O(n) [or worse] operations * Find by something other than the useful value * Listing by something, including the useful value * 4-2 tree or other BST * So, how do we deal with multiple keys? * Make multiple sorted arrays or hash tables or BSTs of POINTERS to the records In 1970, when this paper was written, this was how most people thought about databases * Issue: Common solutions to common problems * Growth of DBMSs * But low level * Problem: Small changes to data organization have huge impact on code Computers in 1970 * Large metal * Cheapest were things like PDPs (10's of thousands, rather than 100's of thousands) * Very little RAM * DATA ON DISK or TAPE or ... * Tape is really hard to do random access on * All Tape algorithms are therefore close to O(n) :-) Computer Science in 1970 * Computer scientists beginning to understand abstraction /Codd's Paper/ * Thesis: Database design should be based on relations * Because we should separate interface from implementation * Because relations can capture the problems we focus on in databases * Because we can formally discuss the meaning of relations and their operations * Claim: "Future users of large data banks must be protected from having to know how the data is organized in the machine" * Evidence: Tree structures and such don't work well * Additional idea; If you change your database, you'll benefit from my suggestions * Goal: Give mechanism for database design