Assignment 5 – Processing words and phrases

Assigned
Wednesday, Oct 3, 2018
Due
Tuesday, Oct 9, 2018 by 10:30pm
Summary
For this assignment, you will put your data science skills to work by processing string data and files.
Collaboration
You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting
Email your answers to csc151-01-grader@grinnell.edu. The subject of your email should be [CSC151 01] Assignment 5, and the email should contain your answers to all parts of the assignment. Please send your Scheme code in an attached .rkt file.

Problem 1: Parsing words

Topics: Strings

When processing textual data, it is often useful to extract the individual words from a sentence, a paragraph, or even an entire transcribed speech. The file->words procedure does something like this, but it only works on an entire file. If we want to perform a similar operation on a single string, we are stuck.

Write and document a procedure (string->words str) that takes a single string and returns a list of the individual words in str. The procedure should return the words in lowercase, with all punctuation and digits removed. For example,

> (string->words "The quick brown fox jumps over the lazy dog.")
'("the" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog")
> (string->words "That's funny. I wasn't aware of that!")
'("thats" "funny" "i" "wasnt" "aware" "of" "that")
> (string->words "Hey! The number 5 times 4 isn't equal to 30!")
'("hey" "the" "number" "times" "isnt" "equal" "to")

Hint: you might find the string-split, string-trim, char-alphabetic?, and string-downcase procedures helpful in this problem.
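
As a rough illustration of the hint, the sketch below shows what each of the four procedures does on its own, assuming the standard Racket versions. The variable names are made up for this example, and combining the procedures into string->words is left to you.

```racket
#lang racket
;; Illustrative only: each hinted procedure in isolation.
;; All names defined here are hypothetical, not part of the assignment.

;; With no separator argument, string-split splits on runs of whitespace.
(define pieces (string-split "Hey!  The number 5"))
;; pieces is '("Hey!" "The" "number" "5")

;; string-downcase lowercases letters and leaves other characters alone.
(define lowered (string-downcase "Hey!"))
;; lowered is "hey!"

;; string-trim removes whitespace from both ends by default;
;; an optional second argument says what to trim instead.
(define trimmed (string-trim "  dog.  "))
;; trimmed is "dog."
(define no-period (string-trim "dog." "."))
;; no-period is "dog"

;; char-alphabetic? is a predicate on single characters.
(define a-is-letter (char-alphabetic? #\a))      ; #t
(define five-is-letter (char-alphabetic? #\5))   ; #f
(define quote-is-letter (char-alphabetic? #\'))  ; #f
```

Note that string-trim only looks at the ends of a string, so it cannot remove an apostrophe in the middle of a word.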

Note: You need only remove spaces, normal punctuation (periods, commas, colons, semicolons, exclamation points, and question marks), and digits. As the examples suggest, apostrophes are treated slightly differently: they are removed without splitting the word, so "wasn't" becomes "wasnt".

Disclaimer: Subsequent problems depend on your solution to this problem. If you do not finish this problem to your satisfaction, you may use (string-split str) as a satisfactory means of extracting words from strings in the following problems.

Problem 2: Comparing word usage in presidential debates

Topics: Files, Heterogeneous lists and tables, Strings

All presidential debates are transcribed and recorded. For example, Kaggle has a dataset of the presidential debates in 2016. We’ve downloaded the CSV dataset for you and named it us-debates.csv. Make a copy of the file (right-click and “Save As”) and familiarize yourself with the data. You will likely find that us-debates.csv has some oddities. For example, the first row of the file contains header information. Furthermore, the candidates column documents the names of the speaking candidates, but it contains candidates named "QUESTION", "Audience", and "CANDIDATES" as well.

a. Define a variable us-debate-table that contains the parsed table from us-debates.csv, but with the header row removed and rows associated with the "QUESTION", "Audience", and "CANDIDATES" candidates filtered out.

b. Document and write a procedure, (total-words-by-candidate table), that takes a table in the format of us-debate-table and returns a two-column table that contains one row for each candidate. The output table should be of the form:

'((candidate1 total-words1)
  (candidate2 total-words2)
  ...)

where the first column is the name of a candidate (a string) and the second column is the total number of words spoken by that candidate. By “total number”, we mean all words, from all responses, from all debates, in the entire table. The output should also be sorted in descending order by the total words.
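
To make "sorted in descending order by the total words" concrete, here is a minimal sketch that sorts a small two-column table by its second column. The table contents and the names sample-table and sorted-table are invented for illustration, and this is only one of several ways to express the ordering.

```racket
#lang racket
;; Hypothetical two-column table; the names and counts are made up.
(define sample-table
  '(("Speaker A" 120)
    ("Speaker B" 450)
    ("Speaker C" 300)))

;; sort takes a list and a "comes before" test on two elements.
;; Here a row comes first when its second column is larger.
(define sorted-table
  (sort sample-table
        (lambda (row1 row2)
          (> (cadr row1) (cadr row2)))))

sorted-table
;; '(("Speaker B" 450) ("Speaker C" 300) ("Speaker A" 120))
```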

Execute total-words-by-candidate on your us-debate-table variable.

c. Document and write a procedure, (average-word-length-by-candidate table), that takes a table in the format of us-debate-table and returns a table of the form:

'((candidate1 avg-word-len1)
  (candidate2 avg-word-len2)
  ...)

where the first column is the name of a candidate and the second column is the average length of words used by that candidate. Again, the output should be sorted in descending order by the average length of words.

Execute average-word-length-by-candidate on your us-debate-table variable.

Problem 3: Frequently used words and phrases in debates

Topics: Files, Heterogeneous lists and tables, Strings

a. Document and write a procedure, (frequently-used-words table speaker), that takes a table in the format of us-debate-table and a string name of a candidate or moderator and returns a list of word tallies sorted in descending order by count.

Run frequently-used-words to find the top 5 words used by several candidates and moderators in us-debate-table.

b. The usage of short phrases, such as three adjacent words, can be more telling than single word usage. Document and write a procedure, (frequently-used-phrases table speaker), that takes a table in the format of us-debate-table and a string name of a candidate or moderator and returns a list of three-word phrase tallies sorted in descending order by count. For example,

> (take (frequently-used-phrases us-debate-table "Clinton") 5)
'(("a lot of" 42) ("i want to" 27) ("we have to" 24) ("were going to" 23) ("we need to" 23))
> (take (frequently-used-phrases us-debate-table "Cooper") 5)
'(("you mr trump" 7) ("have two minutes" 6) ("thank you mr" 6) ("you have two" 5) ("allow her to" 5))
> (take (frequently-used-phrases us-debate-table "Trump") 5)
'(("we have to" 50) ("were going to" 39) ("a lot of" 30) ("you look at" 26) ("take a look" 24))
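
Two small pieces of the examples above may be unfamiliar, so here is a hedged sketch of each, assuming standard Racket: string-join, which can turn three words into a single phrase string, and take, which keeps the first few elements of a list. The tallies below are placeholders, not real debate data, and all names are hypothetical.

```racket
#lang racket
;; With no separator argument, string-join puts a single space between strings.
(define phrase (string-join '("a" "lot" "of")))
;; phrase is "a lot of"

;; Placeholder tallies, already in descending order by count.
(define sample-tallies
  '(("phrase one" 3) ("phrase two" 2) ("phrase three" 1)))

;; take keeps the first n elements of a list.
(define top-two (take sample-tallies 2))
;; top-two is '(("phrase one" 3) ("phrase two" 2))
```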

Run frequently-used-phrases to find the top 5 phrases used by several candidates and moderators in us-debate-table.

Problem 4: Categorizing words used in debates

Topics: Files, Strings, Conditionals, Tallying

We’ve seen a number of ways to categorize words. They may be short or long. They may start with or contain certain letters. They may contain repeated letters. They may be near other words. They may be common. They may be uncommon.

First, pick and describe between six and ten categories. Then, document and write a procedure, (categorize-word word), that gives the category for a word as a string. You should use “uncategorized” for words that do not fit into your categories. For example,

> (categorize-word "aardvark")
"starts-with-vowel"
> (categorize-word "jabberwocky")
"Carrollian"
> (categorize-word "defenestrate")
"uncommon"
> (categorize-word "madam")
"palindrome"
> (categorize-word "Grinnell")
"proper-name"
> (categorize-word "elephant")
"starts-with-vowel"
> (categorize-word "me")
"short"
> (categorize-word "hello")
"uncategorized"

If a word falls into multiple categories, your procedure should return only one of them.

Next, document and write a procedure, (categorize-debate-words table speaker), that takes a table in the format of us-debate-table and a string name of a candidate or moderator and returns a list of tallies of how many words in each category the speaker used in the table.

Finally, categorize several candidates from us-debate-table and see whether the categorization tells you anything about how they speak.

Note: You will almost certainly use a conditional to write categorize-word.
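
As a sketch of the shape such a conditional might take, the example below classifies words into two made-up categories plus "uncategorized". The procedure name sample-category and its categories are hypothetical and deliberately too few to count as a solution. Because cond returns the result of the first test that holds, the order of the clauses decides which category wins when a word fits more than one.

```racket
#lang racket
;; Hypothetical classifier with two categories, shown only to
;; illustrate the structure of a cond.
(define (sample-category word)
  (cond
    ;; Checked first, so short words never reach the later tests.
    [(<= (string-length word) 3) "short"]
    ;; Safe to look at the first character: word has at least 4 here.
    [(char=? (string-ref word 0) #\q) "starts-with-q"]
    ;; Fallback when no category applies.
    [else "uncategorized"]))

(sample-category "me")     ; "short"
(sample-category "quick")  ; "starts-with-q"
(sample-category "qi")     ; "short", since the first matching clause wins
(sample-category "hello")  ; "uncategorized"
```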

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it achieves its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly).