Warning: So that this assignment is a learning experience for everyone, we will almost certainly spend class time publicly critiquing your work.
In this project, your group will identify a data set of interest, design and implement one or more nontrivial algorithms for manipulating the data set and extracting information from it, run those algorithms, and present the results. The primary intent is for you to demonstrate, in a novel way, your mastery of the concepts and skills introduced in the class.
Reasonable size: Your project should be of a scope that it can be completed by your group with approximately ten hours of work per team member over a two-week period (five hours per week).
Moderately large data: You should identify a moderately large data set, preferably with a few thousand or more data points or with many columns of data to work with. Different problems may require different kinds or sizes of data. For example, if you are working with a group of literary works, you may decide to use only a few dozen, since each work has hundreds of thousands of words.
Nontrivial, novel algorithm: Your project should demonstrate the group’s ability to design and implement a nontrivial algorithm that differs reasonably from any of the algorithms we have already defined. You might combine ideas we have discussed in new ways or you may develop a completely new algorithm.
Alternative outcomes: As you’ve likely discovered, we tend to underestimate how much time it takes to complete a computer project. Hence, your project should have three targets: (a) an intended outcome - what you expect to be able to achieve; (b) a satisficing outcome - something not as complex as the intended outcome, but complex enough that it meets the general expectations for the project; and (c) a reach outcome - something that you can try to achieve if the intended outcome is more straightforward than you expected.
For example, if the project were to use part of the Gutenberg dataset to develop an algorithm that can distinguish between 18th-century and 19th-century British literature, your satisficing goal might be to extract a set of a dozen characteristics from any work, which you could then compare to those of another work, and your reach goal might be to take any two sets of literature and generate a process that distinguishes between them.
Convert a data set from some format to CSV. Clean the data. Compute interesting summary statistics (e.g., tallies of different subgroups). Visualize those summaries. The focus of this kind of project is primarily the complexity of the data and the kinds of ways you process more complex data. A creative visualization might also serve as the focus.
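As a rough illustration of the "tallies of different subgroups" step, here is a minimal Python sketch. The sample data, the column names, and the helper names `tally_column` and `print_bar_chart` are all hypothetical; a real project would read its own cleaned CSV file and would probably produce a richer visualization.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; a real project would read a file instead.
SAMPLE_CSV = """name,year,major
Ada,2021,CS
Ben,2021,Math
Cy,2022,CS
Dee,2022,CS
"""

def tally_column(csv_text, column):
    """Count how many rows fall into each value of the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column].strip() for row in reader)

def print_bar_chart(tallies):
    """Print a crude text bar chart, largest group first."""
    for value, count in tallies.most_common():
        print(f"{value:<6} {'#' * count} ({count})")

major_tallies = tally_column(SAMPLE_CSV, "major")
print_bar_chart(major_tallies)
```

The same `tally_column` helper works for any column, so tallying by `year` instead of `major` requires no new code.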
Given a training set of data, in which each element is a vector of numbers classified into one of two sets, generate a vector of weights so that the weighted sum of an element (each weight multiplied by the corresponding entry of the vector, then added together) is low for elements in one set and high for elements in the other. Your technique will involve repeated refinement of the weights. (You can ask your instructors or Google how to refine the weights.)
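The assignment does not say which refinement rule to use. One common choice, assumed here purely for illustration, is the perceptron update: whenever an element is misclassified, nudge each weight in the direction that would have made the score more correct. The function names, the learning rate, the extra bias weight, and the tiny training set below are all hypothetical.

```python
def score(weights, inputs):
    """Weighted sum (dot product) of the weights and an input vector."""
    return sum(w * x for w, x in zip(weights, inputs))

def classify(weights, vector):
    """Return 1 (high score) or 0 (low score) for a data vector."""
    # A constant 1.0 is appended so the last weight acts as a bias.
    return 1 if score(weights, list(vector) + [1.0]) > 0 else 0

def train(examples, epochs=1000, rate=0.1):
    """Repeatedly refine the weights using the perceptron update rule.

    examples: list of (vector, label) pairs, where label is 0 or 1.
    """
    weights = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(epochs):
        mistakes = 0
        for vector, label in examples:
            error = label - classify(weights, vector)
            if error != 0:
                mistakes += 1
                inputs = list(vector) + [1.0]
                weights = [w + rate * error * x
                           for w, x in zip(weights, inputs)]
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights

# Hypothetical training set: small vectors are in set 0, large ones in set 1.
TRAINING = [((1.0, 1.0), 0), ((2.0, 1.5), 0),
            ((6.0, 7.0), 1), ((7.0, 6.5), 1)]
weights = train(TRAINING)
print(weights, classify(weights, (6.5, 6.75)))
```

This rule only settles on a fixed set of weights when the two sets can be separated by a straight line (or plane); for messier data, a project would need a stopping rule such as the epoch limit used here.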
After gathering information from a group of texts (e.g., the frequency of individual words, or the likelihood that one word follows another word or sequence of words), probabilistically generate moderately clear text. The focus of this kind of project is likely the algorithm for gathering the data and using it to generate text.
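One simple way to realize "likelihood of one word following another" is a table that maps each word to every word observed after it. The Python sketch below assumes that approach (often called a first-order Markov chain); the names `build_followers` and `generate`, the sample sentence, and the seed are hypothetical.

```python
import random
from collections import defaultdict

def build_followers(words):
    """Map each word to the list of words that follow it in the text.

    Repeats are kept, so a random choice from the list picks followers
    in proportion to how often they occurred.
    """
    followers = defaultdict(list)
    for current, following in zip(words, words[1:]):
        followers[current].append(following)
    return followers

def generate(followers, start, length, rng):
    """Walk the table from a start word, choosing each next word at random."""
    word = start
    output = [word]
    for _ in range(length - 1):
        options = followers.get(word)
        if not options:  # dead end: this word was never followed by anything
            break
        word = rng.choice(options)
        output.append(word)
    return " ".join(output)

SAMPLE_TEXT = "the cat saw the dog and the dog saw the cat"
table = build_followers(SAMPLE_TEXT.split())
print(generate(table, "the", 8, random.Random(151)))
```

Keying the table on pairs or triples of words instead of single words usually gives clearer text, at the cost of needing much more source material.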
Given two sets of written works (e.g., books by Bronte and Doyle; books from different centuries; books from different genres), identify some distinguishing characteristics (e.g., number of distinct words, sentence length, common or uncommon words) and use those characteristics to classify new works as belonging to one set or the other.
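A minimal Python sketch of this idea follows. It assumes two characteristics (average sentence length and the fraction of distinct words) and a nearest-average comparison; none of these choices come from the assignment, and the function names and toy texts are hypothetical stand-ins for real works.

```python
import re

def features(text):
    """Return (average sentence length in words, fraction of distinct words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return (len(words) / len(sentences), len(set(words)) / len(words))

def centroid(texts):
    """Average the feature vectors of a set of works."""
    vectors = [features(t) for t in texts]
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))

def classify(text, centroid_a, centroid_b):
    """Return 'A' or 'B', whichever set's average features are closer."""
    vector = features(text)

    def distance(center):
        return sum((v - c) ** 2 for v, c in zip(vector, center))

    return "A" if distance(centroid_a) <= distance(centroid_b) else "B"

# Hypothetical toy "works": set A uses short sentences, set B long ones.
SET_A = ["I ran. He ran. We ran home.", "She sat. It rained. We left."]
SET_B = ["The long road wound slowly through the quiet hills until it reached the sea.",
         "Nobody in the village could remember a winter that had lasted quite so long as this one."]

center_a, center_b = centroid(SET_A), centroid(SET_B)
print(classify("Go now. Stop there. Run.", center_a, center_b))
```

Because the two characteristics are on very different scales, sentence length dominates the distance here; a fuller project would likely rescale each characteristic before comparing.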
Given a particular data set, develop tools that someone in the first week or two of CSC 151 could use to explore the data set. You might, for example, allow someone to gather trend data from a larger data set or combine data in new ways.
Your project proposal describes the core aspects of your project:
Your proposal should employ correct grammar and spelling. Approximately one or two pages should suffice.
We will do our best to respond to your proposal in a timely manner. However, given other constraints, we may not be able to do so.
After finishing your proposal, you should set to work on implementing your design, making sure to meet all of the specifications outlined above.
Your final project should be accompanied by a short report, most likely based on your proposal. Describe the primary goal of the project, the data, and the algorithm you implemented. Conclude with a short description of what your analysis showed (e.g., “the novels of Bronte and Doyle can easily be distinguished based on …” or “by using the technique of … our algorithm is able to correctly classify cancer cells 95% of the time” or “Throughout the past two decades, naming practices in the US have evolved from dominance by a few names to a more equal distribution amongst a much wider set of names”).
Your final project should also be accompanied by a set of straightforward instructions for running the code.
You will give a quick (3 min) presentation on your project to your peers during the last week of class.
Copy your proposal into the body of an email message and send it to csc151-01-grader@grinnell.edu.
Email your program, data set, and project report to csc151-01-grader@grinnell.edu.