Class Project: Exploring Data

Proposal Due
Monday, Nov 19, 2018 by 10:30pm
Implementation Due
Tuesday, Dec 4, 2018 by 10:30pm
Presentations
Wednesday, Dec 12, 2018 in class
Summary
At this point in your career, you’ve learned a number of techniques for gathering and displaying data. This project is an opportunity for you to explore some techniques in greater depth.
Purposes
To explore some aspect of data science in depth. To emphasize the more creative components of this course. To encourage more purposeful reflection on algorithms.
Collaboration
We encourage you to work in groups of size two. You may, however, work alone or work in a group of size three. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting
Email your submissions to csc151-01-grader@grinnell.edu. The subject of your email should include [CSC 151] Project along with a list of all authors of the project. Your email should also include all appropriate attachments (e.g. your project proposal or all the required files for your final project submission).

Some Resources

  • The project grading rubric tells you a bit more about what we expect you to do in this project and how we will assess your work.
  • A sample project or two may appear in the near future.
  • Kaggle has some interesting open datasets available.

Warning: So that this assignment is a learning experience for everyone, we will almost certainly spend class time publicly critiquing your work.

Assignment

Background: Specification

In this project, your group will identify a data set of interest, design and implement a nontrivial algorithm or algorithms for manipulating the data set and extracting information, run the algorithm, and present the results of the algorithm. The primary intent is for you to demonstrate your mastery of the concepts and skills introduced in the class in a novel way.

General expectations

Reasonable size: Your project should be of a scope that it can be completed by your group with approximately ten hours of work per team member over a two-week period (five hours per week).

Moderately large data: You should identify a moderately large data set, preferably with a few thousand or more data points or with many columns of data to work with. Different problems may require different kinds or sizes of data. For example, if you are working with a group of literary works, you may decide to use only a few dozen, since each work has hundreds of thousands of words.

Nontrivial, novel algorithm: Your project should demonstrate the group’s ability to design and implement a nontrivial algorithm that differs reasonably from any of the algorithms we have already defined. You might combine ideas we have discussed in new ways or you may develop a completely new algorithm.

Alternative outcomes: As you’ve likely discovered, we tend to underestimate how much time it takes to complete a computer project. Hence, your project should have three targets: (a) an intended outcome - what you expect to be able to achieve; (b) a satisficing outcome - something not as complex as the intended outcome, but complex enough that it meets the general expectations for the project; and (c) a reach outcome - something that you can try to achieve if the intended outcome is more straightforward than you expected.

For example, if the project were to use part of the Gutenberg dataset to develop an algorithm that can distinguish between 18th-century and 19th-century British literature, your satisficing goal might be to extract a set of a dozen characteristics from any work that you could then compare to another work, and your reach goal might be to take any two sets of literature and generate a process that distinguishes between them.

Sample categories

“Traditional data analysis”

Convert a data set from some format to CSV. Clean the data. Compute interesting summary statistics (e.g., tallies of different subgroups). Visualize those summaries. The focus of this kind of project is primarily the complexity of the data and the kinds of ways you process more complex data. A creative visualization might also serve as the focus.
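
As a rough illustration of the "clean, then tally" steps, here is a minimal sketch in Python. Python is used only for illustration and is not a course requirement. The function name tally_by_column, the column names, and the sample rows are all hypothetical; a real project would read its own file and need more careful cleaning.

```python
import csv
import io
from collections import Counter

def tally_by_column(csv_text, column):
    """Count how many rows fall into each value of the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        # Light cleaning: trim whitespace and normalize case.
        value = row[column].strip().lower()
        if value:  # skip rows where the field is blank
            counts[value] += 1
    return counts

# Hypothetical sample data, made up for illustration only.
SAMPLE = """name,major
Ada,Math
Grace, math
Alan,CS
Edsger,
"""

tallies = tally_by_column(SAMPLE, "major")
```

Without the cleaning step, "Math" and " math" would be tallied as two different subgroups, which is the kind of problem data cleaning is meant to catch.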

Perceptron learning

Given a training set of data, with each element of the data a vector of numbers classified in one of two sets, generate a vector of weights so that when each vector is multiplied element by element by the weights and the products are summed, elements in one set have a low total score and elements in the other set have a high total score. Your technique will involve repeated refinement of the weight vector. (You can ask your instructors or Google how to refine the weights.)
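
One common way to refine the weights is the classic perceptron update rule: whenever an element is misclassified, nudge each weight in the direction that would have made the score more correct. The sketch below, in Python for illustration only, assumes labels of 0 and 1, a cutoff of zero between "low" and "high" scores, and an extra bias term; none of these details come from the assignment, and the toy data is made up.

```python
def predict(weights, bias, vector):
    """Return 1 if the weighted sum is positive (high score), else 0."""
    total = bias + sum(w * x for w, x in zip(weights, vector))
    return 1 if total > 0 else 0

def train_perceptron(examples, epochs=20, rate=1.0):
    """examples is a list of (vector, label) pairs with label 0 or 1."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in examples:
            error = label - predict(weights, bias, vector)  # -1, 0, or 1
            # On a mistake, shift each weight toward the correct answer.
            weights = [w + rate * error * x for w, x in zip(weights, vector)]
            bias += rate * error
    return weights, bias

# Hypothetical, linearly separable toy data (logical AND).
DATA = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(DATA)
predictions = [predict(weights, bias, v) for v, _ in DATA]
```

This rule only settles on a perfect set of weights when the two sets can be separated by a straight line (or plane); on other data it will keep adjusting until the epochs run out.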

Text generation

After gathering information from a group of texts (e.g., individual word frequency, likelihood of one word following another word (or another sequence of words)), probabilistically generate moderately clear text. The focus of this kind of project is likely the algorithm for successfully gathering data and using that data.
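
A minimal sketch of the "likelihood of one word following another" idea, in Python for illustration only: record every word that follows each word, then generate text by repeatedly picking a follower at random. The function names, the tiny corpus, and the seed are hypothetical; a real project would gather data from whole texts and probably track longer word sequences.

```python
import random
from collections import defaultdict

def build_followers(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    followers = defaultdict(list)
    for current, following in zip(words, words[1:]):
        followers[current].append(following)  # repeats preserve frequency
    return followers

def generate(followers, start, length, rng):
    """Walk the table, picking each next word in proportion to its frequency."""
    result = [start]
    # Stop early if the latest word was never followed by anything.
    while len(result) < length and followers.get(result[-1]):
        result.append(rng.choice(followers[result[-1]]))
    return " ".join(result)

# Hypothetical tiny corpus; a real project would read whole books.
CORPUS = "the cat sat on the mat and the cat ran"
table = build_followers(CORPUS)
sentence = generate(table, "the", 8, random.Random(151))
```

Because a word that appears twice as a follower is stored twice, choosing uniformly from the list is the same as choosing in proportion to observed frequency.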

Text identification

Given two sets of written works (e.g., books by Bronte and Doyle; books from different centuries; books from different genres), identify some distinguishing characteristics (e.g., number of distinct words, sentence length, common or uncommon words) and use those characteristics to classify new works as belonging to one set or the other.
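
One possible shape for such a classifier, sketched in Python for illustration only: compute a few characteristics per work, average them over each set, and assign a new work to the set whose average is closer. The two features, the function names, and the sample sentences are assumptions chosen to keep the example short, and the code assumes every text contains at least one word.

```python
import re

def features(text):
    """Return (average sentence length in words, fraction of distinct words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return (len(words) / len(sentences), len(set(words)) / len(words))

def average_profile(texts):
    """Average each feature over a set of works."""
    rows = [features(t) for t in texts]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def classify(text, profile_a, profile_b):
    """Return "A" or "B", whichever profile is closer (squared distance)."""
    f = features(text)
    dist_a = sum((x - y) ** 2 for x, y in zip(f, profile_a))
    dist_b = sum((x - y) ** 2 for x, y in zip(f, profile_b))
    return "A" if dist_a <= dist_b else "B"

# Hypothetical stand-ins for two sets of works.
SET_A = ["I came. I saw. I won.", "We ran. We hid. We slept."]
SET_B = ["The long road wound slowly through the quiet hills toward the distant sea."]
label = classify("He sat. He ate. He left.",
                 average_profile(SET_A), average_profile(SET_B))
```

Note that the features here are on different scales, so sentence length dominates the distance; a fuller project would likely rescale the characteristics or weight them.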

Data analysis tools for a particular data set

Given a particular data set, develop tools that someone in the first week or two of CSC 151 could use to explore the data set. You might, for example, allow someone to gather trend data from a larger data set or combine data in new ways.
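
For instance, a tool for gathering trend data might be a single helper that groups rows and averages one value per group. The sketch below is in Python for illustration only; the name trend, the row format (a list of dictionaries), and the sample values are hypothetical.

```python
from collections import defaultdict

def trend(rows, group_key, value_key):
    """Return sorted (group, average value) pairs, one per distinct group."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_key]].append(row[value_key])
    return [(group, sum(vals) / len(vals))
            for group, vals in sorted(buckets.items())]

# Hypothetical rows; a real tool would load these from the project's data set.
ROWS = [
    {"year": 2016, "income": 40.0},
    {"year": 2017, "income": 44.0},
    {"year": 2016, "income": 50.0},
]
yearly = trend(ROWS, "year", "income")
```

A beginner could call such a helper with different column names without needing to understand how the grouping works inside.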

Part One: Proposal

Your project proposal describes the core aspects of your project:

  • The general theme of the project. “We are writing a program to distinguish the works of Bronte and Doyle.” “We are writing a program that identifies trends in Twitter posts based on geographical location.” “We are writing a program that visualizes change in income disparity across the past three decades.”
  • The data set or sets with which you are working. You should indicate where the data come from and describe what one “row” of data looks like. If you expect to need to massage or clean the data before processing it, you might explain what transformations you expect to need to do. (You may also have completed those transformations by the time you submit the proposal.)
  • A high-level overview of the algorithm or algorithms you intend to implement.
  • A short description of the intended outcome, satisficing outcome, and reach outcome.

Your proposal should employ correct grammar and spelling. Approximately one or two pages should suffice.

We will do our best to respond to your proposal in a timely manner. However, given other constraints, we may not be able to do so.

Part Two: Project

After finishing your proposal, you should set to work on implementing your design, making sure to meet all of the specifications outlined above.

Your final project should be accompanied by a short report, most likely based on your proposal. Describe the primary goal of the project, the data, and the algorithm you implemented. Conclude with a short description of what your analysis showed (e.g., “the novels of Bronte and Doyle can easily be distinguished based on …” or “by using the technique of … our algorithm is able to correctly classify cancer cells 95% of the time” or “Throughout the past two decades, naming practices in the US have evolved from dominance by a few names to a more equal distribution amongst a much wider set of names”).

Your final project should also be accompanied by a set of straightforward instructions for running the code.

Part Three: Presentation

You will give a quick (3 min) presentation on your project to your peers during the last week of class.

Submitting the Project

Part One

Copy your proposal into the body of an email message and send it to csc151-01-grader@grinnell.edu.

Part Two

Email your program, data set, and project report to csc151-01-grader@grinnell.edu.

Questions

Can we reuse code from the assignments and labs?
You may certainly reuse code from the assignments and labs, provided you cite that code. However, you should make sure that your project goes beyond what you did for the assignment or lab. Hence, you will likely want to extend or otherwise rewrite that code. (Even if you extend or rewrite code, you should still cite its origin and influence.)