# Lab: GPU Galaxies

**Assigned:** Tuesday, Apr 10, 2018

**Due:** Friday, Apr 20, 2018 by 10:30pm

**Collaboration:** Work with your assigned partner for this lab. You can use online resources for your lab, provided they do not provide complete answers and you cite them in your code. If you do not know whether it is acceptable to use a specific resource, you can always ask me.

## Overview

For this lab, you will parallelize a simple n-body simulation using the GPU. An n-body simulation models the movement of n massive objects acting on each other gravitationally. Every pair of stars has some gravitational effect, and to compute the total effect on each star we have to consider all pairs of stars. This results in a lot of repeated computation, which is a good target for parallelization, especially for GPUs where you perform the same task for a large number of input values.

The provided code runs in a single thread on the CPU. Your task is to modify the system to perform the physics calculations on the GPU instead of the CPU, using a single thread for each “star” in the simulation.


## Part A: Getting Started

Before you start changing things, make sure you can run the program. Once the window appears, click a few times to see what this program does. You should also review the code in main.c. You are welcome to look in gui.c and gui.h as well, but these files should not need to change in your CUDA implementation.

The first step in porting this code to run the physics calculations on the GPU is to convert it to a CUDA program. This should be fairly easy; rename the two .c files to .cu files and update the Makefile to build with nvcc instead of clang.

Move your .c files to .cu files with git mv instead of just mv so they are still tracked by git. It’s okay if you forget to do this; in that case, use git add to add the .cu files to your repository.

You may want to refer back to the mini-labs on the GPU to see how to invoke nvcc with all the relevant options. Make sure you leave the $(shell sdl2-config ...) lines in the settings for CFLAGS and LDFLAGS; these set up the required options for SDL2, the library this program uses to display a window and draw graphics.
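As a rough sketch, the Makefile changes might look something like the fragment below. The target and file names are taken from this lab, but your existing Makefile's variable names and extra nvcc options (from the GPU mini-labs) may differ, so adapt rather than copy:

```make
# Hypothetical Makefile fragment -- adapt to your existing Makefile.
CC      := nvcc
CFLAGS  := -g $(shell sdl2-config --cflags)
LDFLAGS := $(shell sdl2-config --libs) -lm

galaxy: main.cu gui.cu gui.h
	$(CC) $(CFLAGS) -o galaxy main.cu gui.cu $(LDFLAGS)
```

Note that the $(shell sdl2-config ...) lines are kept exactly as in the original Makefile; only the compiler changes from clang to nvcc.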

Once you can build the program with nvcc, make sure it runs correctly. You’re still not doing anything on the GPU, but you at least have a solid starting point to begin moving calculations over to the GPU.

Commit your changes, push them to GitHub, and have the instructor or a mentor sign off on this part before you move on. We will check to make sure your .cu files are checked into the GitHub repository and you’ve used all the required nvcc options in your updated Makefile.

## Part B: Data Movement

The important physics calculations in this program begin on line 90 of main.c. This code loops over every star in the system and performs the following tasks:

1. Move the star along the x dimension based on its x velocity and the time step size, DT (line 91)
2. Move the star along the y dimension (line 92)

Then, an inner loop iterates over every star a second time to perform the following tasks:

1. Check if the two stars indicated by indices i and j are the same. If so, skip this iteration (line 97)
2. Compute the distance between the two stars in both dimensions (lines 100 and 101)
3. Compute the magnitude of the distance (line 104)
4. Normalize the x and y distances so we have a unit vector (lines 107 and 108)
5. If the distance is very small, set a minimum for our force calculation (lines 111–114)
6. Compute the acceleration due to the force between the stars i and j in both dimensions (lines 117 and 118)
7. Update the star velocity in both dimensions using the acceleration values and our timestep, DT (lines 121 and 122)
8. If the star is at the edge of the screen, bounce it back with half the velocity (lines 125–128)

All of these steps rely on the data in the stars array. Add code to create a GPU version of this array and use cudaMemcpy to transfer the array to the GPU. Pay close attention to the size of the stars array; this array grows whenever the user clicks to create a new star, so make sure you update the GPU array whenever the size changes. Unfortunately, there does not seem to be a cudaRealloc function, so you will have to duplicate that functionality with cudaFree and cudaMalloc.
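One possible shape for this host-side data movement is sketched below. It assumes a `star_t` struct type and `stars`/`num_stars` globals; your actual type and variable names in main.cu may differ, so treat this as an outline, not a drop-in solution:

```c
// Sketch only: star_t, stars, and num_stars are assumed from main.cu.
star_t* gpu_stars = NULL;  // device copy of the stars array
int gpu_capacity = 0;      // number of stars the device array can hold

// Call whenever the CPU copy of the array has changed (e.g. after a click).
void update_gpu_stars(void) {
  if (num_stars > gpu_capacity) {
    // There is no cudaRealloc, so free the old array and allocate a new one.
    if (gpu_stars != NULL) cudaFree(gpu_stars);
    if (cudaMalloc(&gpu_stars, sizeof(star_t) * num_stars) != cudaSuccess) {
      fprintf(stderr, "Failed to allocate stars array on the GPU\n");
      exit(2);
    }
    gpu_capacity = num_stars;
  }
  // Copy the current CPU array to the GPU.
  if (cudaMemcpy(gpu_stars, stars, sizeof(star_t) * num_stars,
                 cudaMemcpyHostToDevice) != cudaSuccess) {
    fprintf(stderr, "Failed to copy stars array to the GPU\n");
    exit(2);
  }
}
```

Growing the device array only when `num_stars` exceeds `gpu_capacity` avoids a free/malloc pair on every single click, but a simpler version that always reallocates is also acceptable for this part.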

Your code should still work after these changes, although you won’t actually be doing any computation on the GPU.

## Part C: Position Updates on the GPU

The easiest code to move to the GPU is the update to each star’s position, originally on lines 91 and 92 of main.c. Remove these two lines from the loop and instead perform this calculation with a GPU kernel. Instead of looping over each star in the stars array, you should have one CUDA thread update each star’s position. Don’t forget to copy the updated star positions back from the GPU after your kernel completes.
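A minimal sketch of what the kernel and its launch could look like follows. The `star_t` field names, the `gpu_stars` pointer, and the thread-count constant are all assumptions; match them to your own code:

```c
#define THREADS_PER_BLOCK 64  // a reasonable starting point; tune if you like

// Each CUDA thread updates the position of exactly one star.
__global__ void update_positions(star_t* stars, int num_stars) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_stars) return;  // extra threads in the last block do nothing
  stars[i].x_position += stars[i].x_velocity * DT;
  stars[i].y_position += stars[i].y_velocity * DT;
}

// Launch: round the block count up so every star gets a thread.
int blocks = (num_stars + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
update_positions<<<blocks, THREADS_PER_BLOCK>>>(gpu_stars, num_stars);
if (cudaDeviceSynchronize() != cudaSuccess) {
  fprintf(stderr, "update_positions kernel failed\n");
  exit(2);
}

// Copy the updated positions back so the GUI can draw them.
cudaMemcpy(stars, gpu_stars, sizeof(star_t) * num_stars,
           cudaMemcpyDeviceToHost);
```

The rounded-up block count means the final block may have threads with no star to update, which is why the kernel begins with a bounds check.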

Hint: the cudaDeviceSynchronize() function returns cudaSuccess if the kernel completed successfully, and an error code if it failed. You may want to look at the CUDA error-checking discussion on Stack Overflow for some good approaches to error handling.
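A common pattern from that discussion is a small checking macro that reports where a CUDA call failed. A sketch (the macro name is our own invention):

```c
// Print the CUDA error string and source location if a call fails, then exit.
#define CUDA_CHECK(expr) do {                                   \
    cudaError_t err__ = (expr);                                 \
    if (err__ != cudaSuccess) {                                 \
      fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
              __FILE__, __LINE__, cudaGetErrorString(err__));   \
      exit(2);                                                  \
    }                                                           \
  } while (0)

// Usage:
//   CUDA_CHECK(cudaMemcpy(...));
//   my_kernel<<<blocks, threads>>>(...);
//   CUDA_CHECK(cudaDeviceSynchronize());
```

Wrapping every runtime call this way makes it much easier to find which transfer or kernel launch is responsible when the simulation misbehaves.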

At this point your program should still work correctly. Do not move on until you have resolved any bugs in this stage.

## Part D: Force Calculations on the GPU

Now that you have successfully moved some of the updates to the GPU, move the rest of the physics calculation over to the GPU. Again, don’t forget to copy updated values back from the GPU once your kernel completes.

## Part E: Eliminate Unnecessary Copying

Now that you have a working CUDA port of this simulation, it’s time to think about performance. With GPUs, that usually means thinking about data movement. Odds are you are copying your stars array back and forth more than necessary. Think about when arrays need to be moved from the CPU to the GPU and vice versa; remember that memory on the GPU remains accessible after a kernel finishes, so there’s no need to copy the data to the GPU again if it hasn’t changed on the CPU.
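One way to think about the minimal schedule is sketched below in pseudocode, assuming the structure from the earlier parts; the exact events that dirty the CPU copy depend on your implementation:

```
each frame of the main loop:
  if a click added a star (or this is the first frame):
    copy stars CPU -> GPU          # the CPU copy changed
  run the position-update kernel
  run the force-calculation kernel # GPU data stays resident between kernels
  copy stars GPU -> CPU            # the GUI needs updated positions to draw
```

In other words, the host-to-device copy should happen only when the CPU side actually modifies the array, not unconditionally on every frame.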

To earn full credit on this lab you must eliminate all unnecessary transfers of the stars array. You do not need to eliminate fields from the stars struct in your copying; I will count the number of cudaMemcpy calls, not the number of bytes transferred.