In today’s mini-lab, we will continue our practice with CUDA for GPU programming. Today’s exercises focus on the types of memory available on a GPU system, and the mechanisms you use to move data between these types of memory. We won’t look at all the available types of memory, but this should give you a good foundation to think about how we deal with separate memory regions and reason about their properties.
Continue working with your groups from the previous GPU mini-lab.
The most basic question you have to ask about a region of memory in a GPU system is “is this memory accessible on the GPU?”
Host memory, which we’ve just called “memory” up to this point in the class, is the usual location for storing variables, code, constants, and any memory returned from malloc.
GPUs are separate devices, and generally cannot access host memory;
instead, they have their own large regions of memory called global memory.
If you declare a global variable such as
int numbers[128], you are asking for space in host memory to hold 128 integers. In contrast,
__device__ int numbers[128] requests space in the GPU’s global memory to hold 128 integers.
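The contrast between the two declarations can be sketched side by side (the device variable’s name here is just illustrative):

```cuda
// Ordinary global variable: 128 integers in host (CPU) memory.
int numbers[128];

// __device__ global variable: 128 integers in the GPU's global memory.
// Host code cannot dereference this directly; kernels can.
__device__ int device_numbers[128];
```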
While this is sometimes useful, we often use global memory to hold dynamically-allocated memory returned from
cudaMalloc. Where
malloc returns a pointer to a block of memory usable on the CPU,
cudaMalloc returns a pointer to a block of memory usable on the GPU.
You can use
cudaMemcpy with the
cudaMemcpyHostToDevice option as the fourth parameter to move memory from the host to the device, or pass
cudaMemcpyDeviceToHost as the fourth parameter to move memory from the GPU back to host memory.
Warning: pointers to GPU memory are not usable on the CPU, and vice versa!
If a pointer was returned from
cudaMalloc, it is only acceptable to dereference or index from that pointer on the GPU.
If a pointer was returned from
malloc, it is only acceptable to dereference or index from that pointer on the CPU.
You can pass values directly to kernels, but any time you pass a pointer you must be careful to check that the pointer points to GPU memory.
You’ll often have both CPU and GPU versions of arrays you move back and forth, so it’s a good idea to use a naming scheme like
int* cpu_numbers and
int* gpu_numbers to keep track of which is which.
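A minimal sketch of this round trip, using the naming scheme above (array length and error handling kept minimal for clarity):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main() {
    int length = 128;
    size_t bytes = length * sizeof(int);

    // CPU-side array: only dereference this on the host.
    int* cpu_numbers = (int*)malloc(bytes);
    for (int i = 0; i < length; i++) cpu_numbers[i] = i;

    // GPU-side array: only dereference this on the device.
    int* gpu_numbers;
    cudaMalloc((void**)&gpu_numbers, bytes);

    // Host -> device copy; the fourth parameter gives the direction.
    cudaMemcpy(gpu_numbers, cpu_numbers, bytes, cudaMemcpyHostToDevice);

    // ... launch kernels that operate on gpu_numbers here ...

    // Device -> host copy to bring results back.
    cudaMemcpy(cpu_numbers, gpu_numbers, bytes, cudaMemcpyDeviceToHost);

    cudaFree(gpu_numbers);
    free(cpu_numbers);
    return 0;
}
```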
Moving memory back and forth between the host and device is usually the bare minimum required to write functioning GPU code. Complete the exercises below to practice basic data movement.
Write a simple CUDA program to create an array of 1024 integers in the GPU’s global memory using
cudaMalloc. Initialize the array so index zero holds the value zero, index one holds the value one, index two holds the value two, and so on. Hint: You will need to initialize the values on the CPU first and copy them into GPU memory.
Now write a simple kernel,
doubleValues(int* numbers, int length), that doubles the value at each index of the
numbers array. You should double each individual index in its own thread. Use a block size of 64 threads. Wait for the kernel to complete with the
cudaDeviceSynchronize function, copy the values back from the GPU, and print all values from the CPU.
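One possible shape for this kernel and its launch, sketched below; this assumes the gpu_numbers pointer from the previous exercise, and is not the only correct answer:

```cuda
__global__ void doubleValues(int* numbers, int length) {
    // One thread per index: compute this thread's global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) {
        numbers[i] *= 2;
    }
}

// Launch with 1024 / 64 = 16 blocks of 64 threads each, then wait:
//   doubleValues<<<1024 / 64, 64>>>(gpu_numbers, 1024);
//   cudaDeviceSynchronize();
```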
Data movement is a major part of the time it takes to run programs on the GPU. We can minimize data movement by copying values back and forth only when necessary. Modify your program from exercise A.2 to call the
doubleValues kernel three times (multiplying each index by 8). You should be able to do this correctly using only two copies (one to the GPU, and one back to the host).
GPU memory is often a bit faster than CPU memory, but because of their data-parallel design GPUs need quite a bit more data to keep them busy when compared to a CPU. In addition to a GPU’s global memory, they also have local and shared memory regions that are faster, but limited in how they can be used. Local memory is where a thread’s local variables go, and you generally use it without thinking about it.
Shared memory is more interesting;
when you declare a local variable in your kernel with the
__shared__ type qualifier (e.g.
__shared__ int numbers[128]) you get a region of memory that all threads in the current block can access.
Shared memory is good for two important things.
First, shared memory is much faster than global memory and can be used to hold values you will use repeatedly.
Second, shared memory allows you to perform some operations using intermediate values computed inside a block of threads.
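A small sketch of the basic staging pattern, assuming a block size of 32 (the kernel name and buffer name are just illustrative):

```cuda
__global__ void stageValues(int* input) {
    // One buffer per block, visible to all 32 threads in that block.
    __shared__ int staged[32];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    staged[threadIdx.x] = input[i];   // each thread stages one value

    __syncthreads();                  // wait until every thread has written

    // All threads in the block can now read any entry of staged[].
}
```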
Complete the exercises below to practice using shared memory for both purposes.
Start by writing a simple CUDA program that computes finite differences across a sequence of 128 integers using a block size of 32 threads. This fancy-sounding method just means that you subtract the value at index i from the value at index i+1 to approximate the derivative. Populate the array so it represents the function y = x², or in C,
y[x] = x * x. Keep in mind that your output will have one less element than the input. Run your kernel and verify that the output of the kernel is close to the function y = 2x (in C,
y[x] = 2 * x), the derivative of our original function. The slope should be correct, but you may be off by a small constant offset; why might that happen?
The finite differences kernel does not fit neatly into the map and reduce patterns we discussed before the previous lab.
Instead, this is a pattern sometimes called a stencil.
Stencils work much like maps, but the kernel can look anywhere in the neighborhood of the value being mapped rather than just at the one value assigned to the mapper.
That means there will be a lot of accesses to global memory, and neighboring stencil threads will access some of the same values.
We can speed this process up by first copying values into shared memory.
Add a variable
__shared__ int local_values[33] to your kernel and have the threads in a given block copy values from global memory into this shared array.
You should do the copying in parallel so each thread copies one value, and then choose a thread to copy the 33rd value (if there is one in the source array).
Now modify the rest of your finite differences kernel so it computes differences by reading the shared array instead of global memory.
Verify that your kernel produces the same output.
If you are curious, you can try a larger input and compare running times between the two versions of the kernel;
the one that uses shared memory should be faster for large inputs.
Another use for shared memory is to actually share values between threads in a block.
To prepare for the next exercise, write a kernel that computes the sum of squares of all the values in an array.
The simplest way to do this is to read each index from global memory, square it, and then add it to a single global output value using
atomicAdd to prevent our updates from racing.
Test your kernel on an array of at least 100 integers and verify the output.
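The approach described above can be sketched as follows (kernel only; allocating and zeroing the global output value, and copying it back, work as in the earlier exercises):

```cuda
__global__ void sumOfSquares(int* numbers, int length, int* total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) {
        int squared = numbers[i] * numbers[i];
        // atomicAdd serializes concurrent updates so they do not race.
        atomicAdd(total, squared);
    }
}
```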
Remember that accesses to global memory are slow;
atomic accesses are even slower!
Instead of making one atomic update per index in the input array, we are going to use shared memory to coalesce updates.
Modify your kernel so each thread computes the square of its input value and stores that result in a shared array.
Then, call the
__syncthreads() function to force all threads in the block to wait (this is a barrier in CUDA).
After that point, have one thread sum the values in the shared array and make a single update to the global sum.
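The modified kernel might be shaped like this; a block size of 64 is an assumption of this sketch, and out-of-range threads contribute zero so partial blocks still sum correctly:

```cuda
__global__ void sumOfSquaresShared(int* numbers, int length, int* total) {
    __shared__ int squares[64];          // one slot per thread in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    squares[threadIdx.x] = (i < length) ? numbers[i] * numbers[i] : 0;

    __syncthreads();                     // barrier: every square is written

    if (threadIdx.x == 0) {              // one thread per block finishes up
        int block_sum = 0;
        for (int j = 0; j < 64; j++) block_sum += squares[j];
        atomicAdd(total, block_sum);     // one coalesced update per block
    }
}
```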
Does this coalesced update have to use
atomicAdd? Why or why not?