Mini Lab: Debugging

Assigned: Monday, Feb 5, 2018
Due: You do not need to submit any work for this mini lab.
Collaboration: You can work in groups of two or three students to complete this in-class lab.

Preparation

This exercise uses a collection of buggy programs I have prepared for you to practice using gdb. Before starting the lab, make a copy of the starter programs with the following command:

$ cp -R /home/curtsinger/Classes/2018S/csc213/gdb-practice gdb-practice

All commands in the remainder of this mini-lab should be run inside your copy of the gdb-practice directory.

Part A: Catching Segfaults

We’ll start out by looking at our first buggy program, partA. While the source code is in the directory you copied, this exercise will walk you through a debugging session without the source code. Instead, we’ll rely on gdb to show us lines of code where errors occurred.

Run this program outside of gdb to verify that it does indeed have a bug:

$ ./partA
Segmentation fault

A great next step to debug this program is to start it in gdb:

$ gdb ./partA
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
...
Reading symbols from ./partA...done.
(gdb) 

To run the program in gdb, enter the command run and hit enter. This time, you should end up with output that looks something like this:

Starting program: /home/awesomestudent/gdb-practice/partA 

Program received signal SIGSEGV, Segmentation fault.
0x0000000000400691 in total_characters (words=0x600b40 <arr>, num_words=26) at partA-rest.c:9
9	    while(words[i]->word[j++] != '\0') count += words[i]->count;

In this run of the program, gdb is telling us that a segmentation fault happened inside of the total_characters function on line 9 of a source file named partA-rest.c. Normally you will have access to the source code for the programs you are debugging, but often parts of your program (such as libraries) will not have debug information. To replicate that environment, this program has incomplete debug information. We can see that by running the backtrace (or bt) command:

(gdb) backtrace
#0  0x0000000000400691 in total_characters (words=0x600b40 <arr>, num_words=26)
    at partA-rest.c:9
#1  0x000000000040061c in main ()

This shows us the line where our segmentation fault occurred, and also tells us that this function was called from the main function. If we had debug information for main we would see line numbers here as well, but this program does not have debugging information for main so all we get is the symbol name and its address (0x000000000040061c).

I typically begin debugging segmentation faults or other types of errors that stop the program with two questions:

  1. What parts of the current line could have triggered the failure?
  2. How did we get to this error?

We can answer the second question using backtrace, but you will have to rely on your C knowledge to answer the first. We need to look at the source line where the error occurred, which may be off the screen at this point. To bring it back, use the frame command:

(gdb) frame
#0  0x0000000000400691 in total_characters (words=0x600b40 <arr>, num_words=26)
    at partA-rest.c:9
9	    while(words[i]->word[j++] != '\0') count += words[i]->count;

Our program crashed with a segmentation fault, which occurs when you dereference an invalid pointer. The pointer may be NULL, or it could hold some other invalid memory address. Work with your partner to come up with a list of all the parts of this line that dereference a pointer. Once you have a list, move on to the next step.

Hunting for invalid pointers

Once you have a list of operations that dereference pointers, you can use gdb to look at the pointer values to see if any of them are suspicious. One possible operation that dereferences a pointer is words[i]. If the words pointer is not valid, indexing into it as an array would trigger a segmentation fault. Use the print command to look at this value:

(gdb) print words
$1 = (word_count_t **) 0x600b40 <arr>

This shows us that words has type word_count_t**, and its value is 0x600b40. We can tell right away that words is not NULL (NULL is zero on most reasonable machines), but is 0x600b40 a valid pointer? You will gradually develop a sense of what a real pointer looks like, but you can check to see if an address is valid using gdb as well. The info proc mappings gdb command can show you all of the valid ranges in your program’s address space. Keep in mind that your output almost certainly will not match the example output below, so be sure to run the command on your own.

(gdb) info proc mappings
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000        0x0 /home/awesomestudent/gdb-practice/partA
            0x600000           0x601000     0x1000        0x0 /home/awesomestudent/gdb-practice/partA
            0x601000           0x622000    0x21000        0x0 [heap]
      0x7ffff7a31000     0x7ffff7bd2000   0x1a1000        0x0 /lib/x86_64-linux-gnu/libc-2.19.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x1a1000 /lib/x86_64-linux-gnu/libc-2.19.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x1a1000 /lib/x86_64-linux-gnu/libc-2.19.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x1a5000 /lib/x86_64-linux-gnu/libc-2.19.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000        0x0 
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000        0x0 /lib/x86_64-linux-gnu/ld-2.19.so
      0x7ffff7fc6000     0x7ffff7fc9000     0x3000        0x0 
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000        0x0 
      0x7ffff7ff8000     0x7ffff7ffa000     0x2000        0x0 [vdso]
      0x7ffff7ffa000     0x7ffff7ffc000     0x2000        0x0 [vvar]
      0x7ffff7ffc000     0x7ffff7ffd000     0x1000    0x20000 /lib/x86_64-linux-gnu/ld-2.19.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000 /lib/x86_64-linux-gnu/ld-2.19.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000        0x0 
      0x7ffffffde000     0x7ffffffff000    0x21000        0x0 [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000        0x0 [vsyscall]

This shows all of the virtual addresses accessible to this program, each established by the operating system. Most of these were set up via calls to mmap. Note that some mappings are placed at random locations, so your addresses may not match up exactly. If you look through the entries, you’ll see that words has a value that falls between the start and end addresses of the second entry. This entry is connected to the main program we are running, so this must be a global or static variable somewhere in the program.
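The range check we just did by eye can be written out as a short C sketch. The mapping_t type and the function names here are made up for illustration; only the addresses come from the example output above, and yours will differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// One row of the `info proc mappings` table. Addresses in [start, end) are valid.
// (Hypothetical type, used only for this illustration.)
typedef struct {
  uintptr_t start;
  uintptr_t end;
} mapping_t;

// Return true if addr falls inside any of the n mappings.
bool address_is_mapped(uintptr_t addr, const mapping_t* maps, int n) {
  for (int i = 0; i < n; i++) {
    assert(maps[i].start < maps[i].end);
    if (addr >= maps[i].start && addr < maps[i].end) return true;
  }
  return false;
}

// Check the value of words (0x600b40) against the first three rows
// of the example table.
bool example_words_is_mapped(void) {
  const mapping_t maps[] = {
    {0x400000, 0x401000},
    {0x600000, 0x601000},
    {0x601000, 0x622000},
  };
  return address_is_mapped(0x600b40, maps, 3);
}
```

Note that the end address is exclusive: 0x601000 belongs to the [heap] row, not to the row that ends there.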

Continue printing values of variables used on the current line until you have identified the offending pointer.

Using this information

Now that you’ve discovered the offending pointer, the next step is to examine the code of the main function to figure out why it is calling total_characters with an array that contains an invalid pointer. You don’t have to fix this error, but if this was your program then returning to the calling context would potentially help you figure out what is going on. As you’ll see in the next part, sometimes finding the corrupted value is just the first step in a longer debugging process.
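If this were your own program, one way to catch the problem closer to its source would be to check the array before passing it along. Here is a minimal sketch of that idea. It assumes a layout for word_count_t (the real definition may differ), first_null_entry is a hypothetical helper, and it only catches NULL entries; a pointer to unmapped memory would slip past it.

```c
#include <assert.h>
#include <stddef.h>

// ASSUMED layout, for illustration only; the real word_count_t may differ.
typedef struct {
  const char* word;
  int count;
} word_count_t;

// Return the index of the first NULL entry in words, or -1 if every
// entry is non-NULL. This does not detect non-NULL invalid pointers.
int first_null_entry(word_count_t** words, int num_words) {
  assert(words != NULL);
  for (int i = 0; i < num_words; i++) {
    if (words[i] == NULL) return i;
  }
  return -1;
}
```

Calling a check like this in main just before total_characters would turn a crash deep inside a loop into an early, specific report of which index is bad.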

Part B: Diagnosing Mysterious Bugs

For this part, we will look at a short program with exactly one memory error. Unlike in the previous example, you will have the complete program source code available for your use, in the file partB.c.

This is a pretty straightforward program that copies one array to another. If you see the error already, congratulations! But, for the purposes of this exercise, try not to use that information as you work through the following steps.

First, we’ll run the program without gdb:

$ ./partB
I've made a huge mistake.

Unlike our first example, this program does not stop at the point of an error. Instead, we just get the wrong result. Still, we can use gdb to track down the root cause of the error. Start the program with gdb:

$ gdb ./partB
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
...
Reading symbols from ./partB...done.
(gdb) run
Starting program: /home/awesomestudent/gdb-practice/partB 
I've made a huge mistake.
[Inferior 1 (process 3102) exited normally]

We’re still getting the wrong answer, so we can work backwards through the program. If the sums of the two arrays are not equal, we could check to see what the values of those sums are. A breakpoint is a reasonable way to do this. The program has computed the sums by line 28, so we’ll set a breakpoint, run the program again, and then print both sums with gdb:

(gdb) break partB.c:28
Breakpoint 1 at 0x4005c5: file partB.c, line 28.

(gdb) run
Starting program: /home/awesomestudent/gdb-practice/partB

Breakpoint 1, main () at partB.c:28
28	  if(array1_sum == array2_sum) {
  
(gdb) print array1_sum
$1 = -140137660

(gdb) print array2_sum
$2 = 15

It looks like array2_sum is computed correctly, but array1_sum is not. That’s odd, because we are only copying values from array1 to array2, and yet somehow array1 is being overwritten. This is evidence of a buffer overrun, which can be difficult to track down. However, gdb gives us the tools we need to catch this buffer overrun as it occurs. There are two approaches, each suited to different circumstances; we’ll track down the error with both.
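To see how a copy into array2 can damage array1, it may help to model memory as one flat buffer with both arrays inside it. The sketch below is a simulation, not the code in partB.c. The GAP of three ints between the arrays is an assumption chosen so that array2[8] lands on array1[0], and the real layout is up to the compiler. Because every access stays inside the mem buffer, the model itself has well-defined behavior.

```c
#include <assert.h>

#define LEN 5   // length of each array, as in the lab program
#define GAP 3   // ASSUMED padding between the arrays; real layouts vary
#define MEM 32  // size of the modeled memory region

// Model a region of memory holding array2 followed by array1, copy
// `iterations` elements from array1 to array2, then report both sums.
void simulate_copy(int iterations, int* array1_sum, int* array2_sum) {
  assert(iterations >= 0 && LEN + GAP + iterations <= MEM);

  int mem[MEM] = {0};
  int* array2 = mem;              // occupies mem[0..4]
  int* array1 = mem + LEN + GAP;  // occupies mem[8..12]
  for (int i = 0; i < LEN; i++) array1[i] = i + 1;

  // The copy loop. Once i reaches LEN, writes leave array2; once i
  // reaches LEN + GAP, they start landing in array1.
  for (int i = 0; i < iterations; i++) array2[i] = array1[i];

  *array1_sum = 0;
  *array2_sum = 0;
  for (int i = 0; i < LEN; i++) {
    *array1_sum += array1[i];
    *array2_sum += array2[i];
  }
}
```

With 5 iterations both sums come out to 15. With 9 iterations array2 still sums to 15, but array1[0] has been replaced by a zero from past the end of array1, so array1 sums to 14. In the real program the overwritten values come from whatever happens to be on the stack, which is why array1_sum looks like garbage.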

Catching the error with conditional breakpoints

We know that the values of array1 are being overwritten by some code in our program. Because we have a small program, we can actually narrow this down pretty easily; the only code that writes to memory in our program is the loop on lines 18–20. This loop just runs a few times, so we could set a breakpoint on each iteration of the loop and inspect the result:

(gdb) break partB.c:19
Breakpoint 2 at 0x40057f: file partB.c, line 19.

(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y

Starting program: /home/awesomestudent/gdb-practice/partB 

Breakpoint 2, main () at partB.c:19
19	    array2[i] = array1[i];

We’re now at the first write to memory. Because we suspect a buffer overrun, we should make sure our array indices are in-bounds. You can print i every time, or you can use gdb’s display command to print the value of a variable each time the program stops.

(gdb) display i
1: i = 0
(gdb) continue
Continuing.

Breakpoint 2, main () at partB.c:19
19	    array2[i] = array1[i];
1: i = 1

Now each time we continue, the program will stop at our breakpoint and print i. If you want to run the same command repeatedly, just hit Enter in gdb. This process will take us to the error eventually, and in our case after just a few iterations. However, this does not work well if your code loops thousands of times before an error occurs. For this, we can use conditional breakpoints.

First, remove all breakpoints from your program:

(gdb) delete
Delete all breakpoints? (y or n) y

Now, we’ll set a conditional breakpoint. We aren’t concerned about in-bounds writes to memory, but we do want to catch the first out-of-bounds write. That occurs for indices greater than or equal to 5, the length of array2.

(gdb) break partB.c:19 if i >= 5
Breakpoint 3 at 0x40057f: file partB.c, line 19.

(gdb) run
Starting program: /home/awesomestudent/gdb-practice/partB 

Breakpoint 3, main () at partB.c:19
19	    array2[i] = array1[i];
1: i = 5

Now we’ve stopped the program at exactly the point where an out-of-bounds write occurs. Given this information, you can go back to the code and figure out why this loop is running for too many iterations. This approach works well when you have a good idea of where a buffer overrun is occurring and you want to catch it “in the act.” However, it’s not always clear which code you should be checking. That’s where watchpoints are useful.

Catching the error with watchpoints

In this case, we’re going to ignore the code and instead watch for modifications to memory. First, delete our breakpoints, set a new breakpoint after we’ve added up the arrays, and run the program.

(gdb) delete
Delete all breakpoints? (y or n) y

(gdb) break partB.c:28
Breakpoint 4 at 0x4005c5: file partB.c, line 28.

(gdb) run
Starting program: /home/awesomestudent/gdb-practice/partB 

Breakpoint 4, main () at partB.c:28
28	  if(array1_sum == array2_sum) {

Now we’re at the point where we’ve computed invalid array sums. Instead of looking at the sums themselves, we’ll look inside the arrays. The two arrays should be equal, so we’ll look for indices that are not identical:

(gdb) print array1[0]
$4 = 0
(gdb) print array2[0]
$5 = 1

That didn’t take long. Recall that earlier we discovered that the sum of array1 was incorrect. Somehow, array1[0] is being overwritten. To catch this overwriting, we’ll delete our breakpoints and start the program again. The start command will begin executing the program and stop once we reach main.

(gdb) delete
Delete all breakpoints? (y or n) y

(gdb) start
Temporary breakpoint 5 at 0x400553: file partB.c, line 14.
Starting program: /home/awesomestudent/gdb-practice/partB 

Temporary breakpoint 5, main () at partB.c:14
14	  int array1[] = {1, 2, 3, 4, 5};

Now that the program has started, we can set a watchpoint to monitor array1[0] for changes. If we tried to do this before starting the program, we might get the wrong address; many parts of the program are loaded at random addresses on each run, so we need to make sure we get addresses from the current run.

(gdb) watch array1[0]
Hardware watchpoint 6: array1[0]

(gdb) continue
Continuing.
Hardware watchpoint 6: array1[0]

Old value = 4195824
New value = 1
0x000000000040055a in main () at partB.c:14
14	  int array1[] = {1, 2, 3, 4, 5};

We’ve now stopped our program at the first modification to array1[0]. This is actually initializing the array, so we haven’t found the write we’re hunting for.

(gdb) continue
Continuing.
Hardware watchpoint 6: array1[0]

Old value = 1
New value = 0
main () at partB.c:18
18	  for(int i=0; i<sizeof(array1); i++) {
1: i = 8

Now we’ve stopped the program at the point where array1[0] is overwritten. This brings us to the same point we reached with conditional breakpoints, but we did not need to know which code was overwriting the array contents. In general, watchpoints are useful when you know a value is being changed but you don’t know why. I recommend using these over conditional breakpoints in most cases, but they have some limitations. On x86 processors you are limited to just four hardware watchpoints at a time, and each hardware watchpoint can only detect modifications to a range of 1, 2, 4, or 8 bytes, not an entire array or a large struct.

Wrapping Up

Now that you’ve tracked down the problem with partB.c, make sure you know how to fix it. This case is somewhat contrived, but hopefully you can take some of these techniques and use them for debugging your own programs in the future. There are many more gdb commands, so I recommend running the help command to see what commands are available. The gdb command line also includes Tab completion, so you can auto-complete many commands if you remember how they start. If you learn any new, useful gdb commands, please share them with the class!
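If you want to check your fix against one possibility, here is a sketch. It is built around the loop gdb showed us on line 18, where the bound is sizeof(array1). In C, sizeof applied to an array gives its size in bytes, not its number of elements. The ARRAY_LEN macro and the copy_and_compare function are illustrative names, not part of partB.c.

```c
#include <assert.h>
#include <stddef.h>

// Number of elements in a true array (this does not work on a pointer).
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

// Copy array1 into array2 and return 1 if their sums match, 0 otherwise.
int copy_and_compare(void) {
  int array1[] = {1, 2, 3, 4, 5};
  int array2[ARRAY_LEN(array1)];
  assert(ARRAY_LEN(array2) == ARRAY_LEN(array1));

  // sizeof(array1) is the size in bytes; ARRAY_LEN(array1) is the
  // element count (5), which is the bound the loop needs.
  for (size_t i = 0; i < ARRAY_LEN(array1); i++) {
    array2[i] = array1[i];
  }

  int array1_sum = 0;
  int array2_sum = 0;
  for (size_t i = 0; i < ARRAY_LEN(array1); i++) {
    array1_sum += array1[i];
    array2_sum += array2[i];
  }
  return array1_sum == array2_sum;
}
```

With the corrected bound, every write stays inside array2 and the two sums agree.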

There are quite a few gdb commands we did not use in this example. Two notable examples are step and next, which allow you to walk through your program one line at a time. In general, you want to avoid doing this; use breakpoints and watchpoints to stop the program at the point you want instead of running the program one line at a time. Sometimes, though, you have no choice but to step through a program line by line. The next command will go to the next line of the current function. If the current line calls another function, gdb will execute that function and break when it returns. The step command will run until the next source line, whether it is in the same function or not. These are equivalent to the “step over” and “step into” operations that many graphical debuggers support.