memory leaks in ubuntu: episode II, analysis

In our last exciting episode, we learned how to capture a valgrind log. Today we’re going to take the next step and learn how to actually use it to debug memory leaks.

There are a few prerequisites:

  1. know C. If you don’t know it, go read the C programming language which is often referred to as K&R C. Be sure to understand the sections on pointers, and after you do, come back to my blog. See you in 2 weeks!
  2. a nice supply of your favorite beverages and snacks. I prefer coffee and bacon, myself. Get ready because you’re about to read an epic 2276 word blog entry.

That’s it. Ok, ready? Let’s go!

navigate the valgrind log
Open the valgrind log that you collected. If you don’t have one, you can grab one that I’ve already collected. Take a deep breath. It looks scary but it’s not so bad. I like to skip straight to the good part near the bottom. Search the file for “LEAK SUMMARY”. You’ll see something like:

==13124== LEAK SUMMARY:
==13124==    definitely lost: 916,130 bytes in 37,528 blocks
==13124==    indirectly lost: 531,034 bytes in 12,735 blocks
==13124==      possibly lost: 82,297 bytes in 891 blocks
==13124==    still reachable: 2,578,733 bytes in 42,856 blocks
==13124==         suppressed: 0 bytes in 0 blocks
==13124== Reachable blocks (those to which a pointer was found) are not shown.
==13124== To see them, rerun with: --leak-check=full --show-reachable=yes

You can see that valgrind thinks we’ve definitely leaked some memory. So let’s go figure out what leaked.

Valgrind lists all the leaks, in order from smallest to largest. The leaks are also categorized as “possibly” or “definitely”. We’ll want to focus on “definitely” for now. Right above the summary, you’ll see the worst, definite leak:

==13124== 317,347 (77,312 direct, 240,035 indirect) bytes in 4,832 blocks are definitely lost in loss record 10,353 of 10,353
==13124==    at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13124==    by 0x74E3A78: g_malloc (gmem.c:159)
==13124==    by 0x74F6CA2: g_slice_alloc (gslice.c:1003)
==13124==    by 0x74F7ABD: g_slist_prepend (gslist.c:265)
==13124==    by 0x4275A4: get_menu_item_for_ap (applet-device-wifi.c:879)
==13124==    by 0x427ACE: wireless_add_menu_item (applet-device-wifi.c:1138)
==13124==    by 0x41815B: nma_menu_show_cb (applet.c:1643)
==13124==    by 0x4189EC: applet_update_indicator_menu (applet.c:2218)
==13124==    by 0x74DDD52: g_main_context_dispatch (gmain.c:2539)
==13124==    by 0x74DE09F: g_main_context_iterate.isra.23 (gmain.c:3146)
==13124==    by 0x74DE499: g_main_loop_run (gmain.c:3340)
==13124==    by 0x414266: main (main.c:106)

Wow, we lost 300K of memory in just a few hours. Now imagine if you don’t reboot your laptop for a week. Yeah, that’s not so good. Time for a coffee and bacon break, the next part is about to get fun.

read the stack trace
What you saw above is a stack trace, and it’s printed chronologically “backwards”. In this example, malloc() was called by g_malloc(), which was called by g_slice_alloc(), which in turn was called by g_slist_prepend(), which itself was called by get_menu_item_for_ap() and so forth. The first function ever called was main(), which should hopefully make sense.

At this point, we need to use a little bit of extra knowledge to understand what is happening. The first function, main() is in our program, nm-applet. That’s fairly easy to understand. However, the next few functions that begin with g_main_ don’t actually live inside nm-applet. They are part of glib, which is a library that nm-applet depends on. I happened to have just known this off the top of my head, but if you’re ever unsure, you can just google for the function name. After searching, we can see that those functions are in glib, and while there is some magic that is happening, we can blissfully ignore it because we see that we soon jump back into nm-applet code, starting with applet_update_indicator_menu().

a quick side note
Many Linux programs will have a stack trace similar to the above. The program starts off in its own main(), but will call various other libraries on your system, such as glib, and then jump back to itself. What’s going on? Well, glib provides a feature known as a “main loop” which is used by the program to look for inputs and events, and then react to them. It’s a common programming paradigm, and rather than have every application in the world write their own main loop, it’s easier if everyone just uses the one provided by glib.

The other observation is to note how the function names appear prominently in the stack trace. Pundits wryly say that naming things is one of the hardest things in computer science, and I completely agree. So take care when naming your functions, because people other than you will definitely see them and need to understand them!

Alright, let’s get back to the stack trace. We can see a few functions that look like they belong to nm-applet, based on their names and their associated filenames. For example, the function wireless_add_menu_item() is in the file applet-device-wifi.c on line 1138. Now you see why we wanted symbols from the last episode. Without the debug symbols, all we would have seen would have been a bunch of useless ??? and we’d be gnashing our teeth and wishing for more bacon right now.

Finally, we see a few more g_* functions, which means we’re back in the memory allocation functions provided by glib. It’s important to understand at this point that g_malloc() is not the memory leak. g_malloc() is simply doing whatever nm-applet asks it to do, which is to allocate memory. The leak is highly likely to be in nm-applet losing a reference to the pointer returned by g_malloc().

What does it mean?
Now we’re ready to start the real debugging. We know approximately where we are leaking memory inside nm-applet: get_menu_item_for_ap() which is the last function before calling the g_* memory functions. Time to top off on coffee because we’re about to get our hands dirty.

reading the source
The whole point of open source is being able to read the source. Are you as excited as I am? I know you are!

First, let’s get the source to nm-applet. Assuming you are using Ubuntu and you are using 12.04, you’d simply say:

$ cd Projects
$ mkdir network-manager-gnome
$ cd network-manager-gnome
$ apt-get source network-manager-gnome
$ cd network-manager-applet-0.9.4.1

Woo hoo! That wasn’t hard, right?

side note #2
Contrary to popular belief, reading code is harder than writing code. When you write code, you are transmitting the thoughts of your messy brain into an editor, and as long as it kinda works, you’re happy. When you read code, now you’re faced with the problem of trying to understand exactly what the previous messy brain wrote down and making sense of it. Depending on how messy that previous brain was, you may have real trouble understanding the code. This is where pencil and paper and plenty of coffee come into play, where you literally trace through what the program is doing to try and understand it.

Luckily there are at least a few tools to help you do this. My favorite tools are cscope and ctags, which help me to rapidly understand the skeleton of a program and navigate around its complex structure.

Assuming you are in the network-manager-applet-0.9.4.1 source tree:

$ apt-get install cscope ctags
$ cscope -bqR
$ ctags -R
$ cscope -dp4


You are now presented with a menu. Use control-n and control-p to navigate input fields at the bottom. Try navigating to “Find this C symbol:” and then type in get_menu_item_for_ap, and press enter. The search results are returned, and you can press ‘0’ or ‘1’ to jump to either of the locations where the function is referenced. You can also press the space bar to see the final search result. Play around with some of the other search types and see what happens. I’ll talk about ctags in a bit.

Alrighty, let’s go looking for our suspicious nm-applet function. Start up cscope as described above. Navigate to “Find this global definition:” and search for get_menu_item_for_ap. cscope should just directly put you in the right spot.

Based on our stack trace, it looks like we’re doing something suspicious on line 879, so let’s go see what it means.

 869         if (dup_data.found) {
 870 #ifndef ENABLE_INDICATOR
 871                 nm_network_menu_item_best_strength (dup_data.found, nm_acce
 872                 nm_network_menu_item_add_dupe (dup_data.found, ap);
 873 #else
 874                 GSList *dupes = NULL;
 875                 const char *path;
 876 
 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupe
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));
 880 #endif
 881                 return NULL;
 882         }

Cool, we can now see where the source code is matching up with the valgrind log.

Let’s start doing some analysis. The first thing to note are the #ifdef blocks on lines 870, 873, and 880. You should know that ENABLE_INDICATOR is defined, meaning we do not execute the code in lines 871 and 872. Instead, we do lines 874 to 879, and then we do 881. Why do we do 881 if it is after the #endif? That’s because we fell off the end of the #ifdef block, and then we do whatever is next, after we fall off, namely returning NULL.

Don’t worry, I don’t know what’s going on yet, either. Time for a refill!

Back? Great. Alright, valgrind says that we’re doing something funky with g_slist_prepend().

==13124==    by 0x74F7ABD: g_slist_prepend (gslist.c:265)

And our relevant code is:

 874                 GSList *dupes = NULL;
 875                 const char *path;
 876 
 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupe
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));
 880 #endif
 881                 return NULL;

We can see that we declare the pointer *dupes on line 874, but we don’t do anything with it. Then, we assign something to it on line 877. Then, we assign something to it again on line 879. Finally, we end up not doing anything with *dupes at all, and just return NULL on line 881.

This definitely seems weird and worth a second glance. At this point, I’m asking myself the following questions:

  • did g_object_get_data() allocate memory?
  • did g_slist_prepend() allocate memory?
  • are we overwriting *dupes on line 879? that might be a leak.
  • is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?

Let’s take them in order.

did g_object_get_data() allocate memory?
g_object_get_data has online documentation, so that’s our first stop. The documentation says:

Returns :
the data if found, or NULL if no such data exists. [transfer none]

Since I am not 100% familiar with glib terminology, I guess [transfer none] means that g_object_get_data() doesn’t actually allocate memory on its own. But let’s be 100% sure. Time to grab the glib source and find out for ourselves.

$ apt-get source libglib2.0-0
$ cd glib2.0-2.32.1
$ cscope -bqR
$ ctags -R
$ cscope -dp4
search for global definition of g_object_get_data

Pretty simple function.

3208 gpointer
3209 g_object_get_data (GObject     *object,
3210                    const gchar *key)
3211 {
3212   g_return_val_if_fail (G_IS_OBJECT (object), NULL);
3213   g_return_val_if_fail (key != NULL, NULL);
3214 
3215   return g_datalist_get_data (&object->qdata, key);
3216 }

Except I have no idea what g_datalist_get_data() does. Maybe that guy is allocating memory. Now I’ll use ctags to make my life easier. In vim, put your cursor over the “g” in “g_datalist_get_data” and then press control-]. This will “step into” the function. Magic!

 844 gpointer
 845 g_datalist_get_data (GData       **datalist,
 846                      const gchar *key)
 847 {
 848   gpointer res = NULL; 
 ... 
 856   d = G_DATALIST_GET_POINTER (datalist);
 ...
 859       data = d->data;
 860       data_end = data + d->len;
 861       while (data < data_end)
 862         {
 863           if (strcmp (g_quark_to_string (data->key), key) == 0)
 864             {
 865               res = data->data;
 866               break;
 867             }
 868           data++;
 869         }
 ... 
 874   return res;
 875 }

This is a pretty simple loop, walking through an existing list of pointers which have already been allocated somewhere else, starting on line 861. We do our comparison on line 863, and if we get a match, we assign whatever we found to res on line 865. Note that all we are doing here is a simple assignment. We are not allocating any memory!

Finally, we return our pointer on line 874. Press control-t in vim to pop back to your last location.

Now we know for sure that g_object_get_data() and g_datalist_get_data() do not allocate any memory at all, so there can be no possibility of a leak here. Let’s try the next function.

did g_slist_prepend() allocate memory?
First, read the documentation, which says:

The return value is the new start of the list, which may have changed, so make sure you store the new value.

This probably means it allocates memory for us, but let’s double-check just to be sure. Back to cscope!

 259 GSList*
 260 g_slist_prepend (GSList   *list,
 261                  gpointer  data)
 262 {
 263   GSList *new_list;
 264 
 265   new_list = _g_slist_alloc ();
 266   new_list->data = data;
 267   new_list->next = list;
 268 
 269   return new_list;
 270 }

Ah ha! Look at line 265. We are 100% definitely allocating memory, and returning it on line 269. Things are looking up! Let’s keep going with our questions.

are we overwriting *dupes on line 879? that might be a leak.
Remember:

 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupe
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));

We’ve already proven to ourselves that line 877 doesn’t allocate any memory. It just sets dupes to some value. However, on line 879, we do allocate memory. It is equivalent to this code:

  int *dupes;
  dupes = 0x12345678;
  dupes = malloc(128);

So simply setting dupes to the return value of g_object_get_data() and later overwriting it with the return value of malloc() does not inherently cause a leak.

By way of counter-example, the below code is a memory leak:

  int *dupes;
  dupes = malloc(64);
  dupes = malloc(128);    /* leak! */

The above essentially illustrates the scenario I was worried about. I was worried that g_object_get_data() allocated memory, and then g_slist_prepend() also allocated memory which would have been a leak because the first value of dupes got scribbled over by the second value. My worry turned out to be incorrect, but that is the type of detective work you have to think about.

As a clearer example of why the above is a leak, consider the next snippet:

  int *dupes1, *dupes2;
  dupes1 = malloc(64);     /* ok */
  dupes2 = malloc(128);    /* ok */
  dupes1 = dupes2;         /* leak! */

First we allocate dupes1. Then allocate dupes2. Finally, we set dupes1 = dupes2, and now we have a leak. No one knows what the old value of dupes1 was, because we scribbled over it, and it is gone forever.

is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?
We can definitively say that it is not safe to return NULL without doing anything to dupes. We definitely allocated memory, stuck it into dupes, and then threw dupes away. This is our smoking gun.

Next time, we’ll see how to actually fix the problem.