memory leaks in ubuntu: episode III, fixing

Appy polly loggies for the super long delay between episodes, but I finally carved out some time for our exciting dénouement in the memory leak detection series. Past episodes included detection and analysis.

As a gentle reminder, during analysis, we saw the following block of code:

 874                 GSList *dupes = NULL;
 875                 const char *path;
 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupes");
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));
 880 #endif
 881                 return NULL;

And we concluded with:

Is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?
We can definitively say that it is not safe to return NULL without doing anything to dupes. We definitely allocated memory, stuck it into dupes, and then threw dupes away. This is our smoking gun.

But there’s a twist! Eagle-eyed reader Dave Jackson (a former colleague of mine from HP, natch) spotted a second leak! It turns out that line 879 was exceptionally leaky during its inception. As Dave points out, the call to g_slist_prepend() passes g_strdup() as an argument. And as the documentation says:

Duplicates a string. If str is NULL it returns NULL. The returned string should be freed with g_free() when no longer needed.

In memory-managed languages like python, the above idiom of passing a function as an argument to another function is quite common. However, one needs to be more careful about doing so in C and C++, taking great care to observe if your function-as-argument allocates memory and returns it. There is no mechanism in the language itself to automatically free memory in the above situation, and thus the call to g_strdup() seems like it also leaks memory. Yowza!

So, what to do about it?

The basic goal here is that we don’t want to throw dupes away. We need to actually do something with it. Here again are the 3 most pertinent lines.

 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupes");
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));
 881                 return NULL;

Let’s break these lines down.

  1. On line 877, we retrieve the dupes list from the dup_data.found object
  2. Line 878 gets a path to the duplicate wifi access point
  3. Finally, line 879 adds the duplicate access point to the old dupes list
  4. Line 881 throws it all away!

To me, the obvious thing to do is to change the code between lines 879 and 881, so that after we modify the duplicates list, we save it back into the dup_data object. That way, the next time around, the list stored inside of dup_data will have our updated list. Makes sense, right?

As long as you agree with me conceptually (and I hope you do), I’m going to take a quick shortcut and show you the end result of how to store the new list back into the dup_data object. The reason for the shortcut is that we are now deep in the details of how to program using the glib API, and like many powerful APIs, the key is to know which functions are necessary to accomplish your goal. Since this is a memory leak tutorial and not a glib API tutorial, just trust me that the patch hunk will properly store the dupes list back into the dup_data object. And if it’s confusing, as always, read the documentation for g_object_steal_data and g_object_set_data_full.

@@ -706,14 +706,15 @@
 +		GSList *dupes = NULL;
 +		const char *path;
-+		dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupes");
++		dupes = g_object_steal_data (G_OBJECT (dup_data.found), "dupes");
 +		path = nm_object_get_path (NM_OBJECT (ap));
 +		dupes = g_slist_prepend (dupes, g_strdup (path));
++		g_object_set_data_full (G_OBJECT (dup_data.found), "dupes", (gpointer) dupes, (GDestroyNotify) clear_dupes_list);
  		return NULL;

If the above patch format looks funny to you, it’s because we are changing a patch.

-+		dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupes");
++		dupes = g_object_steal_data (G_OBJECT (dup_data.found), "dupes");

This means the old patch had the line calling g_object_get_data() and the refreshed patch now calls g_object_steal_data() instead. Likewise…

++		g_object_set_data_full (G_OBJECT (dup_data.found), "dupes", (gpointer) dupes, (GDestroyNotify) clear_dupes_list);

The above call to g_object_set_data_full is a brand new line in the new and improved patch.

Totally clear, right? Don’t worry, the more sitting and contemplating of the above you do, the fuller and more awesomer your neckbeard grows. Don’t forget to check it every once in a while for small woodland creatures who may have taken up residence there.

And thus concludes our series on how to detect, analyze, and fix memory leaks. All good? Good.


I can hear the observant readers out there already frantically scratching their necks and getting ready to point out the mistake I made. After all, our newly refreshed patch still contains this line:

 +		dupes = g_slist_prepend (dupes, g_strdup (path));

And as we determined earlier, that’s our incepted memory leak, right? RIGHT!?‽

Not so fast. Take a look at the new line in our updated patch:

++		g_object_set_data_full (G_OBJECT (dup_data.found), "dupes", (gpointer) dupes, (GDestroyNotify) clear_dupes_list);

See that? The last argument to g_object_set_data_full() looks quite interesting indeed. It is in fact, a cleanup function named clear_dupes_list(), which according to the documentation, will be called

when the association is destroyed, either by setting it to a different value or when the object is destroyed.

In other words, when we are ready to get rid of the dup_data.found object, as part of cleaning up that object, we’ll call the clear_dupes_list() function. And what does clear_dupes_list() do, praytell? Why, let me show you!

static void
clear_dupes_list (GSList *list)
	g_slist_foreach (list, (GFunc) g_free, NULL);
	g_slist_free (list);

Trés interesante! You can see that we iterate across the dupes list, and call g_free on each of the strings we did a g_strdup() on before. So there wasn’t an inception leak after all. Tricky tricky.

A quick digression is warranted here. Contrary to popular belief, it is possible to write object oriented code in plain old C, with inheritance, method overrides, and even some level of “automatic” memory management. You don’t need to use C++ or python or whatever the web programmers are using these days. It’s just that in C, you build the OO features you want yourself, using primitives such as structs and function pointers and smart interface design.

Notice above we have specified that whenever the dup_data object is destroyed, it will free the memory that was stuffed into it. Yes, we had to specify the cleanup function manually, but we are thinking of our data structures in terms of objects.

In fact, the fancy features of many dynamic languages are implemented just this way, with the language keeping track of your objects for you, allocating them when you need, and freeing them when you’re done with them.

Because at the end of the day, it is decidedly not turtles all the way down to the CPU. When you touch memory in in python or ruby or javascript, I guarantee that something is doing the bookkeeping on your memory, and since CPUs only understand assembly language, and C is really just pretty assembly, you now have a decent idea of how those fancy languages actually manage memory on your behalf.

And finally now that you’ve seen just how tedious and verbose it is to track all this memory, it should no longer be a surprise to you that most fancy languages are slower than C. Paperwork. It’s always paperwork.

And here we come to the upshot, which is, tracking down memory leaks can be slow and time consuming and trickier than first imagined (sorry for the early head fake). But with the judicious application of science and taking good field notes, it’s ultimately just like putting a delicious pork butt in the slow cooker for 24 hours. Worth the wait, worth the effort, and it has a delicious smoky sweet payoff.

Happy hunting!

kalua pork + homemade mayo and cabbage

memory leaks in ubuntu: episode II, analysis

In our last exciting episode, we learned how to capture a valgrind log. Today we’re going to take the next step and learn how to actually use it to debug memory leaks.

There are a few prerequisites:

  1. know C. If you don’t know it, go read the C programming language which is often referred to as K&R C. Be sure to understand the sections on pointers, and after you do, come back to my blog. See you in 2 weeks!
  2. a nice supply of your favorite beverages and snacks. I prefer coffee and bacon, myself. Get ready because you’re about to read an epic 2276 word blog entry.

That’s it. Ok, ready? Let’s go!

navigate the valgrind log
Open the valgrind log that you collected. If you don’t have one, you can grab one that I’ve already collected. Take a deep breath. It looks scary but it’s not so bad. I like to skip straight to the good part near the bottom. Search the file for “LEAK SUMMARY”. You’ll see something like:

==13124== LEAK SUMMARY:
==13124==    definitely lost: 916,130 bytes in 37,528 blocks
==13124==    indirectly lost: 531,034 bytes in 12,735 blocks
==13124==      possibly lost: 82,297 bytes in 891 blocks
==13124==    still reachable: 2,578,733 bytes in 42,856 blocks
==13124==         suppressed: 0 bytes in 0 blocks
==13124== Reachable blocks (those to which a pointer was found) are not shown.
==13124== To see them, rerun with: --leak-check=full --show-reachable=yes

You can see that valgrind thinks we’ve definitely leaked some memory. So let’s go figure out what leaked.

Valgrind lists all the leaks, in order from smallest to largest. The leaks are also categorized as “possibly” or “definitely”. We’ll want to focus on “definitely” for now. Right above the summary, you’ll see the worst, definite leak:

==13124== 317,347 (77,312 direct, 240,035 indirect) bytes in 4,832 blocks are definitely lost in loss record 10,353 of 10,353
==13124==    at 0x4C2B6CD: malloc (in /usr/lib/valgrind/
==13124==    by 0x74E3A78: g_malloc (gmem.c:159)
==13124==    by 0x74F6CA2: g_slice_alloc (gslice.c:1003)
==13124==    by 0x74F7ABD: g_slist_prepend (gslist.c:265)
==13124==    by 0x4275A4: get_menu_item_for_ap (applet-device-wifi.c:879)
==13124==    by 0x427ACE: wireless_add_menu_item (applet-device-wifi.c:1138)
==13124==    by 0x41815B: nma_menu_show_cb (applet.c:1643)
==13124==    by 0x4189EC: applet_update_indicator_menu (applet.c:2218)
==13124==    by 0x74DDD52: g_main_context_dispatch (gmain.c:2539)
==13124==    by 0x74DE09F: g_main_context_iterate.isra.23 (gmain.c:3146)
==13124==    by 0x74DE499: g_main_loop_run (gmain.c:3340)
==13124==    by 0x414266: main (main.c:106)

Wow, we lost 300K of memory in just a few hours. Now imagine if you don’t reboot your laptop for a week. Yeah, that’s not so good. Time for a coffee and bacon break, the next part is about to get fun.

read the stack trace
What you saw above is a stack trace, and it’s printed chronologically “backwards”. In this example, malloc() was called by g_malloc(), which was called by g_slice_alloc(), which in turn was called by g_slist_prepend(), which itself was called by get_menu_item_for_ap() and so forth. The first function ever called was main(), which should hopefully make sense.

At this point, we need to use a little bit of extra knowledge to understand what is happening. The first function, main() is in our program, nm-applet. That’s fairly easy to understand. However, the next few functions that begin with g_main_ don’t actually live inside nm-applet. They are part of glib, which is a library that nm-applet depends on. I happened to have just known this off the top of my head, but if you’re ever unsure, you can just google for the function name. After searching, we can see that those functions are in glib, and while there is some magic that is happening, we can blissfully ignore it because we see that we soon jump back into nm-applet code, starting with applet_update_indicator_menu().

a quick side note
Many Linux programs will have a stack trace similar to the above. The program starts off in its own main(), but will call various other libraries on your system, such as glib, and then jump back to itself. What’s going on? Well, glib provides a feature known as a “main loop” which is used by the program to look for inputs and events, and then react to them. It’s a common programming paradigm, and rather than have every application in the world write their own main loop, it’s easier if everyone just uses the one provided by glib.

The other observation is to note how the function names appear prominently in the stack trace. Pundits wryly say that naming things is one of the hardest things in computer science, and I completely agree. So take care when naming your functions, because people other than you will definitely see them and need to understand them!

Alright, let’s get back to the stack trace. We can see a few functions that look like they belong to nm-applet, based on their names and their associated filenames. For example, the function wireless_add_menu_item() is in the file applet-device-wifi.c on line 1138. Now you see why we wanted symbols from the last episode. Without the debug symbols, all we would have seen would have been a bunch of useless ??? and we’d be gnashing our teeth and wishing for more bacon right now.

Finally, we see a few more g_* functions, which means we’re back in the memory allocation functions provided by glib. It’s important to understand at this point that g_malloc() is not the memory leak. g_malloc() is simply doing whatever nm-applet asks it to do, which is to allocate memory. The leak is highly likely to be in nm-applet losing a reference to the pointer returned by g_malloc().

What does it mean?
Now we’re ready to start the real debugging. We know approximately where we are leaking memory inside nm-applet: get_menu_item_for_ap() which is the last function before calling the g_* memory functions. Time to top off on coffee because we’re about to get our hands dirty.

reading the source
The whole point of open source is being able to read the source. Are you as excited as I am? I know you are!

First, let’s get the source to nm-applet. Assuming you are using Ubuntu and you are using 12.04, you’d simply say:

$ cd Projects
$ mkdir network-manager-gnome
$ cd network-manager-gnome
$ apt-get source network-manager-gnome
$ cd network-manager-applet-

Woo hoo! That wasn’t hard, right?

side note #2
Contrary to popular belief, reading code is harder than writing code. When you write code, you are transmitting the thoughts of your messy brain into an editor, and as long as it kinda works, you’re happy. When you read code, now you’re faced with the problem of trying to understand exactly what the previous messy brain wrote down and making sense of it. Depending on how messy that previous brain was, you may have real trouble understanding the code. This is where pencil and paper and plenty of coffee come into play, where you literally trace through what the program is doing to try and understand it.

Luckily there are at least a few tools to help you do this. My favorite tools are cscope and ctags, which help me to rapidly understand the skeleton of a program and navigate around its complex structure.

Assuming you are in the network-manager-applet- source tree:

$ apt-get install cscope ctags
$ cscope -bqR
$ ctags -R
$ cscope -dp4

You are now presented with a menu. Use control-n and control-p to navigate input fields at the bottom. Try navigating to “Find this C symbol:” and then type in get_menu_item_for_ap, and press enter. The search results are returned, and you can press ‘0’ or ‘1’ to jump to either of the locations where the function is referenced. You can also press the space bar to see the final search result. Play around with some of the other search types and see what happens. I’ll talk about ctags in a bit.

Alrighty, let’s go looking for our suspicious nm-applet function. Start up cscope as described above. Navigate to “Find this global definition:” and search for get_menu_item_for_ap. cscope should just directly put you in the right spot.

Based on our stack trace, it looks like we’re doing something suspicious on line 879, so let’s go see what it means.

 869         if (dup_data.found) {
 871                 nm_network_menu_item_best_strength (dup_data.found, nm_acce
 872                 nm_network_menu_item_add_dupe (dup_data.found, ap);
 873 #else
 874                 GSList *dupes = NULL;
 875                 const char *path;
 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupe
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));
 880 #endif
 881                 return NULL;
 882         }

Cool, we can now see where the source code is matching up with the valgrind log.

Let’s start doing some analysis. The first thing to note are the #ifdef blocks on lines 870, 873, and 880. You should know that ENABLE_INDICATOR is defined, meaning we do not execute the code in lines 871 and 872. Instead, we do lines 874 to 879, and then we do 881. Why do we do 881 if it is after the #endif? That’s because we fell off the end of the #ifdef block, and then we do whatever is next, after we fall off, namely returning NULL.

Don’t worry, I don’t know what’s going on yet, either. Time for a refill!

Back? Great. Alright, valgrind says that we’re doing something funky with g_slist_prepend().

==13124==    by 0x74F7ABD: g_slist_prepend (gslist.c:265)

And our relevant code is:

 874                 GSList *dupes = NULL;
 875                 const char *path;
 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupe
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));
 880 #endif
 881                 return NULL;

We can see that we declare the pointer *dupes on line 874, but we don’t do anything with it. Then, we assign something to it on line 877. Then, we assign something to it again on line 879. Finally, we end up not doing anything with *dupes at all, and just return NULL on line 881.

This definitely seems weird and worth a second glance. At this point, I’m asking myself the following questions:

  • did g_object_get_data() allocate memory?
  • did g_slist_prepend() allocate memory?
  • are we overwriting *dupes on line 879? that might be a leak.
  • is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?

Let’s take them in order.

did g_object_get_data() allocate memory?
g_object_get_data has online documentation, so that’s our first stop. The documentation says:

Returns :
the data if found, or NULL if no such data exists. [transfer none]

Since I am not 100% familiar with glib terminology, I guess [transfer none] means that g_object_get_data() doesn’t actually allocate memory on its own. But let’s be 100% sure. Time to grab the glib source and find out for ourselves.

$ apt-get source libglib2.0-0
$ cd glib2.0-2.32.1
$ cscope -bqR
$ ctags -R
$ cscope -dp4
search for global definition of g_object_get_data

Pretty simple function.

3208 gpointer
3209 g_object_get_data (GObject     *object,
3210                    const gchar *key)
3211 {
3212   g_return_val_if_fail (G_IS_OBJECT (object), NULL);
3213   g_return_val_if_fail (key != NULL, NULL);
3215   return g_datalist_get_data (&object->qdata, key);
3216 }

Except I have no idea what g_datalist_get_data() does. Maybe that guy is allocating memory. Now I’ll use ctags to make my life easier. In vim, put your cursor over the “g” in “g_datalist_get_data” and then press control-]. This will “step into” the function. Magic!

 844 gpointer
 845 g_datalist_get_data (GData       **datalist,
 846                      const gchar *key)
 847 {
 848   gpointer res = NULL; 
 856   d = G_DATALIST_GET_POINTER (datalist);
 859       data = d->data;
 860       data_end = data + d->len;
 861       while (data < data_end)
 862         {
 863           if (strcmp (g_quark_to_string (data->key), key) == 0)
 864             {
 865               res = data->data;
 866               break;
 867             }
 868           data++;
 869         }
 874   return res;
 875 }

This is a pretty simple loop, walking through an existing list of pointers which have already been allocated somewhere else, starting on line 861. We do our comparison on line 863, and if we get a match, we assign whatever we found to res on line 865. Note that all we are doing here is a simple assignment. We are not allocating any memory!

Finally, we return our pointer on line 874. Press control-t in vim to pop back to your last location.

Now we know for sure that g_object_get_data() and g_datalist_get_data() do not allocate any memory at all, so there can be no possibility of a leak here. Let’s try the next function.

did g_slist_prepend() allocate memory?
First, read the documentation, which says:

The return value is the new start of the list, which may have changed, so make sure you store the new value.

This probably means it allocates memory for us, but let’s double-check just to be sure. Back to cscope!

 259 GSList*
 260 g_slist_prepend (GSList   *list,
 261                  gpointer  data)
 262 {
 263   GSList *new_list;
 265   new_list = _g_slist_alloc ();
 266   new_list->data = data;
 267   new_list->next = list;
 269   return new_list;
 270 }

Ah ha! Look at line 265. We are 100% definitely allocating memory, and returning it on line 269. Things are looking up! Let’s keep going with our questions.

are we overwriting *dupes on line 879? that might be a leak.

 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupe
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));

We’ve already proven to ourselves that line 877 doesn’t allocate any memory. It just sets dupes to some value. However, on line 879, we do allocate memory. It is equivalent to this code:

  int *dupes;
  dupes = 0x12345678;
  dupes = malloc(128);

So simply setting dupes to the return value of g_object_get_data() and later overwriting it with the return value of malloc() does not inherently cause a leak.

By way of counter-example, the below code is a memory leak:

  int *dupes;
  dupes = malloc(64);
  dupes = malloc(128);    /* leak! */

The above essentially illustrates the scenario I was worried about. I was worried that g_object_get_data() allocated memory, and then g_slist_prepend() also allocated memory which would have been a leak because the first value of dupes got scribbled over by the second value. My worry turned out to be incorrect, but that is the type of detective work you have to think about.

As a clearer example of why the above is a leak, consider the next snippet:

  int *dupes1, *dupes2;
  dupes1 = malloc(64);     /* ok */
  dupes2 = malloc(128);    /* ok */
  dupes1 = dupes2;         /* leak! */

First we allocate dupes1. Then allocate dupes2. Finally, we set dupes1 = dupes2, and now we have a leak. No one knows what the old value of dupes1 was, because we scribbled over it, and it is gone forever.

is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?
We can definitively say that it is not safe to return NULL without doing anything to dupes. We definitely allocated memory, stuck it into dupes, and then threw dupes away. This is our smoking gun.

Next time, we’ll see how to actually fix the problem.

linux slab poisoning 101

During development of my PCI core learns hotplug patch series, Vegard Nossum decided he wanted to subject them to kmemcheck, which is a really cool tool he’s been working on.

One of the things he found was a use-after-free bug, and as we all learned in Programming 101 (at least when 101 was taught in a manly language like C as opposed to a modern-day metro language like python who wears pants that are too tight and throws whiny emo temper tantrums aka exceptions), you really really shouldn’t be using a pointer after you free it, especially if you’re attempting to write a non-sucky operating system.

Since I’m not as smart as Vegard, I didn’t use kmemcheck, but luckily for simpletons like me, the Linux kernel comes with several other nifty built-in tools that get you pretty close, namely slab poisoning. Unfortunately, there aren’t tons of documents that explain how to actually interpret slab poisoning data, maybe because all the great coders come from other planets or Europe where they apparently teach this stuff in preschool along with naptime and “punching a girl is not an acceptable method of indicating to her that she has slightly fewer cooties than the other cootie-infested harlots”.

What follows is a very cursory introduction to reading some slab poisoning information that might get you started down the right path, but only if you’re smarter than me. Then again, I drool in my sleep so that’s a pretty low bar we’re talking about.

After you turn on slab debugging, reboot into your kernel, and reproduce the error, you’ll see some scary output that looks like this:

BUG kmalloc-1024: Poison overwritten

INFO: 0xf6788508-0xf6788508. First byte 0x6a instead of 0x6b
INFO: Allocated in alloc_pci_dev+0x12/0x1d age=8751 cpu=7 pid=4423
INFO: Freed in pci_release_dev+0x42/0x47 age=619 cpu=7 pid=35
INFO: Slab 0xc1b117a0 objects=30 used=4 fp=0xf6788430 flags=0x350020c3
INFO: Object 0xf6788430 @offset=1072 fp=0xf6788860

Bytes b4 0xf6788420:  f8 0a 00 00 f2 84 fb ff 5a 5a 5a 5a 5a 5a 5a 5a �..���ZZZZZZZZ
  Object 0xf6788430:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788440:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788450:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788460:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788470:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788480:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788490:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884a0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884b0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884c0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884d0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884e0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884f0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788500:  6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b kkkkkkkkjkkkkkkk
  Object 0xf6788510:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
 Redzone 0xf6788830:  bb bb bb bb                                     ����
 Padding 0xf6788858:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ
Pid: 4423, comm: bash Tainted: G        W  2.6.29-rc4 #2
Call Trace:
 [] print_trailer+0xe1/0xe9
 [] check_bytes_and_report+0x81/0xa4
 [] check_object+0xa4/0x18b
 [] __slab_alloc+0x3d6/0x4f0
 [] ? pci_find_next_bus+0x31/0x59
 [] kmem_cache_alloc+0x67/0xb2
 [] ? pci_add_new_bus+0x1f/0x10f
 [] ? pci_add_new_bus+0x1f/0x10f
 [] pci_add_new_bus+0x1f/0x10f
 [] pci_scan_bridge+0x127/0x3da
 [] pci_scan_child_bus+0x9a/0xd3
 [] pci_scan_bridge+0x150/0x3da
 [] pci_scan_child_bus+0x9a/0xd3
 [] pci_scan_bridge+0x150/0x3da
 [] pci_scan_child_bus+0x9a/0xd3
 [] pci_do_scan_bus+0x12/0x6a
 [] ? up_read+0x16/0x2c
 [] bus_rescan_store+0x41/0x57
 [] ? bus_rescan_store+0x0/0x57
 [] bus_attr_store+0x20/0x25
 [] sysfs_write_file+0xb9/0xe4
 [] ? sysfs_write_file+0x0/0xe4
 [] vfs_write+0x8a/0x12e
 [] sys_write+0x3b/0x60
 [] sysenter_do_call+0x12/0x31
FIX kmalloc-1024: Restoring 0xf6788508-0xf6788508=0x6b

Keep in mind that this is one of the world’s easiest-to-debug slab corruptions because that’s about what I can handle. Anything harder than this and I’ll be reading your blog entry.

Our first clue is this line:

INFO: 0xf6788508-0xf6788508. First byte 0x6a instead of 0x6b

and it’s an important one. It tells us that this is a use-after-free, because we were expecting 0x6b and we got something different. Now when slab debugging is turned on, every time you kfree an object, the allocator fills up the old memory with 0x6b. By way of contrast, uninitialized memory is filled with 0x5a bytes, so if you see that pattern show up in your debug output, you know that you’re using some pointer before you’ve initialized it. Bad penguin! No mackerel for you!

Now that we know that it’s a use-after-free bug, we can safely ignore the stack trace at the end, because that trace is telling us where we detected the bug, not where we created it. Mmmkay?

Ok, let’s try and figure out where our bug is.

The fact that the byte it’s complaining about is a 0x6a instead of a 0x6b means something got decremented. But… what? And that brings us to our next lines, which are:

INFO: Allocated in alloc_pci_dev+0x12/0x1d age=8751 cpu=7 pid=4423
INFO: Freed in pci_release_dev+0x42/0x47 age=619 cpu=7 pid=35

At this point, a little familiarity with your code is in order. Let’s see… what could possibly be allocated by alloc_pci_dev and freed in pci_release_dev. Hm…. what could it be…. what could it be….

Give up?

struct pci_dev *alloc_pci_dev(void)
        struct pci_dev *dev;

        dev = kzalloc(sizeof(struct pci_dev), GFP_KERNEL);
        if (!dev)
                return NULL;


        return dev;

It’s a … struct pci_dev *! Ta daaa!

So we’re touching a struct pci_dev * that we really shouldn’t be touching, and in fact, we’re decrementing something in that structure. Let’s go back our debug output:

INFO: 0xf6788508-0xf6788508. First byte 0x6a instead of 0x6b
  Object 0xf6788430:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk

Alrighty you pinball wizards, do the math! Me? I’m lazy, so I’ll just fire up gdb and have it tell me the answer:

(gdb) p 0xf6788508 - 0xf6788430
$1 = 216

Ah ha, we’re decrementing something that’s offset 216 bytes in our struct pci_dev *. Now at this point, you’re pretty much done, but calculating offsets by hand is a tedious task reserved for struggling undergraduates and possibly people who abuse small animals. I’ve already taken all my required tests and got my piece of parchment that says I paid too much for that hazy crazy 6-year party, so I get to cheat and use a tool that someone else has written.

Enter pahole.

The original intent of the tool was to help developers write tighter structures and fit them into cachelines (note to self, at this point, explain to current CS undergraduates that a cacheline has nothing to do with web 2.0 and has nothing to do with bling blau) but we can use it to quickly figure out what’s at byte 216.

You can run it on any object file that references a struct pci_dev

$ pahole drivers/pci/probe.o > probe.txt

Let’s peek inside our output file and see what we got…

struct pci_dev {
        struct list_head           bus_list;             /*     0     8 */
        struct pci_bus *           bus;                  /*     8     4 */
        struct pci_bus *           subordinate;          /*    12     4 */
        void *                     sysdata;              /*    16     4 */
        struct device              dev;                  /*    92   340 */

Ah, so somewhere inside a struct dev.

struct device {
        struct klist               klist_children;       /*     0    44 */
        struct klist_node          knode_parent;         /*    44    16 */
        struct klist_node          knode_driver;         /*    60    16 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        struct klist_node          knode_bus;            /*    76    16 */
        struct device *            parent;               /*    92     4 */
        struct kobject             kobj;                 /*    96    36 */
        /* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */
        char                       bus_id[20];           /*   132    20 */

Almost, but not quite there. 92 + 96 is 188, but 92 + 132 is 224, so we’re somewhere inside that struct kobject

struct kobject {
        const char  *              name;                 /*     0     4 */
        struct list_head           entry;                /*     4     8 */
        struct kobject *           parent;               /*    12     4 */
        struct kset *              kset;                 /*    16     4 */
        struct kobj_type *         ktype;                /*    20     4 */
        struct sysfs_dirent *      sd;                   /*    24     4 */
        struct kref                kref;                 /*    28     4 */

Bingo! 92 + 96 is the start of our struct kobject and right at offset 28 is a struct kref. 92 + 96 + 28 = 216.

Let’s see what’s in that struct kref:

struct kref {
        atomic_t                   refcount;             /*     0     4 */

And that matches up fo’ shizzle with what the slab poisoning info was telling us — a byte was getting decremented at offset 216. That byte happened to be a refcount, which makes perfect sense — someone is trying to decrement a refcount after this object was kfreed.

Now you know what to go look for. Someone is calling pci_dev_put (or similar) after pci_release_dev has already been called.

If you’re lucky, you were writing a patch that was messing with refcounts and you simply put one too many in. At least you know where to start sprinkling printks.

If you’re unlucky and your patch has nothing to do with pci_dev refcounts, well, maybe it’s time to go to confessional (yes it’s been a while, hasn’t it?), say 8 Hail Marys, 2 Our Fathers, grab a cup of coffee, and go fix some other idiot’s code (perhaps some idiot that has a Web 6.0 blog).

sneaky pci_dev refcounts

Normal friends and family, just skip this post. Trust me.

Recently, while doing some PCI slot symlink work, I noticed that we weren’t calling pci_release_dev() after taking a card offline (via sysfs). Well, at least not the first time, due to a leaked refcount in pci_get_dev_by_id(). gregkh fixed it, so that’s good, but I had a slightly difficult time tracking down the exact problem.

See, I had instrumented pci_dev_get() and pci_dev_put() to call dump_stack() every time the device that I cared about was touched:

--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -789,6 +789,11 @@ struct pci_dev *pci_dev_get(struct pci_dev *dev)
        if (dev)
+       if (dev && (dev->bus->number == 1) && (PCI_SLOT(dev->devfn) == 2)) {
+               dump_stack();
+               printk(KERN_INFO "dev refcount: %dn",
+                       atomic_read(&dev->dev.kobj.kref.refcount));
+       }
        return dev;

[mutatis mutandis for pci_dev_put()]

But I wasn’t getting all the information I needed. Other pieces of code were playing with that device’s refcount too. So I did this instead (thanks to willy for the hint!):

@@ -983,7 +984,20 @@ int device_register(struct device *dev)
 struct device *get_device(struct device *dev)
-       return dev ? to_dev(kobject_get(&dev->kobj)) : NULL;
+       if (!dev)
+               return NULL;
+       if (dev->bus == &pci_bus_type) {
+               struct pci_dev *pci_dev = to_pci_dev(dev);
+               if ((pci_dev->bus->number == 1) && (PCI_SLOT(pci_dev->devfn) == 2)) {
+                       dump_stack();
+                       printk(KERN_INFO "dev refcount: %dn",
+                              atomic_read(&pci_dev->dev.kobj.kref.refcount));
+               }
+       }
+       return to_dev(kobject_get(&dev->kobj));

Armed with that patch, this is the list of code paths (as of 2.6.27) that can affect a struct pci_dev’s refcount:

* pci_dev_get
* pci_dev_put
* device_add -> get_device
* device_add -> klist_add_tail -> ... -> get_device
* device_del -> put_device
* device_del -> klist_del -> ... -> put_device
* bus_attach_device -> klist_add_tail -> ... -> get_device
* bus_remove_device -> klist_del -> ... -> put_device
* acpi_platform_notify -> acpi_bind_one -> get_device

There are lots of code paths that can potentially directly call device_add/del so if you’re debugging that, you’ll need to figure out which ones affect you. I included that acpi_bind_one() as an example of a surprising place where we might directly call get_device().

The only surprising gotcha might be the calls to klist_add_tail/klist_del. Those calls will be non-obvious to grep for while debugging your device’s lifetime. So if you are wondering why your struct pci_dev isn’t calling its ->release() properly, just patch get_device() directly, and be prepared to stare at the output for a while. Good luck.

[Note that device_add bumps the refcount by 2, but then calls put_device() during cleanup, for a net gain of +1.]

my first git-bisect ™

Pull latest 2.6.25-rcN git tree to refresh my physical PCI slot patches, hack-hack-hack, go to test them on my test machine, and ruh roh, something’s hanging.

Try a clean tree without any of my patches and still get the hang. Hmm…

Not sure when this bug crept in, but for some stupid reason, I decide to start with 2.6.23. git-bisect happily tells me that I only have 11,000 more commits to go. Yikes.

A few hours later, we have a smoking gun: pciehp regression hang.

Oh, and btw, death be to those who introduce non-bisectable commits. Even if your code is 100% bug-free, you still make life difficult for those trying to debug others’ code. Bleah!

Time to relax by reading the Baroque Cycle.