linux slab poisoning 101

During development of my PCI core learns hotplug patch series, Vegard Nossum decided he wanted to subject them to kmemcheck, which is a really cool tool he’s been working on.

One of the things he found was a use-after-free bug, and as we all learned in Programming 101 (at least when 101 was taught in a manly language like C as opposed to a modern-day metro language like python who wears pants that are too tight and throws whiny emo temper tantrums aka exceptions), you really really shouldn’t be using a pointer after you free it, especially if you’re attempting to write a non-sucky operating system.

Since I’m not as smart as Vegard, I didn’t use kmemcheck, but luckily for simpletons like me, the Linux kernel comes with several other nifty built-in tools that get you pretty close, namely slab poisoning. Unfortunately, there aren’t tons of documents that explain how to actually interpret slab poisoning data, maybe because all the great coders come from other planets or Europe where they apparently teach this stuff in preschool along with naptime and “punching a girl is not an acceptable method of indicating to her that she has slightly fewer cooties than the other cootie-infested harlots”.

What follows is a very cursory introduction to reading some slab poisoning information that might get you started down the right path, but only if you’re smarter than me. Then again, I drool in my sleep so that’s a pretty low bar we’re talking about.

After you turn on slab debugging, reboot into your kernel, and reproduce the error, you’ll see some scary output that looks like this:

BUG kmalloc-1024: Poison overwritten
-----------------------------------------------------------------------------

INFO: 0xf6788508-0xf6788508. First byte 0x6a instead of 0x6b
INFO: Allocated in alloc_pci_dev+0x12/0x1d age=8751 cpu=7 pid=4423
INFO: Freed in pci_release_dev+0x42/0x47 age=619 cpu=7 pid=35
INFO: Slab 0xc1b117a0 objects=30 used=4 fp=0xf6788430 flags=0x350020c3
INFO: Object 0xf6788430 @offset=1072 fp=0xf6788860

Bytes b4 0xf6788420:  f8 0a 00 00 f2 84 fb ff 5a 5a 5a 5a 5a 5a 5a 5a �..���ZZZZZZZZ
  Object 0xf6788430:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788440:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788450:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788460:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788470:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788480:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788490:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884a0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884b0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884c0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884d0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884e0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf67884f0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf6788500:  6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b kkkkkkkkjkkkkkkk
  Object 0xf6788510:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
...
 Redzone 0xf6788830:  bb bb bb bb                                     ����
 Padding 0xf6788858:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ
Pid: 4423, comm: bash Tainted: G        W  2.6.29-rc4 #2
Call Trace:
 [<c0287211>] print_trailer+0xe1/0xe9
 [<c028729a>] check_bytes_and_report+0x81/0xa4
 [<c0287391>] check_object+0xa4/0x18b
 [<c0288843>] __slab_alloc+0x3d6/0x4f0
 [<c0354391>] ? pci_find_next_bus+0x31/0x59
 [<c02889c4>] kmem_cache_alloc+0x67/0xb2
 [<c05681f8>] ? pci_add_new_bus+0x1f/0x10f
 [<c05681f8>] ? pci_add_new_bus+0x1f/0x10f
 [<c05681f8>] pci_add_new_bus+0x1f/0x10f
 [<c056a163>] pci_scan_bridge+0x127/0x3da
 [<c056a4b0>] pci_scan_child_bus+0x9a/0xd3
 [<c056a18c>] pci_scan_bridge+0x150/0x3da
 [<c056a4b0>] pci_scan_child_bus+0x9a/0xd3
 [<c056a18c>] pci_scan_bridge+0x150/0x3da
 [<c056a4b0>] pci_scan_child_bus+0x9a/0xd3
 [<c056b9b7>] pci_do_scan_bus+0x12/0x6a
 [<c023819e>] ? up_read+0x16/0x2c
 [<c03544e5>] bus_rescan_store+0x41/0x57
 [<c03544a4>] ? bus_rescan_store+0x0/0x57
 [<c03d51a5>] bus_attr_store+0x20/0x25
 [<c02c9c6d>] sysfs_write_file+0xb9/0xe4
 [<c02c9bb4>] ? sysfs_write_file+0x0/0xe4
 [<c028bfcd>] vfs_write+0x8a/0x12e
 [<c028c10a>] sys_write+0x3b/0x60
 [<c0202e11>] sysenter_do_call+0x12/0x31
FIX kmalloc-1024: Restoring 0xf6788508-0xf6788508=0x6b

Keep in mind that this is one of the world’s easiest-to-debug slab corruptions because that’s about what I can handle. Anything harder than this and I’ll be reading your blog entry.

Our first clue is this line:

INFO: 0xf6788508-0xf6788508. First byte 0x6a instead of 0x6b

and it’s an important one. It tells us that this is a use-after-free, because we were expecting 0x6b and we got something different. Now when slab debugging is turned on, every time you kfree an object, the allocator fills up the old memory with 0x6b. By way of contrast, uninitialized memory is filled with 0x5a bytes, so if you see that pattern show up in your debug output, you know that you’re using some pointer before you’ve initialized it. Bad penguin! No mackerel for you!

Now that we know that it’s a use-after-free bug, we can safely ignore the stack trace at the end, because that trace is telling us where we detected the bug, not where we created it. Mmmkay?

Ok, let’s try and figure out where our bug is.

The fact that the byte it’s complaining about is a 0x6a instead of a 0x6b means something got decremented. But… what? And that brings us to our next lines, which are:

INFO: Allocated in alloc_pci_dev+0x12/0x1d age=8751 cpu=7 pid=4423
INFO: Freed in pci_release_dev+0x42/0x47 age=619 cpu=7 pid=35

At this point, a little familiarity with your code is in order. Let’s see… what could possibly be allocated by alloc_pci_dev and freed in pci_release_dev. Hm…. what could it be…. what could it be….

Give up?

struct pci_dev *alloc_pci_dev(void)
{
        struct pci_dev *dev;

        dev = kzalloc(sizeof(struct pci_dev), GFP_KERNEL);
        if (!dev)
                return NULL;

        INIT_LIST_HEAD(&dev->bus_list);

        return dev;
}

It’s a … struct pci_dev *! Ta daaa!

So we’re touching a struct pci_dev * that we really shouldn’t be touching, and in fact, we’re decrementing something in that structure. Let’s go back our debug output:

INFO: 0xf6788508-0xf6788508. First byte 0x6a instead of 0x6b
...
  Object 0xf6788430:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk

Alrighty you pinball wizards, do the math! Me? I’m lazy, so I’ll just fire up gdb and have it tell me the answer:

(gdb) p 0xf6788508 - 0xf6788430
$1 = 216

Ah ha, we’re decrementing something that’s offset 216 bytes in our struct pci_dev *. Now at this point, you’re pretty much done, but calculating offsets by hand is a tedious task reserved for struggling undergraduates and possibly people who abuse small animals. I’ve already taken all my required tests and got my piece of parchment that says I paid too much for that hazy crazy 6-year party, so I get to cheat and use a tool that someone else has written.

Enter pahole.

The original intent of the tool was to help developers write tighter structures and fit them into cachelines (note to self, at this point, explain to current CS undergraduates that a cacheline has nothing to do with web 2.0 and has nothing to do with bling blau) but we can use it to quickly figure out what’s at byte 216.

You can run it on any object file that references a struct pci_dev

$ pahole drivers/pci/probe.o > probe.txt

Let’s peek inside our output file and see what we got…

struct pci_dev {
        struct list_head           bus_list;             /*     0     8 */
        struct pci_bus *           bus;                  /*     8     4 */
        struct pci_bus *           subordinate;          /*    12     4 */
        void *                     sysdata;              /*    16     4 */
...
        struct device              dev;                  /*    92   340 */

Ah, so somewhere inside a struct dev.

struct device {
        struct klist               klist_children;       /*     0    44 */
        struct klist_node          knode_parent;         /*    44    16 */
        struct klist_node          knode_driver;         /*    60    16 */
        /* --- cacheline 1 boundary (64 bytes) was 12 bytes ago --- */
        struct klist_node          knode_bus;            /*    76    16 */
        struct device *            parent;               /*    92     4 */
        struct kobject             kobj;                 /*    96    36 */
        /* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */
        char                       bus_id[20];           /*   132    20 */

Almost, but not quite there. 92 + 96 is 188, but 92 + 132 is 224, so we’re somewhere inside that struct kobject

struct kobject {
        const char  *              name;                 /*     0     4 */
        struct list_head           entry;                /*     4     8 */
        struct kobject *           parent;               /*    12     4 */
        struct kset *              kset;                 /*    16     4 */
        struct kobj_type *         ktype;                /*    20     4 */
        struct sysfs_dirent *      sd;                   /*    24     4 */
        struct kref                kref;                 /*    28     4 */

Bingo! 92 + 96 is the start of our struct kobject and right at offset 28 is a struct kref. 92 + 96 + 28 = 216.

Let’s see what’s in that struct kref:

struct kref {
        atomic_t                   refcount;             /*     0     4 */

And that matches up fo’ shizzle with what the slab poisoning info was telling us — a byte was getting decremented at offset 216. That byte happened to be a refcount, which makes perfect sense — someone is trying to decrement a refcount after this object was kfreed.

Now you know what to go look for. Someone is calling pci_dev_put (or similar) after pci_release_dev has already been called.

If you’re lucky, you were writing a patch that was messing with refcounts and you simply put one too many in. At least you know where to start sprinkling printks.

If you’re unlucky and your patch has nothing to do with pci_dev refcounts, well, maybe it’s time to go to confessional (yes it’s been a while, hasn’t it?), say 8 Hail Marys, 2 Our Fathers, grab a cup of coffee, and go fix some other idiot’s code (perhaps some idiot that has a Web 6.0 blog).