Linux Kernel Debugging

By Kaiwan N Billimoria

About this book

The Linux kernel is at the very core of arguably the world’s best production-quality OS. Debugging it, though, can be a complex endeavor.

Linux Kernel Debugging is a comprehensive guide to learning all about advanced kernel debugging. This book covers many areas in-depth, such as instrumentation-based debugging techniques (printk and the dynamic debug framework), and shows you how to use Kprobes. Memory-related bugs tend to be a nightmare – two chapters are packed with tools and techniques devoted to debugging them. When the kernel gifts you an Oops, how exactly do you interpret it to be able to debug the underlying issue? We’ve got you covered. Concurrency tends to be an inherently complex topic, so a chapter on lock debugging will help you to learn precisely what data races are, including using KCSAN to detect them. Some thorny issues, both debug- and performance-wise, require detailed kernel-level tracing; you’ll learn to wield the impressive power of Ftrace and its frontends. You’ll also discover how to handle kernel lockups, hangs, and the dreaded kernel panic, as well as leverage the venerable GDB tool within the kernel (KGDB), along with much more.

By the end of this book, you will have at your disposal a wide range of powerful kernel debugging tools and techniques, along with a keen sense of when to use which.

Publication date: August 2022
Publisher: Packt
Pages: 638
ISBN: 9781801075039

 

1 A General Introduction to Debugging Software

Hello there! Welcome to our journey of learning how to debug a really sophisticated, large, and complex piece of software that’s proven absolutely critical to enterprise businesses, tiny embedded systems, and everything in between – the Linux kernel. To some extent, we’ll learn to debug the Linux user space ecosystem as well.

Let's begin this very first chapter, and our journey on kernel (and user mode) debugging, by first understanding a little more about what a bug really is, and the origins and myths of the term debugging. Next, a glimpse at some actual real-world software bugs will (hopefully) provide the required inspiration and motivation (firstly to avoid bugs, and then to find and fix your bugs, of course). You will be guided on how to set up an appropriate workspace to actually work on a custom kernel and debug issues, including setting up a full-fledged debug kernel. We’ll wrap up with some useful tips on debugging.

In this chapter we’re going to cover the following main topics:

  • Software debugging – what it is, origins, and myths
  • Software bugs – a few actual cases
  • Setting up the workspace
  • Debugging – a few tips
 

Technical requirements

You will require a modern and powerful desktop or laptop. We tend to use Ubuntu 20.04 LTS as the primary platform for this book. Ubuntu desktop specifies the recommended minimum system requirements (https://help.ubuntu.com/community/Installation/SystemRequirements) for the installation and usage of the distribution; do refer to it to verify that your system (even a guest) is up to it.

Cloning this book’s code repository

The complete source code for this book is freely available on GitHub at https://github.com/PacktPublishing/Linux-Kernel-Debugging. You can work on it by cloning the Git tree using the following command:

git clone https://github.com/PacktPublishing/Linux-Kernel-Debugging

The source code is organized chapter-wise. Each chapter is represented as a directory in the repository – for example, ch1/ has the source code for this chapter. A detailed description on installing a viable system is covered in the Setting up the workspace section.

 

Software debugging – what it is, origins, and myths

In the context of the software practitioner, a bug is a defect, an error, within the code. A key, and often large, part of our job as software developers is to hunt them down and fix them, so that, as far as is humanly possible, the software is defect-free and runs precisely as designed.

Of course, to fix a bug, you first have to find it. Indeed, with non-trivial bugs, it’s often the case that you aren’t even aware there is a bug (or several) until some event occurs to expose it! Shouldn’t we have a disciplined approach to finding bugs before shipping the product or project? Of course we do – it’s the Quality Assurance (QA) process, more commonly known as testing. Though glossed over at times, testing remains one of the – if not the – most important facets of the software lifecycle (would you voluntarily fly in a new aircraft that’s never been tested? Well, unless you’re the lucky test pilot...).

Okay, back to bugs; once a bug is identified (and filed), your job as a software developer is to identify what exactly is causing it, what the actual underlying root cause is. A large portion of this book is devoted to tools, techniques, and just thinking about how exactly to do this. Once the root cause is identified, and you have clearly understood the underlying issue, you can, in all probability, fix it. Yay!

This process of identifying a bug – using tools, techniques, and some hard thinking to figure out its root cause – and then fixing it, is subsumed into the word debugging. Without bothering to go into details, there’s a popular story regarding the origin of the word debugging: on a Tuesday at Harvard University (on Sept 9, 1947), Admiral Grace Hopper’s staff discovered a moth caught in a relay panel of a Mark II computer. As the system malfunctioned because of it, they removed the moth, thus de-bugging the system! Well, as it turns out: one, Admiral Hopper has herself stated that she didn’t coin the term; two, its origins seem to be rooted in aeronautics. Nevertheless, the term debugging has stuck. The following image shows the picture at the heart of this story – the unfortunate but famous moth that inadvertently caught itself in the system that then had to be de-bugged!

Figure 1.1 – The famous moth (By Courtesy of the Naval Surface Warfare Center, Dahlgren, VA., 1988. - U.S. Naval Historical Center Online Library Photograph NH 96566-KN. Public Domain, https://commons.wikimedia.org/w/index.php?curid=165211 )

Having understood what a bug and debugging basically are, let’s move on to something interesting and important – we'll briefly examine a few real-world cases where a software bug (or bugs) has been the cause of some unfortunate and tragic accidents.

 

Software bugs – a few actual cases

Using software to control electro-mechanical systems is not only common, it’s pretty much all-pervasive in today’s world. The unfortunate reality, though, is that software engineering is a relatively young field and that we humans are naturally prone to making mistakes; these factors can combine to create unfortunate accidents when software doesn’t execute in conformance with its design (in which case, of course, we call it buggy).

Several real-world examples of this occurring exist; we highlight a few of them in the following subsections. The brief synopsis given here is really just that – (too) brief; to truly understand the complex issues behind failures like this, you do need to take the trouble to study the technical crash (or failure) investigation reports in detail (do see the links in the Further reading section of this chapter). Here, I briefly mention and summarize these cases for two reasons: one, to underline the fact that software failure, even in large, heavily tested systems, can and does occur, and two, to motivate all of us involved in any part of the software life cycle to pay closer attention, to stop making assumptions, and to do a better job in designing, implementing, and testing the software we work on.

Patriot missile failure

During the Gulf War, the US had deployed a Patriot missile battery in Dhahran, Saudi Arabia. Its job was to track, intercept, and destroy incoming Iraqi Scud missiles. But, on 25 February 1991, a Patriot system failed to do so, causing the death of 28 soldiers and injury to about a hundred others. An investigation revealed that the problem’s root was at the heart of the software tracking system. Briefly, the system uptime was tracked as a monotonically increasing integer value. It was converted to a real – floating point – value by multiplying the integer by 1/10 (which is a recurring binary expression evaluating to 0.00011001100110011001100110011001100...). The trouble is, the computer used a 24-bit register for this conversion, resulting in the computation being truncated at 24 bits; this caused a loss of precision which only became significant when the time quantity was sufficiently large.

This was exactly the case that day; the Patriot system had been up for about 100 hours; thus, the loss of precision during the conversion translated to an error of approximately 0.34 seconds. Doesn’t sound like much, except that a Scud missile’s velocity is about 1,676 meters per second, thus resulting in a tracking error of about 570 meters. This was large enough for the Scud to be outside the Patriot tracking system’s range gate, and the missile was thus not detected.

In essence, this was a case of loss of precision during conversion from an integer value to a real (floating point) number value.
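The arithmetic behind those figures can be reproduced with a short sketch. It assumes the uptime counter ticks in tenths of a second and that 1/10 was held with 23 fractional bits; the latter is a commonly cited reading of the 24-bit register, not something stated above. The function names are purely illustrative.

```shell
#!/bin/sh
# Sketch of the Patriot clock-drift arithmetic.
# Assumption (not from the text): 0.1 is stored with 23 fractional bits.

# patriot_drift HOURS -> accumulated timing error in seconds (2 decimals)
patriot_drift() {
    LC_ALL=C awk -v hours="$1" 'BEGIN {
        ticks  = hours * 3600 * 10         # uptime counted in tenths of a second
        stored = int(0.1 * 2^23) / 2^23    # 0.1 chopped to 23 fractional bits
        printf "%.2f\n", ticks * (0.1 - stored)
    }'
}

# tracking_error SECONDS VELOCITY -> distance covered, in whole meters
tracking_error() {
    LC_ALL=C awk -v t="$1" -v v="$2" 'BEGIN { printf "%.0f\n", t * v }'
}

drift=$(patriot_drift 100)
echo "drift after 100 hours: ${drift} s"
echo "tracking error at 1676 m/s: $(tracking_error "$drift" 1676) m"
```

Under these assumptions, the sketch prints a drift of 0.34 s and a tracking error of 570 m, in line with the figures quoted above.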

ESA’s unmanned Ariane 5 rocket

On the morning of 4th June 1996, the European Space Agency’s (ESA’s) Ariane 5 unmanned rocket launcher took off from the Guiana Space Centre in French Guiana, on the South American coast. A mere forty seconds into its flight, the rocket lost control and exploded. The final investigation report revealed that the primary cause ultimately came down to a software overflow error.

It’s more complex than that; a brief summary of the chain of events leading to the loss of the rocket follows. (In most cases like this, it’s not one single event that causes an accident; rather, it’s a chain of several events.) The overflow error occurred during the execution of code converting a 64-bit floating point value to a 16-bit signed integer; the unprotected conversion gave rise to an exception (Operand Error; the programming language was Ada). This, in turn, occurred due to a much higher than expected value of an internal variable (BH – Horizontal Bias). The exception caused the shutdown of the Inertial Reference Systems (SRIs); this caused the primary onboard computer (OBC) to send erroneous commands to the nozzle deflectors, resulting in full nozzle deflection of the boosters and the main Vulcain engine, which caused the rocket to veer dramatically off its flight path.

The irony is that the SRIs were, by default, not even supposed to function after launch; but, to allow for a delay in the launch window, the design specified that they remain active for 50 seconds after launch! An interesting analysis of why this software exception wasn’t caught during development and testing (https://archive.eiffel.com/doc/manuals/technology/contract/ariane/) boils down to concluding that the fault lies in a reuse error:

“The SRI horizontal bias module was reused from a 10-year-old software, the software from Ariane 4.”

Mars Pathfinder reset issue

On July 4, 1997, NASA’s Pathfinder lander touched down on the surface of Mars and proceeded to deploy its smaller robot cousin – the Sojourner rover, the very first wheeled device to embark upon another planet! The lander suffered from periodic reboots; the problem was ultimately diagnosed as being a classic case of Priority Inversion – a situation where a high priority task is made to wait for lower priority tasks. As such, this by itself may not cause an issue; the trouble is that the high priority task was left off CPU long enough for the watchdog timer to expire, causing the system to reboot.

An irony here was that there exists a well-known solution – enabling the priority inheritance feature of the semaphore object (this temporarily raises the priority of the task holding the semaphore lock to that of the highest-priority task waiting on it – for the duration of its holding the lock – thus enabling it to complete its critical section and release the lock quickly, preventing starvation of higher priority tasks). The VxWorks RTOS defaulted to having it off and the Jet Propulsion Laboratory (JPL) team left it that way. Because they allowed the robot to continuously stream telemetry debug data to Earth, they were able to correctly determine this root cause and thus fix it by enabling priority inheritance. There's an important lesson here; as the team lead Glenn Reeves put it:

“we test what we fly and we fly what we test”

I’d venture that these articles (see the Further reading section) are a must-read for any systems software developer!

The Boeing 737 MAX aircraft – the MCAS and lack of training to the flight crew

Two unfortunate accidents, taking 346 lives in all, put the Boeing 737 MAX under the spotlight: the crash of Lion Air Flight 610 from Jakarta into the Java Sea (29 Oct 2018) and the crash of Ethiopian Airlines Flight 302, en route from Addis Ababa to Nairobi (10 Mar 2019). These incidents occurred just 13 and 6 minutes after take-off, respectively.

Of course, the situation is complex; at one level, this is what has likely caused these accidents: once Boeing determined that the aerodynamic characteristics of the 737 MAX left something to be desired, they worked on fixing it via a hardware approach. When that did not suffice, engineers came up with (what seemed) an elegant and relatively simple software fix, christened the Maneuvering Characteristics Augmentation System (MCAS). Two sensors on the nose continually measure the aircraft’s angle of attack (AoA); an AoA that is determined to be too high typically indicates an impending stall (dangerous!), so the MCAS kicks in, moving control surfaces on the tail, causing the nose to go down and stabilizing the aircraft. But: for whatever reasons, the MCAS was designed to use only one of the sensors; if that sensor failed, the MCAS could automatically activate, causing the nose to go down and the aircraft to lose altitude; this is what seems to have actually occurred in both crashes.

Further, many pilot crews weren’t explicitly trained on managing the MCAS (some claimed they weren’t even aware of it!). The pilots of the luckless flights apparently did not manage to override the MCAS, even when no actual stall occurred.

Other cases

A few other examples of such cases are as follows:

  • June 2002, Fort Drum: a US Army report maintained that a software issue contributed to the death of two soldiers. This incident occurred when they were training to fire artillery shells; apparently, unless the target altitude is explicitly entered into the system, the software assumes a default of zero. Fort Drum is apparently 679 feet above sea level (ASL).
  • In November 2001, a British engineer, John Locker, noticed that he could easily intercept American military satellite feeds, live imagery of US spy planes over the Balkans. The almost unbelievable reason – the stream was being transmitted unencrypted, enabling pretty much anyone in Europe with a regular satellite TV receiver to see it! In today’s context, many IoT devices have similar issues...
  • Jack Ganssle, a veteran and widely known embedded systems developer and author, brings out the excellent TEM – The Embedded Muse – newsletter bi-monthly; every issue has a section entitled Failure of the Week, typically highlighting a hardware and/or software failure; do check it out!
  • Read the web page on Software Horror Stories here (http://www.cs.tau.ac.il/~nachumd/horror.html); though old, it provides many examples of software gone wrong with, at times, tragic consequences.

Again, if interested in digging deeper, I urge you to read the detailed official reports on these accidents and faults; the Further reading section has several relevant links.

By now, you should be itching to begin debugging on Linux! Let's do just that – begin – by first setting up the workspace.

 

Setting up the workspace

Firstly, you’ll have to decide whether to run your test Linux system as a native system (on the bare metal) or as a guest OS; we cover the factors that will help you decide. Next, we (briefly) cover the installation of some software (the guest additions) for the case where you use a guest Linux OS, followed by the required software packages to install.

Running Linux as a native or guest OS?

Ideally, you should run a recent Linux distribution (Ubuntu, Fedora, and so on) on native hardware. We tend to use Ubuntu 20.04 LTS in this book as the primary system to experiment upon. The more powerful your system – in terms of RAM, processing power and disk space – the better! Of course, as we shall be debugging at the level of the kernel, crashes and even data loss (chances of the latter are small, but nevertheless...) can occur; hence, the system should be a test one with no valuable data on it.

If running Linux on native hardware – on the bare metal, as it were - isn’t feasible for you, then a practical and convenient alternative is to install and use the Linux distribution as a guest OS on a Virtual Machine (VM). It's important to install a recent Linux distribution.

Running a Linux guest as a VM is certainly feasible but (there’s always a but, isn’t there!), it will almost certainly feel a lot slower than running Linux natively. Still, if you must run a Linux guest, it certainly works; it goes without saying – the more powerful your host system, the better the experience. There’s also an arguable advantage to running your test system as a guest OS: even if it does crash (please do expect that to happen, especially with the deliberate (de)bugging we’ll do with this book!), you don’t even need to reboot the hardware; merely reset the guest from the hypervisor software running it (typically Oracle VirtualBox).

Alternate hardware – using Raspberry Pi (and other) ARM-based systems

Though we specified that you can run a recent Linux distro either as a native system or as a guest VM, the assumption was that it’s an x86_64 system. While that suffices, to get more out of the experience of this book (and simply to have more fun), I highly recommend you also try out the sample code and run the (buggy) test cases on alternate architectures. With many, if not most, modern embedded Linux systems being ARM based (on both 32-bit ARM and 64-bit AArch64 processors), the Raspberry Pi hardware is extremely popular, relatively cheap, and has tremendous community support, making it an ideal test bed. I do use it every now and then within this book, in the chapters that follow; I’d recommend you do the same!

All the details – installation, setup, and so on – are amply covered in the Raspberry Pi documentation pages here: https://www.raspberrypi.org/documentation/.

Ditto for another popular embedded prototyping board - TI’s BeagleBone Black (affectionately, the BBB). This site is a good place to get started with the BBB: https://beagleboard.org/black.

Running Linux as a guest OS

If you do decide to run Linux as a guest, I’d recommend using Oracle VirtualBox 6.x (or the latest stable version) as a comprehensive and powerful all-in-one GUI hypervisor application appropriate for a desktop PC or laptop. Other virtualization software, such as VMware Workstation or QEMU, should also be fine. VirtualBox and QEMU are freely available and open source; VMware Workstation is a proprietary product. It's just that the code for this book has been tested on Oracle VirtualBox 6.1. Oracle VirtualBox is considered Open Source Software (OSS) and is licensed under the GPL v2 (the same as the Linux kernel). You can download it from https://www.virtualbox.org/wiki/Downloads. Its documentation can be found here: https://www.virtualbox.org/wiki/End-user_documentation.

The host system should be either MS Windows 10 or later (of course, even Windows 7 will work), a recent Linux distribution (for example, Ubuntu or Fedora), or macOS.

The guest (or native) Linux distribution can be any sufficiently recent one. For the purpose of following along the material and examples presented in this book, I’d recommend installing Ubuntu 20.04 LTS. This is what I primarily use for the book.

How can you quickly check which Linux distribution is currently installed and running?

On Debian/Ubuntu, the lsb_release -a command should do the trick; for example, on my guest Linux:

$ lsb_release -a 2> /dev/null
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:       20.04
Codename:      focal
$
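On distributions where lsb_release isn't installed, the /etc/os-release file usually carries the same information. The following is a minimal sketch, assuming that file is present and has a PRETTY_NAME field; the distro_name helper is a made-up name for illustration.

```shell
#!/bin/sh
# distro_name [FILE] -> print the PRETTY_NAME value from an os-release
# style file. FILE defaults to /etc/os-release (assumed to exist).
distro_name() {
    file="${1:-/etc/os-release}"
    [ -r "$file" ] || { echo "unknown"; return 1; }
    # Print the value of PRETTY_NAME, dropping the optional double quotes.
    sed -n 's/^PRETTY_NAME="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "$file"
}

distro_name || true
```

The file is parsed with sed rather than sourced, so nothing inside it gets executed by the shell.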

How can one check if the Linux currently running is on native hardware or is a guest VM (or a container)? There are many ways to do so; the script virt-what is one (we will be installing it); other commands include hostnamectl(1), dmidecode(8) (on x86), systemd-detect-virt(1) (if systemd is the initialization framework), lshw(1) (x86, IA-64, PPC), raw ways via dmesg(1) (grepping for Hypervisor detected), and via /proc/cpuinfo.
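As a rough sketch of two of these checks: on x86, a guest typically sees a hypervisor flag in /proc/cpuinfo, and systemd-detect-virt (where available) names the virtualization technology. The has_hypervisor_flag helper below is an illustrative name, and the cpuinfo check is only a heuristic, not a definitive test.

```shell
#!/bin/sh
# has_hypervisor_flag [FILE] -> succeeds if an x86 cpuinfo-style file lists
# the "hypervisor" CPU flag. FILE defaults to /proc/cpuinfo.
has_hypervisor_flag() {
    grep -Eq '^flags[[:space:]]*:.*[[:space:]]hypervisor([[:space:]]|$)' \
        "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_hypervisor_flag; then
    echo "hypervisor flag set: probably a guest VM"
else
    echo "no hypervisor flag: probably native (or a non-x86 system)"
fi

# Where systemd is in use, this prints the detected technology, or "none".
if command -v systemd-detect-virt >/dev/null 2>&1; then
    systemd-detect-virt || true
fi
```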

In this book, I shall prefer to focus on setting up what is key from a kernel (and user) debug perspective; hence, we won’t discuss the in-depth details on installing a guest VM (typically on a Windows host running Oracle VirtualBox) here. If you require some help on this, please refer to the many links to tutorials on precisely this within the Further reading section for this chapter. (FYI, these details, and a lot more, are amply covered in my previous book Linux Kernel Programming, Chapter 1, Kernel Workspace Setup).

Tip – using prebuilt VirtualBox images

The OSBoxes project allows you to freely download and use prebuilt VirtualBox (as well as VMware) images for popular Linux distributions. See their site here: https://www.osboxes.org/virtualbox-images/.

In our case, you can download a prebuilt x86_64 Ubuntu 20.04.3 (as well as other) Linux image here: https://www.osboxes.org/ubuntu/. It comes with the guest additions preinstalled. The default username/password is osboxes/osboxes.org.

(Of course, more advanced readers will realize it’s really up to you; running an as-light-as-possible Linux on a QEMU (emulated) standard PC is one choice here.)

Note that if your Linux system is installed natively on the hardware platform, or you’re using an OSBoxes Linux with the VirtualBox guest additions preinstalled, or you’re using a QEMU-emulated PC, simply skip the next section.

Installing the Oracle VirtualBox guest additions

The guest additions are essentially software (para-virtualization accelerators) that quite dramatically enhances the performance, as well as the look and feel, of running a guest OS on the host system; hence, it’s important to have them installed. (Besides acceleration, the guest additions provide conveniences like the ability to nicely scale the GUI window, and sharing facilities such as folders, the clipboard, and drag and drop between the host and the guest.)

Before doing this, though, please ensure you have already installed the guest VM, as mentioned previously. (Also, the first time you log in, the system will likely prompt you to update and possibly restart; please do so.) Then, follow along:

  1. Log in to your Linux guest VM (I’m using the login name letsdebug; guess why!) and first run the following commands within a Terminal window (on a shell):

    sudo apt update
    sudo apt upgrade
    sudo apt install build-essential dkms linux-headers-$(uname -r) ssh -y

    (Ensure you run each command above on one line).

  2. Install the Oracle VirtualBox Guest Additions now. Refer to How to Install VirtualBox Guest Additions in Ubuntu: https://www.tecmint.com/install-virtualbox-guest-additions-in-ubuntu/
  3. On Oracle VirtualBox, to ensure that you have access to any shared folders you might have set up, you need to set the guest account to belong to the vboxsf group; you can do so like this (you’ll need to log in again, or sometimes even reboot, to have this take effect):

    sudo usermod -G vboxsf -a ${USER}

The commands in step 1, after updating, have us install the build-essential package along with a couple of others; this ensures that the compiler (gcc), make, and other essential build utility programs are installed so that the Oracle VirtualBox Guest Additions can be properly installed straight after (in step 2).
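To confirm that the group change from step 3 has taken effect in your current session, a check along the following lines may help. It's a minimal sketch; the in_group helper is a hypothetical name, not part of VirtualBox or this book's repository.

```shell
#!/bin/sh
# in_group GROUP [USER] -> succeeds if USER (default: the current session)
# is a member of GROUP.
in_group() {
    id -nG ${2:+"$2"} 2>/dev/null | tr ' ' '\n' | grep -Fxq -- "$1"
}

if in_group vboxsf; then
    echo "vboxsf membership is active in this session"
else
    echo "not in vboxsf yet: run the usermod command, then log in again"
fi
```

Called without a user name, id reports the groups of the running session, which is why a fresh login is needed before the check succeeds.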

Installing required software packages

To install the required software packages, perform the following steps (do note that, here, we assume the Linux distribution is our preferred one, Ubuntu 20.04 LTS):

  1. Within your Linux system (be it a native one or a guest OS), first do the following:

    sudo apt update

    Now, to install the remaining required packages for the kernel build, run the following command in a single line:

    sudo apt install bison flex libncurses5-dev ncurses-dev xz-utils libssl-dev libelf-dev util-linux tar -y

    (The -y option switch has apt-get(8) assume a yes answer to all prompts; careful though, this could be dangerous in other circumstances).

  2. To install packages required for work we’ll do in other parts of this book, run the following command in a single line:

    sudo apt install bc bpfcc-tools bsdmainutils clang cmake cppcheck cscope curl \
    dwarves exuberant-ctags fakeroot flawfinder git gnome-system-monitor gnuplot \
    hwloc indent kernelshark libnuma-dev libjson-c-dev linux-tools-$(uname -r) \
    net-tools numactl openjdk-16-jre openssh-server perf-tools-unstable psmisc \
    python3-distutils rt-tests smem sparse stress sysfsutils tldr-py trace-cmd \
    tree tuna virt-what -y

Tip - a script to auto-install required packages

To make the (immediately above-mentioned) package install task simpler, you can make use of a simple bash script that’s part of the GitHub repo for this book: pkg_install4ubuntu_lkp.sh. It’s been tested on an x86_64 osboxes Ubuntu 20.04.3 LTS VM (running on Oracle VirtualBox 6.1).
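Once the install completes, a quick sanity check is to verify that a few of the commands those packages provide are now on your PATH. The sketch below uses an illustrative subset of tools, and the missing_tools helper is a made-up name; it is not the script from the book's repository.

```shell
#!/bin/sh
# missing_tools TOOL... -> print each TOOL that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# An illustrative subset of commands provided by the packages listed above.
missing=$(missing_tools gcc make bison flex bc git cscope trace-cmd)
if [ -n "$missing" ]; then
    echo "still missing:" $missing
else
    echo "all checked tools are available"
fi
```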

Great; now that the required packages are all installed, let’s proceed with understanding the next portion of our workspace setup – the need for two kernels!

 

A tale of two kernels

When working on a project or product, there obviously will be a Linux kernel that will be deployed as part of the overall system.

Information box

A quick aside: a working Linux system minimally requires a bootloader, a kernel, and root filesystem images.

This system that’s deployed to the outside world is in general termed the production system and the kernel as the production kernel, as, of course, it’s the one that runs while it’s being used in the field (or on-premise, or at the customer location). Here, we'll limit our discussion to the kernel only. The configuration, build, test, debug, and deployment of the production kernel is, no doubt, a key part of the overall project.

Do note though, in many systems (especially the enterprise-class ones), the production kernel is often simply the default kernel that’s supplied by the vendor (Red Hat, SUSE, Canonical, or others). On most embedded Linux projects and products, this is likely not the case: the platform (or Board Support Package (BSP)) team or a vendor will select a base mainline kernel (typically from kernel.org) and work on it; this can include enhancements, careful configuration, and deployment of the custom-built production kernel.

For the purpose of our discussion, let's assume that we need to configure and build a custom kernel.

A production and a debug kernel

However (and especially after having read the earlier section Software bugs – a few actual cases), you will realize that there’s always the off-chance that even the kernel – more likely the code you and your team added to it (the kernel modules, drivers, interfacing components) – has hidden faults, bugs. With a view to catching them before the system hits the field, thorough testing / QA is of prime importance!

Now, the issue is this: unless certain deeper checks are enabled within the kernel itself, it’s entirely possible that such bugs can escape your test cases. So, why not simply enable them? Well, one, these deeper checks are typically switched off by default in the production kernel’s configuration. Two, when turned on, they do result in performance degradation, at times quite significant.

So, where does that leave us? Simple, really: you should plan on working with at least two, and possibly three, kernels:

  • One, a carefully tuned production kernel, geared towards efficiency, security, and performance
  • Two, a carefully configured debug kernel, geared towards catching pretty much all kinds of bugs! Performance is not a concern here, catching bugs is
  • Three (optional, case-by-case): a kernel with one or more very specific debug config options enabled and the rest turned off

The second one, the so-called debug kernel, is configured in such a way that all required or recommended debug options are turned on, enabling you to (hopefully!) catch those hidden bugs. Of course, performance might suffer as a result, but that’s okay; catching – and subsequently fixing – kernel-level bugs is well worth it. Indeed, in general, during development and (unit) testing, performance isn’t paramount; catching and fixing deeply hidden bugs is!

The debug kernel is only used during development, test, and very possibly later, when bugs do actually surface. How exactly it’s used later is something we shall certainly cover in the course of this book.
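To see where a given kernel stands, you can look up individual options in its config file. The sketch below assumes the config is saved as /boot/config-$(uname -r), as it is on Ubuntu. The four options shown are real kernel config symbols, but they're only an illustrative selection and not this book's recommended list; config_enabled is a hypothetical helper name.

```shell
#!/bin/sh
# config_enabled OPTION [FILE] -> succeeds if OPTION is set to y or m in a
# kernel config file. FILE defaults to the running kernel's config in /boot.
config_enabled() {
    grep -Eq "^$1=(y|m)\$" "${2:-/boot/config-$(uname -r)}" 2>/dev/null
}

# Illustrative debug-related options only.
for opt in CONFIG_DEBUG_KERNEL CONFIG_DEBUG_INFO CONFIG_KASAN CONFIG_PROVE_LOCKING; do
    if config_enabled "$opt"; then
        echo "$opt: on"
    else
        echo "$opt: off (or config file not found)"
    fi
done
```

On a typical distribution (production) kernel, you'd expect the heavier checks here to show up as off, which is precisely the gap a debug kernel is meant to fill.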

Also, this point is key: it usually is the case that the mainline (or vanilla) kernel that your custom kernel is based upon is working fine; the bugs are generally introduced via custom enhancements and kernel modules. As you will know, we typically leverage the kernel’s Loadable Kernel Module (LKM) framework to build custom kernel code – the most common being device drivers; it can also be anything else: custom network filters / firewall, a new filesystem or I/O scheduler. These are out-of-tree kernel components (typically some .ko files) that become part of the root filesystem (they’re usually installed into /lib/modules/$(uname -r)). The debug kernel will certainly help catch bugs in your kernel modules as their test cases are executed.

The third kernel option – an in-between of the first two - is optional of course; from a practical real-world point of view, it may be exactly what’s required on a given setup. With certain kernel debug systems turned on, to catch specific types of bugs that you’re hunting (or anticipate) and the rest turned off, it can become a pragmatic way to debug even a production system, keeping performance high enough.

For practical reasons, in this book, we’ll configure, build and make use of the first two kernels – a custom production one and a custom debug one, only; the third option is yours to configure as you gain experience on both the kernel debug features and tools as well as your particular product or project.

Which kernel release to use?

A key topic: the Linux kernel project is often touted as the most successful open source project ever, with literally hundreds of releases and a release cadence that’s truly phenomenal for such an enormous project (it averages a new release every 6 to 8 weeks!). Among them, which one should we use (as a starting point, at least)?

It’s really important to use the latest stable kernel version, as it will include all the latest performance and security fixes. Not just that, the kernel community has different release types, which determine how long a given kernel release will be maintained (bug and security fixes applied, as they become known). For typical projects or products, selecting the latest Long Term Stable (LTS) kernel release thus makes the best sense. Of course, as already mentioned, on many projects – typically the server / enterprise class ones – the vendor (Red Hat, SUSE, and others) might well supply the production kernel to be used; here, for the purpose of our learning, we’ll start from scratch, configuring and building a custom Linux kernel ourselves (as is often the case on embedded projects).

As of this writing, the latest LTS Linux kernel is 5.10 (particularly, version 5.10.60); I shall use this kernel throughout this book. (You will realize that by the time you’re reading this, it’s entirely possible, in fact very likely, that the latest LTS kernel has evolved to a newer version). Besides, a key point in our favor – the 5.10 LTS kernel will be supported by the community until December 2026, thus keeping it relevant and valid for a pretty long time!

So, great, let's get to configuring and building both our custom production and debug kernels! We’ll begin with the production one.

Setting up our custom production kernel

(Here, I shall have to assume you are familiar with the general procedure involved in building a Linux kernel from source: obtaining the kernel source tree, configuring, and building it. In case you’d like to brush up on this, the Linux Kernel Programming – Part 1 book covers this in a lot of detail. As well, do refer to the tutorials and links in the Further reading section of this chapter.)

Though this is meant to be our production kernel, we’ll begin with a rather simplistic default that’s based on the existing system (this approach is sometimes called the tuned kernel config, obtained via the localmodconfig target; FYI, this, and a lot more, is covered in depth in the Linux Kernel Programming – Part 1 book). Once we’ve got a reasonable starting point, we’ll further tune the kernel for security. Let’s begin by performing some base configuration:

  1. Create a new directory in which to work on the upcoming production kernel:

    mkdir -p ~/lkd_kernels/productionk

    Bring in the kernel source tree of your choice. Here, as mentioned in Which kernel release to use? , we shall use the latest (at the time of writing this) LTS Linux kernel, version 5.10.60:

    cd ~/lkd_kernels
    wget https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.10.60.tar.xz

    Notice that here we have simply used the wget(1) utility to bring in the compressed kernel source tree; there are several alternate ways (including using git(1)).

    Note

    As you’ll know, the number in parentheses following the command name – for example, wget(1) – is the section within the manual or man pages where documentation on this command can be found.

  2. Extract the kernel source tree:

    tar xf linux-5.10.60.tar.xz --directory=productionk/
  3. Switch to the directory it’s just been extracted into (using cd productionk/linux-5.10.60) and briefly verify the kernel version information as shown in the following screenshot:

    Figure 1.2 – Screenshot of the LTS kernel source tree

    Every kernel version is christened with a (rather exotic) name; our 5.10.60 LTS kernel has an appropriately nice name (Dare mighty things), don’t you think?

  4. Configure for appropriate defaults. This is what you can do to obtain a decent, tuned starting point for kernel config based on the current config:

    lsmod > /tmp/lsmod.now
    make LSMOD=/tmp/lsmod.now localmodconfig

    Note

    The preceding command might interactively ask you to specify some choices; just selecting the defaults (by pressing the Enter key) is fine for now. The end result is the kernel configuration file saved as .config in the root of the kernel source tree (the current directory).

    We back up the config file as follows:

    cp -af .config ~/lkd_kernels/kconfig_prod01

Tip

You can always do make help to see the various options (including config) available to you.
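The version check from step 3 can also be scripted. What follows is only a minimal sketch, assuming the VERSION, PATCHLEVEL, SUBLEVEL, and EXTRAVERSION assignments sit in the top-level Makefile of the kernel source tree; the function name kernel_version is my own, hypothetical helper, not something the kernel provides:

```shell
#!/bin/sh
# kernel_version (hypothetical helper): print VERSION.PATCHLEVEL.SUBLEVEL plus
# EXTRAVERSION as recorded in the top-level Makefile of a kernel source tree.
# Usage: kernel_version [path/to/Makefile]   (defaults to ./Makefile)
kernel_version() {
    awk -F ' *= *' '
        # The first assignment of each variable wins; later ones are ignored.
        $1 == "VERSION"      && v == "" { v = $2 }
        $1 == "PATCHLEVEL"   && p == "" { p = $2 }
        $1 == "SUBLEVEL"     && s == "" { s = $2 }
        $1 == "EXTRAVERSION" && e == "" { e = $2 }
        END { printf "%s.%s.%s%s\n", v, p, s, e }
    ' "${1:-Makefile}"
}
```

Run from the root of the extracted tree, kernel_version should report the release you downloaded. The kernel’s own make kernelversion target gives you the same information without any helper.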

Before jumping into building our production kernel, it’s really important to consider the security aspect. Let’s first configure our kernel to be more secure, hardened.

Securing your production kernel

With security being a major concern, the modern Linux kernel has many security and kernel hardening features. The thing is, there always tends to be a trade-off between security and convenience/performance. Thus, many of these hardening features are off by default; several are designed as an opt-in system: if you want it, turn it on by selecting it from the kernel config menu (via the familiar make menuconfig UI). This makes sense to do, especially on a production kernel.

The question is: how will I know exactly which security-related config features to turn on or off? There’s literature on this and, better, some utility scripts that examine your existing kernel config and can make recommendations based on existing state-of-the-art security best practice! One such tool is Alexander Popov’s kconfig-hardened-check Python script (https://github.com/a13xp0p0v/kconfig-hardened-check). Here’s a screenshot of a portion of its output, when I ran it against my custom kernel configuration file:

Figure 1.3 – Partial screenshot – truncated output from the kconfig-hardened-check script

(We won’t be attempting to go into details regarding the useful kconfig-hardened-check script here, as it’s beyond this book’s scope; do look up the GitHub link provided to see details.) Having followed most of the recommendations from this script, I generated a kernel config file:

$ ls -l .config
-rw-rw-r-- 1 letsdebug letsdebug 156646 Aug 19 13:02 .config
$ 

Note

My kernel config file for the production kernel can be found on the book’s GitHub code repo here: https://github.com/PacktPublishing/Linux-Kernel-Debugging/blob/main/ch1/kconfig_prod01. (FYI, our custom debug kernel config file, which we’ll be generating in the following section, can be found within the same folder as well.)

Now that we have appropriately configured our custom production kernel, let’s build it; the following commands should do the trick (with nproc(1) helping us with the number of CPU cores onboard):

$ nproc
4
$ make -j8
[ ... ]
BUILD   arch/x86/boot/bzImage
Kernel: arch/x86/boot/bzImage is ready  (#1)
$ 
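The job count passed to make above (8, on a box with 4 cores) need not be hard-coded. Here’s a small sketch that derives it as twice the number of CPU cores; the function name build_jobs is my own, hypothetical one, and the 2x multiplier is just the rule of thumb used in this chapter, not a hard requirement:

```shell
#!/bin/sh
# build_jobs (hypothetical helper): suggest a value for make's -j switch,
# computed as twice the CPU core count.
# Usage: build_jobs [cores]   (cores defaults to the output of nproc)
build_jobs() {
    cores="${1:-$(nproc)}"
    echo $(( cores * 2 ))
}
```

With this in place, the build can be invoked as make -j"$(build_jobs)"; on a memory-constrained system you may well prefer a smaller value.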

Information box

If you’re working on a typical embedded project, you will need to install a toolchain and cross compile the kernel. Also, you’d normally set the environment variable ARCH to the machine type (for example, ARCH=arm64) and the environment variable CROSS_COMPILE to the cross compiler prefix (for example, CROSS_COMPILE=aarch64-none-linux-gnu-). Your typical embedded Linux builder systems – Yocto and Buildroot being very common – pretty much automatically take care of this.

As you can see, as a rule of thumb, we set the number of jobs to execute (via make’s -j option switch) to twice the number of CPU cores available. The build should complete in a few minutes; once done, let’s check that the compressed and uncompressed kernel image files have been generated:

$ ls -lh arch/x86/boot/bzImage vmlinux
-rw-rw-r-- 1 letsdebug letsdebug 9.1M Aug 19 17:21 arch/x86/boot/bzImage
-rwxrwxr-x 1 letsdebug letsdebug  65M Aug 19 17:21 vmlinux*
$ 

Note that it’s always only the first one, bzImage – the compressed kernel image – that we shall boot from. Then what’s the second image, vmlinux, for? Very relevant here: it’s what we shall (later) often require when we need to debug the kernel! It’s the one that holds all the symbolic information, after all. Our production kernel config will typically cause several kernel modules (LKMs) to be generated within the kernel source tree. They have to be installed into a well-known location (/lib/modules/$(uname -r)); this is achieved by doing, as root:

$ sudo make modules_install
[sudo] password for letsdebug: xxxxxxxxxxxxxxxx 
  INSTALL arch/x86/crypto/aesni-intel.ko
  INSTALL arch/x86/crypto/crc32-pclmul.ko
[ ... ]
   DEPMOD  5.10.60-prod01
$ ls /lib/modules/
5.10.60-prod01/  5.11.0-27-generic/  5.8.0-43-generic/
$ ls /lib/modules/5.10.60-prod01/
build@         modules.alias.bin          modules.builtin.bin      modules.dep.bin  modules.softdep      source@
kernel/        modules.builtin            modules.builtin.modinfo  modules.devname  modules.symbols
modules.alias  modules.builtin.alias.bin  modules.dep              modules.order    modules.symbols.bin
$ 

As the final step, we make use of an internal script to generate the initramfs image and set up the bootloader (in this case, on our x86_64, it’s GRUB) by simply running:

sudo make install

For details and conceptual understanding of the initramfs, as well as some basic GRUB tuning, do see the Linux Kernel Programming – Part 1 book. We also provide useful references within the Further reading section for this chapter.

Now all that’s left to do is reboot your guest (or native) system, interrupt the bootloader (typically by holding the Shift key down during early boot) and select the newly built production kernel:

Figure 1.4 – Screenshot showing the GRUB bootloader screen and the new production kernel to boot from

As you can see from the preceding screenshot, I’m running the system as a guest OS via Oracle VirtualBox; holding the Shift key down during early boot got this bootloader screen to show up; I scrolled down, selected the new production kernel and pressed [Enter] to boot into it.

Voila, we’re now running our (guest) system with our brand new production kernel:

$ uname -a
Linux dbg-LKD 5.10.60-prod01 #1 SMP PREEMPT Thu Aug 19 17:10:00 IST 2021 x86_64 x86_64 x86_64 GNU/Linux
$ 

Information box

The new kernel should run just fine with the existing root filesystem – the libraries and applications are loosely coupled with the OS, allowing different versions of the kernel (one at a time, of course) to simply mount and use them. Also, you may not get all the bells and whistles; for example, on my guest OS with our new production kernel, features such as screen resizing, shared folders, and so on are missing. How come? They depend on the guest additions, whose kernel modules haven’t been built for this kernel. In this case, I find it a lot easier to work on the guest using the console over SSH. To do so, I installed the dropbear lightweight SSH server on the guest and then logged in over SSH from my host system. Windows users might like to try an SSH client like putty. (In addition, you might need to set up another bridged mode network adapter on the Linux guest.)

You can (re)check the current kernel’s configuration by looking up /boot/config-$(uname -r). In this case, it should be that of our production kernel, tuned towards security and performance.
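Looking up a single option in that file is easily scripted. The following is a sketch only, assuming the usual plain-text format of kernel config files (CONFIG_FOO=value lines, with disabled options recorded as "# CONFIG_FOO is not set"); the function name kconfig_value is a hypothetical helper of mine:

```shell
#!/bin/sh
# kconfig_value (hypothetical helper): report how an option is set in a
# plain-text kernel config file. Prints y, m, or the literal value; prints n
# when the option is absent or recorded as "# CONFIG_FOO is not set".
# Usage: kconfig_value CONFIG_FOO [config-file]
kconfig_value() {
    opt="$1"
    file="${2:-/boot/config-$(uname -r)}"
    # Keep only the text after "CONFIG_FOO=" on the first matching line.
    val=$(sed -n "s/^${opt}=//p" "$file" | head -n 1)
    echo "${val:-n}"
}
```

For example, kconfig_value CONFIG_RANDOMIZE_BASE tells you whether KASLR support is built into the kernel you’re currently running.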

Tip

To have the GRUB bootloader prompt always show up at boot: make a copy of /etc/default/grub (to be safe), then edit it, adding the line GRUB_HIDDEN_TIMEOUT_QUIET=false and (possibly) commenting out the line GRUB_TIMEOUT_STYLE=hidden.

Change the GRUB_TIMEOUT value from 0 to 3 (seconds). Run sudo update-grub to have the changes take effect, and reboot to test.
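As an illustration of this tip, the relevant lines of /etc/default/grub might end up looking like the following excerpt. Treat it as a sketch: your distribution’s file will contain other settings too, and it may not ship with the GRUB_TIMEOUT_STYLE line at all.

```shell
# /etc/default/grub (excerpt): only the settings mentioned in the tip
GRUB_HIDDEN_TIMEOUT_QUIET=false
#GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=3
```

Remember that the edit has no effect until you run sudo update-grub.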

So, good, you now have your guest (or native) Linux OS running a new production kernel. During the course of this book, you shall encounter various kernel-level (and some user-mode) bugs while running this kernel. Identifying the bug(s) will often involve your booting via the debug kernel instead. So, let’s now move on to creating a custom debug kernel for the system. Read on!

Setting up our custom debug kernel

As you have already set up a production kernel (as described in detail in the previous section), I won’t repeat every step in detail here, just the ones that differ:

  1. Firstly, ensure you have booted into the production kernel that you built in the previous section; this is to ensure that our debug kernel config uses it as a starting point:

    $ uname -r
    5.10.60-prod01
  2. Create a new working directory and extract the same kernel version again. It’s important to build the debug kernel in a separate workspace from that of the production one; true, it takes a lot more disk space but it keeps them clean and from stepping on each other’s toes as you perhaps modify their configs:

    mkdir -p ~/lkd_kernels/debugk
  3. We already have the kernel source tree (we earlier used wget to bring in the 5.10.60 compressed source); let’s reuse it, this time extracting it into the debug kernel work folder:

    cd ~/lkd_kernels
    tar xf linux-5.10.60.tar.xz --directory=debugk/
  4. Switch to the debug kernel directory and set up a starting point for kernel config – via the localmodconfig approach – just as we did for the production kernel. This time though, the config will be based on that of our custom production kernel, as that’s what is running right now on the system:

    cd ~/lkd_kernels/debugk/linux-5.10.60
    lsmod > /tmp/lsmod.now
    make LSMOD=/tmp/lsmod.now localmodconfig
  5. As this is a debug kernel, we now configure it with the express purpose of turning on the kernel’s debug infrastructure as much as is useful. (Though we do not care that much for performance and/or security, the fact is that as we’re inheriting the config from the production kernel, the security features are enabled by default).

    The interface we use to configure our debug kernel is the usual one:

    make menuconfig

Much (if not most) of the kernel debug infrastructure can be found in the last main menu item here – the one named Kernel hacking:

Figure 1.5 – Screenshot: make menuconfig / Kernel hacking – the majority of kernel debug options live here

There are just too many kernel configs relating to debugging to discuss individually here and now; several of them will be an important kernel debug feature that we shall explain and make use of in the chapters that follow. The following table summarizes some of the kernel config variables that we set or clear, depending on whether the config is for the debug or the production kernel. It is by no means exhaustive.

Not all of the config changes we make are within the Kernel hacking menu; others are changed as well (see the merged rows in the table that specify from which menu they originate as well as the Kconfig file(s) that they originate from). Further, the <D> in the Typical value … columns indicates that the decision is left to you (or the platform/BSP team) as the particular value to use does depend on the actual product or project, its High Availability (HA) characteristics, security posture, and so on.

Tip

You can search within the make menuconfig UI for a given config variable (CONFIG_XXX) by typing the key / (just as in vi!) and then typing the string to search for.

| Kernel config item | Meaning in brief | Typical value on production kernel | Typical value on debug kernel |
| --- | --- | --- | --- |
| General setup : init/Kconfig | | | |
| CONFIG_LOCALVERSION | Append a string to kernel version; useful (for example, -kdbg01) | <D> | <D> |
| CONFIG_IKCONFIG | Allow complete kernel config to be stored in-kernel; can extract via scripts/extract-ikconfig (or, see the next one); very useful | On (as m: module) | On |
| CONFIG_IKCONFIG_PROC | Access the in-kernel config file via /proc/config.gz (for example, extract with gunzip -c /proc/config.gz) | On | On |
| CONFIG_KALLSYMS_ALL | Loads all symbols into the kernel image | <D> | On |
| Processor type and features : arch/<arch>/Kconfig | | | |
| CONFIG_CRASH_DUMP | Enable a crash-dump capable kernel, triggered via the kexec() on kernel bug/Oops | <D> | On |
| CONFIG_RANDOMIZE_BASE | The Kernel Address Space Layout Randomization (KASLR) feature support; randomizes the physical address at which the kernel image is decompressed […] as a security feature that deters exploit attempts ... | On | Off |
| General architecture-dependent options : arch/Kconfig | | | |
| CONFIG_KPROBES | General architecture-dependent options / Kprobes: allows hooking into almost any kernel function/address; useful for non-intrusive instrumentation and debug | <D> | On |
| CONFIG_STACKPROTECTOR_STRONG | Intelligently add stack-protection canary logic via compiler; useful to detect Buffer overFlow (BoF) attacks | <D> | On |
| CONFIG_ARCH_MMAP_RND_BITS | Number of bits to use to determine the random offset to the base address of process memory regions; higher is good for security | 32 | 28 |
| CONFIG_VMAP_STACK | Enable vmapped (vmalloc() allocated) kernel stacks with guard pages | On | On |
| Executable file formats / CONFIG_COREDUMP | Enable core dumping | <D> | On |
| Enable loadable module support : init/Kconfig | | | |
| CONFIG_MODULE_SIG_FORCE | Only loads modules with a valid signature; used with a kernel lockdown LSM | <D> | Off |
| CONFIG_MODULE_SIG_ALL | Auto sign all kernel modules during the make modules_install step | On | On |
| CONFIG_UNUSED_SYMBOLS | Enable unused but exported kernel symbols; a bridge that should soon get removed | Off | On |
| Device Drivers / Network device support : drivers/net/Kconfig | | | |
| CONFIG_NETCONSOLE | Network console (netconsole) logging support; log kernel printk’s over the network | On | On |
| CONFIG_NETCONSOLE_DYNAMIC | Ability to dynamically reconfigure logging targets | <D> | On |
| Kernel Hacking : lib/Kconfig.debug, lib/Kconfig.* | | | |
| printk and dmesg options : lib/Kconfig.debug | | | |
| CONFIG_DYNAMIC_DEBUG | Dynamic debug feature for debug printk + logging (this auto-selects the core CONFIG_DYNAMIC_DEBUG_CORE feature as well) | On | On |
| Compile-time checks and compiler options : lib/Kconfig.debug | | | |
| CONFIG_DEBUG_INFO | Compile the kernel and modules with debug info (gcc -g) | Off | On |
| CONFIG_DEBUG_BUGVERBOSE | BUG() prints filename and line number + instruction pointer register and Oops trace | <D> | On |
| CONFIG_DEBUG_INFO_BTF | Generates dedup-ed BPF Type Format (BTF) info; useful for running eBPF in future (requires pahole v1.16 or later installed [1]) | <D> | On |
| Generic Kernel Debugging Instruments : lib/Kconfig.debug, lib/Kconfig.kgdb, lib/Kconfig.ubsan | | | |
| CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE | Enable all magic SysRq functionality by setting this bitmask (0x0=off, 0x1=all on, default is 0x01b6) | <D> | 0x1 |
| CONFIG_DEBUG_FS_DISALLOW_MOUNT | Debugfs additional protection layer on production; with this, API works but the filesystem isn’t visible (not mounted) | On | Off |
| CONFIG_KGDB | Remote debug kernel via GDB | Off | On |
| CONFIG_UBSAN | Enables the Undefined Behavior sanity checker | <D> | On |
| Memory Debugging : lib/Kconfig.debug, lib/Kconfig.kasan, lib/Kconfig.kgdb | | | |
| CONFIG_DEBUG_PAGEALLOC | Page memory allocs are tracked; can help in detecting some types of memory corruption | Off | On |
| CONFIG_DEBUG_WX | Warn on any W+X memory mappings seen at boot (writable memory should, in general, not be executable, that is, W^X should apply) | On | On |
| CONFIG_DEBUG_KMEMLEAK | Kernel memory leak detector | Off | On |
| CONFIG_SCHED_STACK_END_CHECK | Stack overrun check when schedule() is called; minimal runtime overhead | On | On |
| CONFIG_KASAN | Enable Kernel Address SANitizer – adds compile-time instrumentation; extremely useful in catching many memory bugs (OOB, UAF, and so on) | Off | On |
| Debug Oops, Lockups and Hangs : lib/Kconfig.debug | | | |
| CONFIG_PANIC_ON_OOPS | Enables panic on any Oops (kernel bug) | On | Off |
| CONFIG_PANIC_TIMEOUT | Timeout (seconds) after which system reboots; requires arch-level reboot support. Value n == 0 => wait forever, n >= 1 => reboot after ‘n’ seconds | <D> | 0 |
| Lock debugging (spinlocks, mutexes, etc...) : lib/Kconfig.debug | | | |
| CONFIG_PROVE_LOCKING | Prove locking correctness via the very sophisticated lockdep lock validator; also detects possibility of deadlock | Off | On |
| CONFIG_LOCK_STAT | Track lock contention code regions | Off | On |
| CONFIG_DEBUG_ATOMIC_SLEEP | Noisy warnings on any sleep performed within an atomic section of code | Off | On |
| <Various> : lib/Kconfig.debug | | | |
| CONFIG_BUG_ON_DATA_CORRUPTION | Debug kernel data structures: If data corruption found in a kernel structure, call BUG() | On | On |
| CONFIG_DEBUG_CREDENTIALS | Debug credential management: Debug checks to struct cred; useful for security as well | On | On |
| CONFIG_LATENCYTOP | Latency measuring infrastructure: Enable to use latencyTOP tool | <D> | On |
| CONFIG_STRICT_DEVMEM | Filter access to /dev/mem: If Off, userspace apps can memory map any memory region – user and kernel | On | On |
| Tracers : kernel/trace/Kconfig | | | |
| CONFIG_FUNCTION_TRACER | Enable Ftrace – tracing of every kernel function; disabled at runtime by default | <D> | On |
| CONFIG_FUNCTION_GRAPH_TRACER | Enables tracing of the call graph as well; useful to profile/debug | <D> | On |
| CONFIG_DYNAMIC_FTRACE | Dynamic function tracing; no performance overhead when function tracing is disabled (the default) | <D> | On |
| <Early printk> : arch-dependent; CONFIG_EARLY_PRINTK on x86 | x86 Debugging: Useful to see early kernel printk’s before console device is initialized | Off | On |
| Kernel Testing and Coverage : lib/Kconfig*, lib/kunit/Kconfig | | | |
| CONFIG_FAULT_INJECTION | Enable the kernel’s fault-injection framework | Off | On |
| Security : security/Kconfig, security/*/Kconfig* | | | |
| CONFIG_SECURITY_DMESG_RESTRICT | If On, only root can read kernel printk’s via dmesg(1) | On | Off |
| CONFIG_LOCK_DOWN_KERNEL_FORCE_CONFIDENTIALITY | Kernel in lockdown (via an LSM), mode set to confidentiality; if On, modules can’t be (un)loaded | On | Off |
Table 1.1 – Summary of a few kernel config variables, meaning, and value

Besides the <D> value, the other values shown in the preceding table are merely my recommendation; they may or may not be suitable for your particular use case.
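Once you’ve made your choices, it can be handy to verify that a saved config file really carries the values you intended for it. Here’s a minimal sketch of such a check, again assuming the plain-text CONFIG_FOO=value format; the function name check_kconfig and its NAME=value argument convention are my own, hypothetical ones, and the options you pass are entirely up to you:

```shell
#!/bin/sh
# check_kconfig (hypothetical helper): compare a kernel config file against
# a list of expected settings.
# Usage: check_kconfig <config-file> NAME=value [NAME=value ...]
# An expected value of n matches an absent option or "# NAME is not set".
# Prints one line per mismatch; returns non-zero if any mismatch was found.
check_kconfig() {
    file="$1"; shift
    rc=0
    for want in "$@"; do
        name="${want%%=*}"        # text before the first '='
        expected="${want#*=}"     # text after the first '='
        actual=$(sed -n "s/^${name}=//p" "$file" | head -n 1)
        actual="${actual:-n}"
        if [ "$actual" != "$expected" ]; then
            echo "MISMATCH ${name}: expected ${expected}, found ${actual}"
            rc=1
        fi
    done
    return $rc
}
```

As an example, for a debug kernel config you might run check_kconfig .config CONFIG_DEBUG_INFO=y CONFIG_KASAN=y CONFIG_PANIC_ON_OOPS=n from the root of the kernel source tree; no output (and a zero exit status) means every listed option matches.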

[1] Installing pahole v1.16 or later: pahole is part of the dwarves package. However, on Ubuntu 20.04 (or older) it’s version 1.15, which causes the kernel build – when enabled with CONFIG_DEBUG_INFO_BTF – to fail, as pahole version 1.16 or later is required. To address this on Ubuntu 20.04, we’ve provided the v1.17 Debian package in the root of the GitHub source tree. Install it manually as follows:

sudo dpkg -i dwarves_1.17-1_amd64.deb

Being able to view (query) the currently running kernel’s configuration can prove to be a very useful thing, especially on production systems. This can be done by looking up (grepping) /proc/config.gz (a simple zcat /proc/config.gz | grep CONFIG_<FOO> is typical). The pseudo-file /proc/config.gz contains the entire kernel config (it’s practically equivalent to the .config within the kernel source tree). Now, this pseudo-file is only generated by setting CONFIG_IKCONFIG=y. As a safety measure on production systems, we set this config to the value m on production, implying that it’s available as a kernel module (called configs). Only once you load this up does the /proc/config.gz file become visible; and of course, to load it up you require root access...

Here’s an example of loading it up and then querying the kernel config (for this very feature!):

$ ls -l /proc/config.gz
ls: cannot access '/proc/config.gz': No such file or directory

Ok, to begin with (on production) it doesn’t show up. So do this:

$ sudo modprobe configs
$ ls -l /proc/config.gz
-r--r--r-- 1 root root 34720 Oct  5 19:35 /proc/config.gz
$ zcat /proc/config.gz |grep IKCONFIG
CONFIG_IKCONFIG=m
CONFIG_IKCONFIG_PROC=y

Ah, it now works just fine!
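The two places we’ve seen for looking up the running kernel’s config (/proc/config.gz and /boot/config-$(uname -r)) can be wrapped into one helper. This is just a sketch under those assumptions; the function name running_kconfig is hypothetical, and the two optional path arguments exist only to make it easy to try out on sample files:

```shell
#!/bin/sh
# running_kconfig (hypothetical helper): print the running kernel's config
# as plain text. Prefers /proc/config.gz (present only once the in-kernel
# config is exposed, for example after "modprobe configs" when IKCONFIG=m);
# otherwise falls back to /boot/config-$(uname -r).
# Usage: running_kconfig [gz-path] [plain-path]
running_kconfig() {
    gz="${1:-/proc/config.gz}"
    plain="${2:-/boot/config-$(uname -r)}"
    if [ -r "$gz" ]; then
        gzip -dc "$gz"
    elif [ -r "$plain" ]; then
        cat "$plain"
    else
        echo "no kernel config found (try: sudo modprobe configs)" >&2
        return 1
    fi
}
```

A typical use is running_kconfig | grep IKCONFIG, which works whether or not the configs module has been loaded, as long as the distro-style file under /boot exists.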

Food for thought

Did you notice? In Table 1.1, I’ve set the production kernel’s value for CONFIG_KALLSYMS_ALL as <D>, implying it’s up to the system architects to decide whether to keep it On or Off. Why? Shouldn’t all kernel symbols be disabled (off) in a production system? Well, that’s the common decision. Recall our brief on the Mars Pathfinder mission – where it initially failed due to a priority inversion issue. The tech lead of the software team at JPL, Glenn Reeves, made a very interesting statement in his now famous response to Mike Jones (https://www.cs.unc.edu/~anderson/teach/comp790/papers/mars_pathfinder_long_version.html): The software that flies on Mars Pathfinder has several debug features within it that are used in the lab but are not used on the flight spacecraft (not used because some of them produce more information than we can send back to Earth). These features were not "fortuitously" left enabled but remain in the software by design. We strongly believe in the "test what you fly and fly what you test" philosophy.

Sometimes, keeping debug features (and of course, logging) turned on in the production version of the system, can be immensely helpful!

For now, don’t stress too much on exactly what each of these kernel debug options mean and how you’re to use them; we shall cover most of these kernel debug options in the coming chapters. The entries in Table 1.1 are meant to kickstart the configuration of your production and debug kernels and get a brief idea regarding their effect.

Once you’re done generating the new debug kernel config, let’s back it up as follows:

cp -af .config ~/lkd_kernels/kconfig_dbg01

Build it as before, with make -j8 all (adjust the parameter to -j based on the number of CPU cores on your box). When done, check out the compressed and uncompressed kernel image files:

$ ls -lh arch/x86/boot/bzImage vmlinux
-rw-r--r-- 1 letsdebug letsdebug  18M Aug 20 12:35 arch/x86/boot/bzImage
-rwxr-xr-x 1 letsdebug letsdebug 1.1G Aug 20 12:35 vmlinux
$ 

Did you notice? The size of the vmlinux uncompressed kernel binary image file is huge; how come? All the debug features plus all the kernel symbols account for this large size...

Finish off with installing the kernel modules, initramfs, and bootloader update, as earlier:

sudo make modules_install && sudo make install

Great; now that you’re done configuring both the production and debug kernels, let’s briefly examine the difference between the configurations.

Seeing the difference – production and debug kernel config

It’s enlightening – and really, it’s the key thing within this particular topic - to see the differences between our original production and the just-built debug kernel configuration. This is made easy via the convenience script scripts/diffconfig; from within the debug kernel source tree, simply do this to generate the diff:

scripts/diffconfig ~/lkd_kernels/kconfig_prod01 ~/lkd_kernels/kconfig_dbg01 > ../../kconfig_diff_prod_to_debug.txt

View the output file in an editor, seeing for yourself the changes we wrought in configuration. There are indeed many deltas – on my system, the diff file exceeds 200 lines. Here’s a partial look at it on my system (I use the ellipsis [ … ] to denote skipping some output):

$ cat kconfig_diff_prod_to_debug.txt
-BPF_LSM y
-DEFAULT_SECURITY_APPARMOR y
-DEFAULT_SECURITY_SELINUX n
-DEFAULT_SECURITY_SMACK n
[ … ]

The - (minus sign) prefixing each of the above lines indicates that we removed this kernel config feature from the debug kernel. Continuing with the output:

DEBUG_ATOMIC_SLEEP n -> y
DEBUG_BOOT_PARAMS n -> y
DEBUG_INFO n -> y
DEBUG_KMEMLEAK n -> y
DEBUG_LOCK_ALLOC n -> y
DEBUG_MUTEXES n -> y
DEBUG_PLIST n -> y
DEBUG_RT_MUTEXES n -> y
DEBUG_RWSEMS n -> y
DEBUG_SPINLOCK n -> y
[ … ]

In the preceding code snippet, you can clearly see the change made from the production to the debug kernel; for example, the first line tells us that the kernel config named DEBUG_ATOMIC_SLEEP was disabled in the production kernel and we’ve now enabled it (n -> y) in the debug kernel! (Note that it will be prefixed with CONFIG_, that is, it will show up as CONFIG_DEBUG_ATOMIC_SLEEP in the kernel config file itself.)

Here, we can see how the suffix to the name of the kernel – the config directive named CONFIG_LOCALVERSION – has been changed between the two kernels, besides other things:

LKDTM n -> m
LOCALVERSION "-prod01" -> "-dbg01"
LOCK_STAT n -> y
MMIOTRACE n -> y
MODULE_SIG y -> n
[ … ]

The + prefix to each line indicates the feature that has been added to the debug kernel:

+ARCH_HAS_EARLY_DEBUG y
+BITFIELD_KUNIT n
[ … ]
+IKCONFIG m
+KASAN_GENERIC y
[ … ]
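With a diff of this size, a quick tally of the three kinds of entries we’ve just seen (removed, changed, added) helps you get your bearings. The following is a sketch only, assuming the output format shown above (a leading - or + for removed and added entries, and old -> new for changed ones); the function name diffconfig_summary is my own, hypothetical one:

```shell
#!/bin/sh
# diffconfig_summary (hypothetical helper): count removed (-), added (+)
# and changed (old -> new) entries in text produced by scripts/diffconfig.
# Usage: diffconfig_summary <diff-file>
diffconfig_summary() {
    awk '
        /^-/   { removed++; next }   # entry present only in the first config
        /^\+/  { added++;   next }   # entry present only in the second config
        / -> / { changed++ }         # entry whose value differs
        END { printf "removed=%d added=%d changed=%d\n", removed, added, changed }
    ' "$1"
}
```

Running diffconfig_summary on the kconfig_diff_prod_to_debug.txt file generated earlier prints a single line with the three counts.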

In closing, it’s important to realize:

  • The particulars of the kernel configuration we’re performing here – for both our production and debug kernels – are merely representative; your project or product requirements might dictate a different config.
  • Many, if not most, modern embedded Linux projects typically employ a sophisticated builder tool or environment; Yocto and Buildroot are two common de facto examples. In such cases, you will have to adapt the instructions given here to cater to using these build environments (in the case of Yocto, this can become a good deal of work in specifying an alternate kernel configuration via a BB-append-style recipe).

By now, I’m fervently hoping you’ve absorbed this material and indeed, built yourself two custom kernels – a production and a debug one. If not, I request you to please do so before proceeding further along.

So, great, well done! By now, you have both custom 5.10 LTS production and debug kernels ready to rip. We’ll certainly make use of them in the coming chapters. Let’s finish this chapter with a few debug ‘tips’ that I hope you’ll find useful.

 

Debugging – a few quick tips

I’ll start off by saying this: debugging is both a science and an art, refined with experience; the mundane hands-on slogging through to identify a bug and its root cause, and (possibly) to fix it. I’m of the opinion that the following few debug tips are really nothing new; that said, we do tend to get caught up in the moment and often miss the obvious. I hope you’ll find these useful and return to these tips time and again!

  • Assumptions - just say NO!

    Churchill famously said “Never, never, never, give up”. We say “Never, never, never, make assumptions”.

    Assumptions are, very often, the root cause behind many, many bugs and defects. Think back, re-read the section Software bugs – a few actual cases!

    In fact (hey, I’m partially joking here), just look at the word ‘ASSUME’: it just begs to say: “Don’t make an ASS out of U and ME”!

    Using assertions in your code (we shall cover this), is a great way to catch assumptions.

  • Don’t lose the forest for the trees!

    At times, we do get lost in the twisted mazes of complex code paths; in these circumstances, it’s really easy to lose sight of the large idea, the objective of the code. Try and zoom out, think of the bigger picture. It often helps spot the faulty assumption(s) that led to the error(s). Well written documentation can be a life saver.

  • Think small: When faced with a difficult bug, try this: build / configure / get the smallest possible version of your problem (statement) to execute, causing the issue or bug you’re currently facing to surface; this often helps you track down the root cause of the problem. In fact, very often (in my own experience), the mere act of doing this – or even just the detailed jotting down of the problem you face – triggers your seeing the actual issue and its solution in your mind!
  • “It requires twice the brain power to debug a piece of code than to write it”: This paraphrased quote is by Brian Kernighan in the book The Elements of Programming Style. So, should we not use our full brain power while writing code? Ha, of course you should... But, debugging is typically harder than writing code. The real point is this: take the trouble to first carefully do your groundwork; write a brief very high-level design document, write what you expect the code to do, at a high level of abstraction. Then move into the specifics (with a so-called Low Level Design doc). Good documentation will save you one day (and blessings shall be showered upon you!).

    Reminds me of another quote: An ounce of design is worth a pound of refactoring, Karl Wiegers.

  • Employ a “Zen Mind, Beginner’s Mind”: Sometimes, the code can become too complex (spaghetti-like; it just smells). In many cases, just giving up and starting from scratch again, if viable, is perhaps the best thing to do.

    This Zen - Beginner’s Mind also implies that we at least temporarily stop our (perhaps over-egotistical) thought patterns (I wrote this so well, how can it be wrong!?) and look at the situation from the point of view of somebody completely new to it. A good night’s rest can do wonders.

    It is, in fact, one key reason why a colleague reviewing your code can spot bugs you’d never see!

  • Variable naming, comments: I recall a Q&A on Quora revealing that the hardest thing a programmer does is naming variables well! This is truer than it might appear at first glance. Variable names stick; choose yours carefully. As with commenting, don’t go overboard either: a local variable for a loop index? int i is just fine (int theloopindex is just painful). The same goes for comments: they’re there to explain the rationale, the design, behind the code, what it’s designed and implemented to achieve, not how the code works. Any competent programmer can figure that out.
  • Ignore logs at your peril! It's self-evident perhaps, but we can often miss the obvious when under pressure... carefully checking kernel (and even app) logs often reveals the source of the issue you might be facing. Logs are usually able to be displayed in reverse-chronological order and give you a view of what actually occurred; Linux’s systemd journalctl(1) utility is powerful; learn to leverage it!
  • Testing can reveal the presence of errors but not their absence: A truism, unfortunately. Still, testing and QA are among the most critical parts of the software process; ignore them at your peril! The time and trouble taken to write exhaustive test cases (both positive and negative) pay large dividends in the long run, helping make the product or project a grand success. Negative test cases and fuzzing are critical for exposing (and subsequently fixing) security vulnerabilities in the codebase. One hundred percent code coverage is the objective!
  • Incurring technical debt: Every now and then, you realize deep down that though what you’ve coded works, it’s not been done well enough (perhaps there still exist corner cases that will trigger bugs or undefined behavior); you have that nagging feeling that perhaps this design and implementation simply isn’t the best. The temptation to quickly check it in and hope for the best can be high, especially as deadlines loom! Please don’t; technical debt is a real thing. It will come and get you.
  • Silly mistakes: If I had a penny for each time I’ve made really silly mistakes when developing code, I’d be a rich man! For instance, I once spent nearly half a day racking my brain over why my C program would just refuse to work correctly, until I realized I was editing the correct code but compiling an old version of it: I was performing the build in the wrong directory! (I am certain you’ve faced your share of such frustrations.) Often, a break, or a good night’s sleep, can do wonders.
  • Empirical model: The word empirical means to validate something (anything) by actual and direct observation or experience rather than relying on theory. So, don’t believe the book (this one is an exception of course!), don’t believe the tutorial, the tutor or author: try it out and see for yourself!
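
To make the tip on positive and negative test cases a little more concrete, here is a minimal sketch in C. Note that bounded_copy() and run_tests() are hypothetical names invented purely for this illustration (they're not from this book's code base); the idea is simply that valid input must succeed, and invalid input must be rejected cleanly rather than crash or silently truncate.

```c
#include <stddef.h>
#include <string.h>

/*
 * bounded_copy(): hypothetical helper, for illustration only.
 * Copies src (including its terminating NUL) into dst only if it fits
 * entirely. Returns 0 on success, -1 on bad arguments or lack of space.
 */
int bounded_copy(char *dst, size_t dstsz, const char *src)
{
	size_t need;

	if (!dst || !src || dstsz == 0)
		return -1;
	need = strlen(src) + 1;		/* +1 for the terminating NUL */
	if (need > dstsz)
		return -1;		/* refuse to truncate silently */
	memcpy(dst, src, need);
	return 0;
}

/* Returns the number of failed checks; 0 means every check passed */
int run_tests(void)
{
	char buf[8];
	int failures = 0;

	/* Positive cases: valid input must succeed and yield the right data */
	failures += (bounded_copy(buf, sizeof(buf), "kernel") != 0);
	failures += (strcmp(buf, "kernel") != 0);
	failures += (bounded_copy(buf, sizeof(buf), "1234567") != 0);	/* exact fit */

	/* Negative cases: invalid input must be rejected, not crash */
	failures += (bounded_copy(buf, sizeof(buf), "12345678") != -1);	/* one byte too long */
	failures += (bounded_copy(NULL, sizeof(buf), "x") != -1);
	failures += (bounded_copy(buf, sizeof(buf), NULL) != -1);
	failures += (bounded_copy(buf, 0, "x") != -1);

	return failures;
}
```

Calling run_tests() from your own main() and treating any non-zero return as a failure is one simple way to wire this into a build. Notice that the negative cases outnumber the positive ones; that's quite typical, as there are usually many more ways for input to be wrong than right.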

Years (decades, actually) back, on my very first day of work at a company I joined, a colleague emailed me a document that I still hold dear: The Ten Commandments for C Programmers, by Henry Spencer (https://www.electronicsweekly.com/open-source-engineering/linux/the-ten-commandments-for-c-programmers-2009-04/). Do check it out. In a clumsily similar manner, I present a quick checklist for you.

A programmer’s checklist – seven rules

Very important! Did you remember to:

  • Check all APIs for their failure case
  • Compile with warnings on (definitely with -Wall and possibly -Wextra) and eliminate all warnings as far as is possible

  • Never trust (user) input; validate it
  • Eliminate unused (or dead) code from the codebase immediately
  • Test thoroughly; 100% code coverage is the objective. Take the time and trouble to learn to use powerful tools: memory checkers, static and dynamic analyzers, security checkers (checksec, lynis, and several others), fuzzers, code coverage tools, fault injection frameworks, and so on. Don’t ignore security!
  • With regard to the kernel and especially drivers, after eliminating software issues, be aware that (peripheral) hardware issues could be the root cause of the bug; don’t discount this out of hand! (One learns this the hard way.)
  • Do not assume anything (ASSUME: makes an ASS out of U and ME); using assertions helps catch assumptions, and thus bugs
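
Several of these rules can be seen working together in a few lines of user space C. What follows is only a sketch under stated assumptions: parse_percent() is a hypothetical helper made up for this illustration, and the 0 to 100 range is an arbitrary choice. It checks strtol(3) for its failure cases, validates the untrusted input, and uses an assertion to state an assumption explicitly.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * parse_percent(): hypothetical helper, for illustration only.
 * Converts untrusted text into an int within the range 0..100.
 * Returns 0 on success, -1 on any invalid input.
 */
int parse_percent(const char *text, int *out)
{
	char *end = NULL;
	long val;

	/* Never trust input: reject NULL pointers and empty strings up front */
	if (!text || !out || *text == '\0')
		return -1;

	/* Check the API's failure cases: strtol() reports overflow via errno */
	errno = 0;
	val = strtol(text, &end, 10);
	if (errno == ERANGE)
		return -1;	/* value does not fit in a long */
	if (end == text || *end != '\0')
		return -1;	/* no digits at all, or trailing junk */
	if (val < 0 || val > 100)
		return -1;	/* well-formed number, but outside our range */

	/* State the assumption explicitly; it must hold given the checks above */
	assert(val >= 0 && val <= 100);
	*out = (int)val;
	return 0;
}
```

Building this with gcc -Wall -Wextra should produce no warnings. A caller would then write something like: if (parse_percent(argv[1], &pct) < 0) { /* handle the error */ }, which is itself an instance of the first rule, checking the failure case.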

We shall certainly be elaborating on several of these points in the coming material.

 

Summary

Firstly, congratulations on completing this, our first chapter; getting started is half the battle! You began by learning a bit about how the word debug came to be – equal parts myth, legend and truth...

A key section was the brief description of some complex real-world cases of software gone wrong (several of them very unfortunate tragedies), where a software bug (or bugs) proved to be a key factor behind the disaster.

You understood that we’re using the latest (at the time of this writing) 5.10 LTS kernel and how to set up the workspace (on x86_64, using either a native Linux system or Linux running as a guest OS). We covered the configuring and building of two custom kernels, a production one and a debug one: the production kernel is geared towards high performance and security, whereas the debug kernel is configured with several (most) kernel debug features turned on, in order to help catch bugs. I will assume you’ve done this for yourself, as future chapters will depend on it.

Finally, and I think very importantly, a few debugging tips and a small checklist wrapped up this chapter. I urge you to read through the tips and checklist often.

In the next chapter, you will learn the basics of user mode debugging, how to use a few very useful tools, instrument your code and leverage a better Makefile. See you there soon!

 

Further reading

About the Author

  • Kaiwan N Billimoria

    Kaiwan N Billimoria taught himself BASIC programming on his dad's IBM PC back in 1983. He was programming in C and Assembly on DOS until he discovered the joys of Unix, and by around 1997, Linux!

    Kaiwan has worked on many aspects of the Linux system programming stack, including Bash scripting, system programming in C, kernel internals, device drivers, and embedded Linux work. He has actively worked on several commercial/FOSS projects. His contributions include drivers to the mainline Linux OS and many smaller projects hosted on GitHub. His Linux passion feeds well into his passion for teaching these topics to engineers, which he has done for well over two decades now. He's also the author of Hands-On System Programming with Linux. It doesn't hurt that he is a recreational ultrarunner too.
