Writing a Simple misc Character Device Driver

No doubt, device drivers are a vast and interesting topic. Not only that, they are perhaps the most common use of the Loadable Kernel Module (LKM) framework that we have been using. Here, we shall introduce you to writing a few simple yet complete Linux character device drivers, within a class called misc; yes, that's short for miscellaneous. We wish to emphasize that this chapter is limited in its scope and coverage - here, we do not attempt to delve into the deep details regarding the Linux driver model and its many frameworks; instead, we refer you to several excellent books and tutorials on this topic via the Further reading section for this chapter. Our aim here is to quickly get you familiar with the overall concepts behind writing a simple character device driver.

Having said that, this book indeed has several chapters that are dedicated to what a driver author needs to know. Besides this introductory chapter, we cover (in detail) how a driver author works with hardware I/O memory, hardware interrupt handling (and its many sub-topics), and kernel mechanisms such as delays, timers, kernel threads, and work queues. Use of various user-kernel communication pathways or interfaces is covered in detail as well. The final two chapters of this book then focus on something very important for any kernel development, including drivers – kernel synchronization.

The other reasons we'd prefer to write a simple Linux character device driver and not just our "usual" kernel module are as follows:

Until now, our kernel modules have been quite simplistic, having only init and cleanup functions, nothing more. A device driver provides several entry points into the kernel; these are the file-related system calls, known as the driver's methods. So, we can have an open() method, a read() method, a write() method, an llseek() method, an [unlocked|compat]_ioctl() method, a release() method, and so on.

FYI, all possible "methods" (functions) the driver author can hook into are in this key kernel data structure: include/linux/fs.h:file_operations (more on this in the Understanding the connection between the process, the driver, and the kernel section).

This situation is simply more realistic, and more interesting.

In this chapter, we will cover the following topics:

Getting started with writing a simple misc character device driver
Copying data from kernel to user space and vice versa
A misc driver with a secret
Issues and security concerns

Technical requirements

I assume that you have gone through the Preface section To get the most out of this book, and have appropriately prepared a guest VM running Ubuntu 18.04 LTS (or a later stable release) and installed all the required packages. If not, I highly recommend you do this first. To get the most out of this book, I strongly recommend you first set up the workspace environment, including cloning this book's GitHub repository for the code, and work on it in a hands-on fashion. The repository can be found here: https://github.com/PacktPublishing/Linux-Kernel-Programming-Part-2.

Getting started with writing a simple misc character device driver

In this section, you will first learn the required background material – understanding the basics of the device file (or node) and its hierarchy. After that, you will learn – by actually writing the code of a very simple misc character driver – the kernel framework behind the raw character device driver. Along the way, we shall cover how to create the device node(s) and test the driver via a user space app. Let's get started!

Understanding the device basics

Some quick background is in order.

A device driver is the interface between the OS and a peripheral hardware device. It can be written inline – that is, compiled within the kernel image file – or, more commonly, written outside of the kernel source tree as a kernel module (we covered the LKM framework in detail in the companion guide Linux Kernel Programming, Chapter 4, Writing Your First Kernel Module – LKMs Part 1, and Chapter 5, Writing Your First Kernel Module – LKMs Part 2). Either way, the driver code certainly runs at OS privilege, in kernel space (user space device drivers do exist, but can suffer performance issues; while useful in many circumstances, we don't cover them here. Take a look at the Further reading section).

In order for a user space application to gain access to the underlying device driver within the kernel, some I/O mechanism is required. The Unix (and thus Linux) design is to have the process open a special type of file – a device file, or device node. These files typically live in the /dev directory, and on modern systems are dynamic and auto-populated. The device node serves as an entry point into the device driver.

In order for the kernel to distinguish between device files, it uses two attributes within their inode data structure:

The type of file – either character (char) or block
The major and minor number

You will see that the namespace – the device type and the {major#, minor#} pair – form a hierarchy. Devices (and thus their drivers) are organized within a tree-like hierarchy within the kernel (the driver core code within the kernel takes care of this). The hierarchy is first divided based on device type – block or char. Within that, we have some n major numbers for each type, and each major number is further classified via some m minor numbers; Figure 1.1 shows this hierarchy.

Now, the key difference between block and character devices is that block devices have the (kernel-level) capability to be mounted and thus become part of the user-accessible filesystem. Character devices cannot be mounted; thus, storage devices tend to be block-based. Think of it this way (a bit simplistic but useful): if the (hardware) device is not storage, nor a network device, then it's a character device. A huge number of devices fall into the 'character' class, including your typical I2C/SPI (Inter Integrated Circuit / Serial Peripheral Interface) sensor chips (temperature, pressure, humidity, and so on), touchscreens, Real-Time Clock (RTC), media (video, camera, audio), keyboards, mice, and so on. USB forms a class within the kernel for infrastructure support. USB devices can be block devices (pen drives, USB disks), character devices (mice, keyboard, camera) or network (USB dongles) devices.

From 2.6 Linux onward, the {major:minor} pair is a single unsigned 32-bit quantity within the inode, a bitmask (it's the dev_t i_rdev member). Of these 32 bits, the MSB 12 bits represent the major number and the remaining LSB 20 bits represent the minor number. A quick calculation shows that there can therefore be up to 2¹² = 4,096 major numbers and 2²⁰, which is one million, minor numbers per major number. So, glance at Figure 1.1; within the block hierarchy, there are a possible 4,096 majors, each of which can have up to 1 million minors. Similarly, within the character hierarchy, there are a possible 4,096 majors, each of which can have up to 1 million minors:

Figure 1.1 – The device namespace or hierarchy

You may be wondering: what exactly does this major:minor number pair really mean? Think of the major number as representing the class of the device (is it a SCSI disk, a keyboard, a teletype terminal (tty) or pseudo-terminal (pty) device, a loopback device (yes, these are pseudo-hardware devices), a joystick, a tape device, a framebuffer, a sensor chip, a touchscreen, and so on?). There's indeed an enormous range of devices; to get a sense of just how many, we urge you to check out the kernel documentation here: https://www.kernel.org/doc/Documentation/admin-guide/devices.txt (it's literally the official registry of all available devices for the Linux OS. It's formally called the LANANA – the Linux Assigned Names And Numbers Authority! Only these folks can officially assign the device node – the type and major:minor numbers – to devices).

The minor number's meaning (interpretation) is left completely to the driver author; the kernel does not interfere. Typically, the driver interprets the device's minor number to represent either a physical or logical instance of the device, or to represent a certain functionality. (For example, the Small Computer System Interface (SCSI) driver – of type block, major #8 – uses minor numbers to represent logical disk partitions for up to 16 disks. On the other hand, character major #119 is used by VMware's virtual network control driver. Here, the minors are interpreted as the first virtual network, second virtual network, and so on.) Similarly, all drivers themselves assign meaning to their minor numbers. But every good rule has an exception. Here, the exception to the rule - that the kernel doesn't interpret the minor number – is the misc class (type character, major #10). It uses the minor numbers as second-level majors. This will be covered in the following section.

A common problem is that of the namespace getting exhausted. A decision taken years back "collects" various miscellaneous character devices - a lot of mice (no, not of the animal kingdom variety), sensors, touchscreens, and so on - into one class called the misc or 'miscellaneous' class, which is assigned character major number 10. Within the misc class live a lot of devices and their corresponding drivers. In effect, they share the same major number and rely on a unique minor number to identify themselves. We shall write a few drivers using precisely this class and leveraging the kernel's 'misc' framework.

Many devices have already been assigned via the LANANA (Linux Assigned Names And Numbers Authority) into the misc character device class. Figure 1.2 shows a partial screenshot from https://www.kernel.org/doc/Documentation/admin-guide/devices.txt showing the first few misc devices, their assigned minor numbers, and a brief description. Do see the reference link for the full list:

Figure 1.2 – Partial screenshot of misc devices: char type, major # 10

In Figure 1.2, the leftmost column has 10 char, specifying that it's assigned major # 10 under the character type of the device hierarchy (Figure 1.1). The columns to the right are in the form minor# = /dev/<foo> <description>; quite obviously, this is the minor number assigned followed by (after the = sign) the device node and a one-line description.

A quick note on the Linux Device Model

Without going into great detail, a quick overview of the modern unified Linux Device Model (LDM) is important. Modern Linux, from the 2.6 kernel onward, has a fantastic feature, the LDM, which achieves many goals to do with the system and the devices on it in one broad and bold stroke. Among its many features, it creates a complex hierarchical tree unifying system components, all peripheral devices, and their drivers. This very tree is exposed to user space via the sysfs pseudo-filesystem (analogous to how procfs exposes some kernel and process/thread internal details to user space) and is typically mounted under /sys. Within /sys, you will find several directories – you can consider them to be "viewports" into the LDM. On our x86_64 Ubuntu VM, we show the sysfs filesystem mounted under /sys:

$ mount | grep -w sysfs
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)

Furthermore, take a peek inside:

$ ls -F /sys/
block/ bus/ class/ dev/ devices/ firmware/ fs/ hypervisor/ kernel/ module/ power/

Think of these directories as viewports into the LDM – different ways of viewing the devices on the system. Of course, as things evolve, more tends to get in than get out (the bloat aspect!). Several non-obvious directories have now made their way in here. Though (as with procfs) sysfs is officially documented as an Application Binary Interface (ABI) interface, that's subject to change/deprecation at any time; the reality is that this system is there to stay – and evolve, of course – over time.

The LDM, a bit simplistically, can be thought of as having – and tying together – these major components:

The buses on the system.
The devices on them.
The device drivers that drive the devices (also often referred to as client drivers).

A fundamental LDM tenet is that every single device must reside on a bus. This might seem obvious: USB devices will be on the USB bus, PCI devices on the PCI bus, I2C devices on the I2C bus, and so on. Thus, under the /sys/bus hierarchy, you will be able to literally "see" all the devices via the buses that they reside on:

Figure 1.3 – The different buses or bus driver infrastructure on modern Linux (on an x86_64)

The kernel's driver core provides bus drivers (that are (typically) either part of the kernel image itself or auto-loaded at boot as required), which, of course, makes the buses do their job. What is their job? Critically, they organize and recognize the devices on them. If a new device surfaces (perhaps you plugged in a pen drive), the USB bus driver will recognize the fact and bind it to its (USB mass storage) device driver! Once successfully bound (many terms are used to describe this: bound, enumerated, discovered), the kernel driver framework invokes the registered probe() method (function) of the driver. This probe method now sets up the device, allocating resources, IRQs, memory setup, registering it as required, and so on.

Another key aspect to understand regarding the LDM is that the modern LDM-based driver should typically do the following:

Register itself to a (specialized) kernel framework.
Register itself to a bus.

The kernel framework it registers itself to depends on the type of device you are working with; for example, a driver for an RTC chip that resides on the I2C bus will register itself to the kernel's RTC framework (via the rtc_register_device() API) and to the I2C bus (internally via the i2c_register_driver() API). On the other hand, a driver for a network adapter (a NIC) on the PCI bus will typically register itself to the kernel's network infrastructure (via the register_netdev() API) and the PCI bus (via the pci_register_driver() API). Registering with a specialized kernel framework makes your job as a driver author a lot easier – the kernel will often provide helper routines (and even data structures) to take care of I/O details, and so on. For example, take the previously mentioned RTC chip driver.

You needn't know the details of how to communicate with the chip over the I2C bus, bit banging out data on the Serial Clock (SCL)/Serial Data (SDA) lines as the I2C protocol demands. The kernel I2C bus framework provides you with convenience routines (such as the typically used i2c_smbus_*() APIs) that let you quite effortlessly communicate over the bus to the chip in question!

If you're wondering how to get more information on these driver APIs, here's the good news: the official kernel documentation has plenty to offer. Do look up The Linux driver implementer’s API guide here: https://www.kernel.org/doc/html/latest/driver-api/index.html.

(We do show some examples of the probe() method of a driver in the following two chapters; until then, patience, please.) Conversely, when the device is detached from the bus or the kernel module is unloaded (or the system is shutting down), the detach causes the driver's remove() (or disconnect()) method to be invoked. Between these, the work of the device via its drivers (both bus and client) is carried out!

Please note that we are glossing over a lot of the inner details here, as they are beyond the scope of this book. The point is to give you a conceptual understanding of the LDM. Do refer to the articles and links in the Further reading section for more detailed information.

Here, we wish to keep our driver coverage very simple and minimal, focusing more on the underlying basics. Hence we have chosen to write a driver that uses perhaps the simplest kernel framework – the misc or miscellaneous kernel framework. In this case, the driver doesn't even need to explicitly register with any bus (driver). In fact, it's more like this: our driver works directly on the hardware without the need for any particular bus infrastructure support.

In our particular example using the misc kernel framework, since we don't explicitly register with any bus (driver), we don't even require the probe()/remove() methods. This keeps things simple. On the other hand, once you have understood this simplest of drivers, I encourage you to go further and look at writing device drivers with the typical kernel framework registration plus bus driver registration, thus employing the probe()/remove() methods. A good way to get started is to learn how to write a simple platform driver, registering it with the kernel's misc framework and the platform bus, a pseudo-bus infrastructure that supports devices that do not physically reside on any physical bus (this is more common than you might at first imagine; several peripherals built into a modern System on Chip (SoC) are not on any physical bus, and thus their drivers are typically platform drivers). To get started, look under the kernel source tree in drivers/ for code invoking the platform_driver_register() API. The official kernel documentation here covers platform devices and drivers: https://www.kernel.org/doc/html/latest/driver-api/driver-model/platform.html#platform-devices-and-drivers.

As additional help, note the following:
- Do refer to Chapter 2, User-Kernel Communication Pathways, particularly the Creating a simple platform device and Platform devices sections.
- An exercise (see the Questions section) for this chapter is to write such a driver. I have provided a sample (and very simple) implementation here: solutions_to_assgn/ch12/misc_plat/.

We do, however, require the kernel's misc framework support, and thus we register ourselves with it. Next, it's also key to understand this: our driver is a logical one, in the sense that there's no actual physical device or chip that it's driving. This is quite often the case (of course, you could say that here, the hardware being worked upon is RAM).

So, if we are to write a Linux character device driver belonging to this misc class, we will first need to register ourselves to it. Next, we will be in need of a unique (unused) minor number. Again, there is a way to have the kernel dynamically assign a free minor number to us. The following section covers these aspects and more.

Writing the misc driver code – part 1

Without further ado, let's look at the code to write a simple skeleton character misc device driver! (Well, snippets of the actual code; as always, I strongly advise you to git clone the book's GitHub repository, view it in detail, and try out the code yourself.)

Let's go through it step by step: in the init code of our first device driver (using the LKM framework), we must first register our driver with the appropriate Linux kernel's framework; in this case, with the misc framework. This is done via the misc_register() API. It takes one parameter, a pointer to a data structure of type miscdevice, which describes the miscellaneous device we are setting up:

// ch1/miscdrv/miscdrv.c
#define pr_fmt(fmt) "%s:%s(): " fmt, KBUILD_MODNAME, __func__
[...]
#include <linux/miscdevice.h>
#include <linux/fs.h>              /* the fops, file data structures */
[...]

static struct miscdevice llkd_miscdev = {
    .minor = MISC_DYNAMIC_MINOR, /* kernel dynamically assigns a free minor# */
    .name = "llkd_miscdrv",      /* when misc_register() is invoked, the kernel
             * will auto-create a device file as /dev/llkd_miscdrv ;
             * also populated within /sys/class/misc/ and /sys/devices/virtual/misc/ */
    .mode = 0666,            /* ... dev node perms set as specified here */
    .fops = &llkd_misc_fops, /* connect to this driver's 'functionality' */
};

static int __init miscdrv_init(void)
{
    int ret;
    struct device *dev;

    ret = misc_register(&llkd_miscdev);
    if (ret != 0) {
        pr_notice("misc device registration failed, aborting\n");
        return ret;
    }
    [ ... ]

In the miscdevice structure instance, we do the following:

We set the minor field to MISC_DYNAMIC_MINOR. This has the effect of requesting the kernel to dynamically assign us an available minor number (once registration is successful, this minor field gets populated with the actual minor number assigned).
We initialize the name field. On successful registration, this has the kernel framework automatically create a device node (of the form /dev/<name>) on our behalf! As expected, the type will be character, the major number will be 10, and the minor number will be a dynamically assigned value. This is (part of) the advantage of using a kernel framework; else, we might have had to devise a way to create the device node ourselves; by the way, the mknod(1) utility can create a device file when invoked with root privilege (or you have the CAP_MKNOD capability); it works by invoking the mknod(2) system call!
The permissions of the device node will be set to whatever you initialize the mode field to (here, we've deliberately kept it permissive and readable-writeable by all via the 0666 octal value).
We shall postpone the discussion of the file operations (fops) structure member to the section following this one.

All misc drivers are of the character type and use the same major number (10), but of course require unique minor numbers.

Understanding the connection between the process, the driver, and the kernel

Here, we will delve into just a bit of the kernel internals surrounding the successful registration of a character device driver on Linux. In effect, you will come to understand the workings of the underlying raw character driver framework.

The file_operations structure, or the fops (pronounced eff-opps), as it's commonly referred to, is of critical importance to driver authors; the majority of the members of the fops structure are function pointers – think of them as virtual methods. They represent all possible file-related system calls that could be issued on a (device) file. So, it has open, read, write, poll, mmap, release, and several more members (most of which are function pointers). A few of the members of this critical data structure are shown here:

// include/linux/fs.h
struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
[...]
    __poll_t (*poll) (struct file *, struct poll_table_struct *);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    unsigned long mmap_supported_flags;
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *, fl_owner_t id); 
    int (*release) (struct inode *, struct file *);
[...]
    int (*fadvise)(struct file *, loff_t, loff_t, int);
} __randomize_layout;

A key job of the driver author (or the underlying kernel framework) is to populate these function pointers, thus linking them to actual code within the driver. You needn't implement every single function, of course; please refer to the Handling unsupported methods section for details.

Now, let's assume you have written your driver to set up functions for some of the f_op methods. Once your driver is registered with the kernel, typically via a kernel framework, when any user space process (or thread) opens a device file registered to this driver, the kernel Virtual Filesystem Switch (VFS) layer will take over. Without going into deep detail, suffice it to say that the VFS allocates and initializes that process's open file data structure (struct file) for the device file. Now, recall the last line in our struct miscdevice initialization; it's this:

   .fops = &llkd_misc_fops, /* connect to this driver's 'functionality' */

This line of code has a key effect: it ties the process's file operations pointer (which is within the process' open file structure) to the device driver's file operations structure. The functionality – what the driver will do – is now set up for this device file!

Let's flesh this out. Now (after your driver has initialized itself), a user-mode process opens your driver's device file, by issuing the open(2) system call on it. Assuming all goes well (and it should), the process is now connected to your driver via the file_operations structure pointers deep inside the kernel. Here's a critical point: after the open(2) system call returns successfully, and the process issues any file-related system call foo() on that (device) file, the kernel VFS layer will, be having in an object-oriented fashion (we have pointed this out before in this book!), blindly and trustingly invoke the registered fops->foo() method! The file opened by the user space process, typically a device file in /dev, is internally represented by the struct file metadata structure (a pointer to this, struct file *filp, is passed along to the driver). So, in terms of pseudo-code, when user space issues a file-related system call foo(), this is what the kernel VFS layer effectively does:

/* pseudocode: kernel VFS layer (not the driver) */
if (filp->f_op->foo)
    filp->f_op->foo(); /* invoke the 'registered' driver method corresponding to 'foo()' */

Thus, if the user space process that opened a device file invokes the read(2) system call upon it, the kernel VFS will invoke filp->f_op->read(...), in effect, redirecting control to the device driver. Your job as the device driver author is to provide the functionality of read(2)! The same goes for all other file-related system calls. This, essentially, is how Unix and Linux implement the well-known if it's not a process, it's a file design principle.

Handling unsupported methods

You don't have to populate every member of the f_ops structure, only those that your driver supports. If that's the case, and you have populated a few methods but left out, say, the poll method, and a user space process invokes poll(2) on your device (perhaps you've documented the fact that it's not supposed to, but what if it does?), then what will happen? In cases like this, the kernel VFS, detecting that the foo pointer (in this example, poll) is NULL, returns an appropriate negative integer (in effect, following the same 0/-E protocol). The glibc code will multiply this by -1 and set the calling process's errno variable to that value, signaling that the system call failed.

Two points to be aware of:

Quite often, the negative errno value returned by the VFS isn't very intuitive. (For example, if you've set the read() function pointer of f_op to NULL, the VFS causes the EINVAL value to be sent back. This has the user space process think that read(2) failed because of an "Invalid argument" error, which simply isn't the case at all!)
The lseek(2) system call has the driver seek to a prescribed location in the file – here, of course, we mean in the device. The kernel deliberately names the f_op function pointer as llseek (notice the two 'l's). This is simply to remind you that the return value from lseek can be a 64-bit (long long) quantity. Now, for the majority of hardware devices, the lseek value is not meaningful, thus most drivers do not need to implement it (unlike filesystems). Now, the issue is this: even if you do not support lseek (you've set the llseek member of f_op to NULL), it still returns a random positive value, thus causing the user-mode app to incorrectly conclude that it succeeded. Hence, if you aren't implementing lseek, you are to do the following:
1. Explicitly set llseek to the special no_llseek value, which will cause a failure value (-ESPIPE; illegal seek) to be returned.
2. In such cases, you are to also invoke the nonseekable_open() function in your driver's open() method, specifying that the file is non-seekable (this is often called like this in the open() method: return nonseekable_open(struct inode *inode, struct file *filp);. The details, and more, are covered in the LWN articles here: https://lwn.net/Articles/97154/. You can see the changes this wrought to many drivers here: https://lwn.net/Articles/97180/).

An appropriate value to return if you aren't supporting a function is -ENOSYS, which will have the user-mode process see the error Function not implemented (when it invokes the perror(3) or strerror(3) library APIs). This is clear, unambiguous; the user space developer will now understand that your driver does not support this function. Thus, one way to implement your driver is to set up pointers to all the file operation methods, and write a routine for all file-related system calls (the f_op methods) in your driver. For the ones you do support, write the code; for the ones you do not implement, just return the value -ENOSYS. Though a bit painstaking to do, it will result in unambiguous return values to user space.

Writing the misc driver code – part 2

Armed with this knowledge, look again at the init code of ch1/miscdrv/miscdrv.c. You will see that, just as described in the previous section, we have initialized the fops member of the miscdev struct to a file_operations structure, thus setting up the functionality of the driver. The relevant code snippet (from our driver) is as follows:

static const struct file_operations llkd_misc_fops = {
    .open = open_miscdrv,
    .read = read_miscdrv,
    .write = write_miscdrv,
    .release = close_miscdrv,
};

static struct miscdevice llkd_miscdev = {
    [ ... ]
    .fops = &llkd_misc_fops,     /* connect to this driver's 'functionality' */
};

So, now you can see it: when a user space process (or thread) that has opened our device file invokes, say, a read(2) system call, the kernel VFS layer will follow the pointers (generically, filp->f_op->foo()) and invoke the function, read_miscdrv(), in effect handing over control to the device driver! How exactly the read method is written is covered in the next section.

Continuing with the init code of our simple misc driver:

    [ ... ] 
    /* Retrieve the device pointer for this device */
    dev = llkd_miscdev.this_device;
    pr_info("LLKD misc driver (major # 10) registered, minor# = %d,"
            " dev node is /dev/%s\n", llkd_miscdev.minor, llkd_miscdev.name);
    dev_info(dev, "sample dev_info(): minor# = %d\n", llkd_miscdev.minor);
    return 0;        /* success */
}

Our driver retrieves a pointer to the device structure – it's something required by every driver. Within the misc kernel framework, it's available within the this_device member of our miscdevice structure.

Next, pr_info() shows the minor number dynamically obtained. The dev_info() helper routine is more interesting: as a driver author, you are expected to use these dev_xxx() helpers when emitting printk; it will also prefix useful information about the device. The only difference in syntax between the dev_xxx() and pr_xxx() helpers is that the first parameter to the former is the pointer to the device structure.

Okay, let's get our hands dirty! We build the driver and insmod it into kernel space (we use our lkm helper script to do so):

Figure 1.4 – Screenshot of building and loading our miscdrv.ko skeleton misc driver on an x86_64 Ubuntu VM

(By the way, as you can see in Figure 1.4, I tried out this misc driver on a more recent distro: Ubuntu 20.04.1 LTS running the 5.4.0-58-generic kernel.) Notice the two prints toward the bottom of Figure 1.4; the first is emitted via the pr_info() (prefixed with the pr_fmt() macro content, as explained in the companion guide Linux Kernel Programming - Chapter 4, Writing Your First Kernel Module - LKMs Part 1 section Standardizing printk output via the pr_fmt macro). The second print is emitted via the dev_info() helper routine – it's prefixed with the words misc llkd_miscdrv, indicating that it originated from the kernel's misc framework, and specifically from the llkd_miscdrv device! (The dev_xxx() routines are versatile; depending on the bus they're on, they will display various details. This is useful for debugging and logging purposes. We repeat: you're recommended to use the dev_*() routines when writing drivers.) You can also see that the /dev/llkd_miscdrv device node is indeed created, with the expected type (character) and major and minor pair (10 and 56 here).

Writing the misc driver code – part 3

Now, the init code is done, the driver functionality has been set up via the file operations structure, and the driver is registered to the kernel misc framework. So, what happens next? Well, nothing really, until a process opens the device file (associated with your driver) and performs I/O (Input/Output, i.e., reads/writes) of some sort.

So, let's assume that a user-mode process (or thread) issues the open(2) system call on your driver's device node (recall, the device node has been auto-created when the driver registered itself to the kernel's misc framework). Most important, as you learned in the Understanding the connection between the process, the driver, and the kernel section, for any file-related system calls issued upon your device node, the VFS will essentially invoke the driver's (f_op) registered method. So, here, the VFS will do this: filp->f-op->open(), thus invoking our driver's open method within our file_operations structure, which is the open_miscdrv() function!

But how should you, the driver author, implement this code of the open method of your driver? The key point is this: the signature of your open function should be identical to that of the file_operation structure open; in fact, this is true of any function. Thus, we implement the open_miscdrv() function like this:

/*
 * open_miscdrv()
 * The driver's open 'method'; this 'hook' will get invoked by the kernel VFS
 * when the device file is opened. Here, we simply print out some relevant info.
 * The POSIX standard requires open() to return the file descriptor on success;
 * note, though, that this is done within the kernel VFS (when we return). So,
 * all we do here is return 0 indicating success.
 * (The nonseekable_open(), in conjunction with the fop's llseek pointer set to
 * no_llseek, tells the kernel that our device is not seek-able).
 */
static int open_miscdrv(struct inode *inode, struct file *filp)
{
    char *buf = kzalloc(PATH_MAX, GFP_KERNEL);

    if (unlikely(!buf))
        return -ENOMEM;
    PRINT_CTX(); // displays process (or atomic) context info
    pr_info(" opening \"%s\" now; wrt open file: f_flags = 0x%x\n",
        file_path(filp, buf, PATH_MAX), filp->f_flags);
    kfree(buf);
    return nonseekable_open(inode, filp);
}

Notice how the signature of our open routine, the open_miscdrv() function, precisely matches that of the f_op structure's open function pointer (you can always lookup the file_operations structure for 5.4 Linux here at https://elixir.bootlin.com/linux/v5.4/source/include/linux/fs.h#L1814).

In this simple driver, in our open method, we don't really have much to do. We allocate some memory for a buffer (to hold the pathname of our device) via kzalloc(), issue our PRINT_CTX() macro (it's in the convenient.h header) to show the current context – the process that is currently opening the device. We then emit a printk (via pr_info()) showing a few VFS layer details (the pathname and open flags value); you can get the path name of a file by using the convenience API file_path(), as we do here (to do so, we need to allocate and, after usage, free a kernel memory buffer). Then, as we don't support seeking in this driver, we invoke the nonseekable_open() API (as discussed in the Handling unsupported methods section).

The open(2) system call on the device file should succeed. The user-mode process will now have a valid file descriptor – a handle to the open file (which, here, is actually a device node). Now, let's say the user-mode process wants to read data from the hardware; it therefore issues the read(2) system call. As explained already, the kernel VFS will now auto-invoke our driver's read method, read_miscdrv(). Again, its signature exactly imitates the read function signature from the file_operations data structure. Here's the simple code of our driver's read method:

/*
 * read_miscdrv()
 * The driver's read 'method'; it has effectively 'taken over' the read syscall
 * functionality! Here, we simply print out some info.
 * The POSIX standard requires that the read() and write() system calls return
 * the number of bytes read or written on success, 0 on EOF (for read) and -1 (-ve errno)
 * on failure; we simply return 'count', pretending that we 'always succeed'.
 */
static ssize_t read_miscdrv(struct file *filp, char __user *ubuf, size_t count, loff_t *off)
{
        pr_info("to read %zd bytes\n", count);
        return count;
}

The preceding comment is self-explanatory. Within it, we emit pr_info(), showing the number of bytes the user space process wants to read. Then, we simply return the number of bytes read, implying success! In reality, we have done (essentially) nothing. The remaining driver methods are quite similar.

Testing our simple misc driver

Let's test our really simple skeleton misc character driver (in the ch1/miscdrv directory; we assume you have built and inserted it as shown in Figure 1.4). We test it by issuing open(2), read(2), write(2), and close(2) system calls upon it; how exactly can we do so? We can always write a small C program to do precisely this, but an easier way is to use the useful dd(1) "disk duplicator" utility. We use it like this:

dd if=/dev/llkd_miscdrv of=readtest bs=4k count=1

Internally dd opens the file we pass it as a parameter (/dev/llkd_miscdrv) via if= (here, it's the first parameter to dd; if= specifies the input file), it will read from it (via the read(2) system call, of course). The output is to be written to the file specified by the parameter of= (the second parameter to dd, and is a regular file named readtest); the bs specifies the block size to perform I/O in and count is the number of times to perform I/O). After performing the required I/O, the dd process will close(2) the files. This sequence is reflected in the kernel log (Figure 1.5):

Figure 1.5 – Screenshot showing us minimally testing our miscdrv driver's read method via dd(1)

After verifying that our driver (LKM) is inserted, we issue the dd(1) command, having it read 4,096 bytes from our device (as the block size (bs) is set to 4k and count to 1). We have it write the output (via the of= option switch) to a file named readtest. Looking up the kernel log, you can see (Figure 1.5) that the dd process has indeed opened our device (our PRINT_CTX() macro's output shows that it's the process context currently running the code of our driver!). Next, we can see (via the output from pr_fmt()) that control goes to our driver's read method, within which we emit a simple printk and return the value 4096 signifying success (though we really didn't read anything!). The device is then closed by dd. Furthermore, a quick check with the hexdump(1) utility reveals that we did indeed receive 0x1000 (4,096) nulls (as expected) from the driver (in the file readtest; do realize that this is the case because dd initialized it's read buffer to NULLs).

The PRINT_CTX() macro we have used within the code lives within our convenient.h header. Do take a look; it's quite instructive (we try and emulate the kernel Ftrace infrastructure's latency output format, which reveals a lot of detail in a small space, a single line of output). This is explained in detail in Chapter 4, Handling Hardware Interrupts, in the Fully figuring out the context section. Don't worry about all the details for now...

Figure 1.6 shows how we (minimally) test writing to our driver, again via dd(1). This time we read 4k of random data (by leveraging the kernel's built-in mem driver's /dev/urandom facility), and write the random data to our device node; in effect, to our 'device':

Figure 1.6 – Screenshot showing us minimally testing our miscdrv driver's write method via dd(1)

(By the way, I have also included a simple user space test app for the driver; it can be found here: ch1/miscdrv/rdwr_test.c. I will leave it to you to read its code and try out.)

You might be thinking: we did apparently succeed in reading and writing data to and from user space to our driver, but, hang on, we never actually saw any data transfer taking place within the driver code. Yes, this is the topic of the next section: how you will actually copy the data from the user space process buffer into your kernel driver's buffer, and vice versa. Read on!

Copying data from kernel to user space and vice versa

A primary job of the device driver is to enable user space applications to transparently both read and write data to the peripheral hardware device (typically a chip of some sort; it may not be hardware at all though), treating the device as though it were simply a regular file. Thus, to read data from the device, the application opens the device file corresponding to that device, thus obtaining a file descriptor, and then simply issues a read(2) system call using that fd (step 1 in Figure 1.7)! The kernel VFS intercepts the read, and, as we have seen, has control flow to the underlying device driver's read method (which is a C function, of course). The driver code now "talks" to the hardware device, actually performing the I/O, the read operation. (The specifics of how exactly the hardware read (or write) is performed depends very much on the type of hardware – is it a memory-mapped device, a port, a network chip, and so on? We will not delve further into this here; the next chapter does.) The driver, having read data from the device, now places this data into a kernel buffer, kbuf (step 2 in the following diagram. Of course, we assume the driver author allocated memory for it via [k|v]malloc() or another suitable kernel API).

We now have the hardware device data in a kernel space buffer. How should we transfer it to the user space process's memory buffer? We shall exploit kernel APIs that make it easy to do so; this is covered next.

Leveraging kernel APIs to perform the data transfer

Now, as mentioned previously, let's assume your driver has read in the hardware data, and that it's now present in a kernel memory buffer. How do we transfer it to user space? A naive approach would be to simply try and perform this via memcpy(), but no, that does not work (why? one, it's insecure and two, it's very arch-dependent; it works on some architectures and not on others). So, a key point: the kernel provides a couple of inline functions to transfer data from kernel to user space and vice versa. They are copy_to_user() and copy_from_user(), respectively, and are indeed very commonly used.

Using them is simple. Both take three parameters: the to pointer (destination buffer), the from pointer (source buffer), and n, the number of bytes to copy (think of it as you would for a memcpy operation):

include <linux/uaccess.h>   /* Note! used to be <asm/uaccess.h> upto 4.11 */

unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);

The return value is the number of uncopied bytes; in other words, a return value of 0 indicates success and a non-zero return value indicates that the given number of bytes were not copied. If a non-zero return occurs, you should (following the usual 0/-E return convention) return an error indicating an I/O fault by returning -EIO or -EFAULT (which thus sets errno in user space to the positive counterpart). The following (pseudo) code illustrates how a device driver can use the copy_to_user() function to copy some data from kernel to user space:

static ssize_t read_method(struct file *filp, char __user *ubuf, size_t count, loff_t *off)
{
     char *kbuf = kzalloc(...);
     [ ... ]
     /* ... do what's required to get data from the hardware device into kbuf ... */
    if (copy_to_user(buf, kbuf, count)) {
        dev_warn(dev, "copy_to_user() failed\n");
        goto out_rd_fail;
    }
    [ ... ]
    return count;    /* success */
out_rd_fail:
    kfree(kbuf);
 return -EIO; /* or -EFAULT */
}

Here, of course, we assume you have a valid allocated kernel memory buffer, kbuf, and a valid device pointer (struct device *dev). Figure 1.7 illustrates what the preceding (pseudo) code is trying to achieve:

Figure 1.7 – Read: copy_to_user(): copying data from the hardware to a kernel buffer and from there to a user space buffer

The same semantics apply to using the copy_from_user() inline function. It is typically used in the context of the driver's write method, pulling in the data written by the user space process context to a kernel space buffer. We will leave it to you to visualize this.

It is also important to realize that both routines (copy_[from|to]_user()) might, during their run, cause the process context to (page) fault and thus sleep; in other words, to invoke the scheduler. Hence, they can only be used in a process context where it's safe to sleep and never in any kind of atomic or interrupt context (we explain more on the might_sleep() helper – a debug aid – in Chapter 4, Handling Hardware Interrupts, in the Don't block – spotting possibly blocking code paths section).

For the curious reader (I hope you are one!), here are some links with a bit more of a detailed explanation on why you cannot just use a simple memcpy() but must use the copy_[from|to]_user() inline functions to copy data from and to the kernel and user spaces:

In the following section, we shall write a more complete misc framework character device driver, which will actually perform some I/O, reading and writing data.

A misc driver with a secret

Now that you understand how to copy data between user and kernel space (and the reverse), let's write another device driver (ch1/miscdrv_rdwr) based on our previous skeleton (ch1/miscdrv/) miscellaneous driver. The key difference is that we use a few global data items (within a structure) throughout, and actually perform some I/O in the form of reads and writes. Here, let's introduce the notion of a driver context or private driver data structure; the idea is to have a conveniently accessible data structure that contains all relevant information in one place. Here, we name this structure struct drv_ctx (see it in the code listing that follows). On driver initialization, we allocate memory to and initialize it.

Okay, there's no real secret here, it just makes it sound interesting. One of the members within this driver context data structure of ours is a so-called secret message (it's the drv_ctx.oursecret member, along with some (fake) statistics and config words). This is the simple "driver context" or private data structure we propose using:

// ch1/miscdrv_rdwr/miscdrv_rdwr.c
[ ... ]
/* The driver 'context' (or private) data structure;
 * all relevant 'state info' reg the driver is here. */
struct drv_ctx {
    struct device *dev;
    int tx, rx, err, myword;
    u32 config1, config2;
    u64 config3;
#define MAXBYTES 128 /* Must match the userspace app; we should actually
                      * use a common header file for things like this */
    char oursecret[MAXBYTES];
};
static struct drv_ctx *ctx;

Great; now let's move on to seeing and understanding the code.

Writing the 'secret' misc device driver's code

We've divided this discussion on the implementation details of our secret misc character device driver into five parts: driver initialization, the read method, the write method functionality implementation, the driver cleanup, and finally, the userspace application that will use our device driver.

Our secret driver – the init code

In the init code of our secret device driver (a kernel module, of course, thus invoked upon insmod(8)), we first register the driver as a misc character driver with the kernel (via the misc_register() API, as seen in the Writing the misc driver code – part 1 section earlier; we won't repeat this code here).

Next, we allocate kernel memory for our driver's "context" structure – via the useful managed allocation devm_kzalloc() API (as you learned in the companion guide Linux Kernel Programming, Chapter 8, Kernel Memory Allocation for Module Authors – Part 1, in the Using the kernel's resource-managed memory allocation APIs section) – and initialize it. Notice that you must ensure you first get the device pointer dev before you can use this API; we retrieve it from our miscdevice structure's this_device member (as seen):

// ch1/miscdrv_rdwr/miscdrv_rdwr.c
[ ... ]
static int __init miscdrv_rdwr_init(void)
{
    int ret;
    struct device *dev;

    ret = misc_register(&llkd_miscdev);
    [ ... ]
    dev = llkd_miscdev.this_device;
    [ ... ]
    ctx = devm_kzalloc(dev, sizeof(struct drv_ctx), GFP_KERNEL);
    if (unlikely(!ctx))
        return -ENOMEM;

    ctx->dev = dev;
    strscpy(ctx->oursecret, "initmsg", 8);
    [ ... ]
    return 0;         /* success */
}

Okay, clearly, we have initialized the dev member of our ctx private structure instance as well as the 'secret' string to the 'initmsg' string (not a very convincing secret, but let's leave it at that). The idea here is that when a user space process (or thread) opens our device file and issues read(2) upon it, we pass back (copy) the secret to it; we do so by invoking the copy_to_user() helper function! Similarly, when the user-mode app writes data to us (yes, via the write(2) system call), we consider that data written to be the new secret. So, we fetch it from its user space buffer – via the copy_from_user() helper function – and update it in driver memory.

Why not simply use the strcpy() (or strncpy()) API to initialize the ctx->oursecret member? This is very important: they aren't safe enough security-wise. Also, the strlcpy() API has been marked as deprecated by the kernel community (https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy). In general, always avoid using deprecated stuff, as documented in the kernel documentation here: https://www.kernel.org/doc/html/latest/process/deprecated.html#deprecated-interfaces-language-features-attributes-and-conventions.

Quite clearly, the interesting parts of this new driver are the I/O functionality – the read and write methods; on with it!

Our secret driver – the read method

We will first show the relevant code for the read method – this is how a user space process (or thread) can read in the secret information housed within our driver (in its context structure):

static ssize_t
read_miscdrv_rdwr(struct file *filp, char __user *ubuf, size_t count, loff_t *off)
{
    int ret = count, secret_len = strlen(ctx->oursecret);
    struct device *dev = ctx->dev;
    char tasknm[TASK_COMM_LEN];

    PRINT_CTX();
    dev_info(dev, "%s wants to read (upto) %zd bytes\n", get_task_comm(tasknm, current), count);

    ret = -EINVAL;
    if (count < MAXBYTES) {
    [...] << we don't display some validity checks here >>

    /* In a 'real' driver, we would now actually read the content of the
     * [...]
     * Returns 0 on success, i.e., non-zero return implies an I/O fault).
     * Here, we simply copy the content of our context structure's 
     * 'secret' member to userspace. */
    ret = -EFAULT;
    if (copy_to_user(ubuf, ctx->oursecret, secret_len)) {
        dev_warn(dev, "copy_to_user() failed\n");
        goto out_notok;
    }
    ret = secret_len;

    // Update stats
    ctx->tx += secret_len; // our 'transmit' is wrt this driver
    dev_info(dev, " %d bytes read, returning... (stats: tx=%d, rx=%d)\n",
            secret_len, ctx->tx, ctx->rx);
out_notok:
    return ret;
}

The copy_to_user() routine does its job – it copies the ctx->oursecret source buffer to the destination pointer, the ubuf user space buffer, for secret_len bytes, thus transferring the secret to the user space app. Now, let's check out the driver's write method.

Our secret driver – the write method

The end user can change the secret by writing a new secret into the driver, via a write(2) system call to the driver's device node. The kernel redirects the write (via the VFS layer) to our driver's write method (as you learned in the Understanding the connection between the process, the driver, and the kernel section):

static ssize_t
write_miscdrv_rdwr(struct file *filp, const char __user *ubuf, size_t count, loff_t *off)
{
    int ret = count;
    void *kbuf = NULL;
    struct device *dev = ctx->dev;
    char tasknm[TASK_COMM_LEN];

    PRINT_CTX();
    if (unlikely(count > MAXBYTES)) { /* paranoia */
        dev_warn(dev, "count %zu exceeds max # of bytes allowed, "
                "aborting write\n", count);
        goto out_nomem;
    }
    dev_info(dev, "%s wants to write %zd bytes\n", get_task_comm(tasknm, current), count);

    ret = -ENOMEM;
    kbuf = kvmalloc(count, GFP_KERNEL);
    if (unlikely(!kbuf))
        goto out_nomem;
    memset(kbuf, 0, count);

    /* Copy in the user supplied buffer 'ubuf' - the data content
     * to write ... */
    ret = -EFAULT;
    if (copy_from_user(kbuf, ubuf, count)) {
        dev_warn(dev, "copy_from_user() failed\n");
        goto out_cfu;
     }

    /* In a 'real' driver, we would now actually write (for 'count' bytes)
     * the content of the 'ubuf' buffer to the device hardware (or 
     * whatever), and then return.
     * Here, we do nothing, we just pretend we've done everything :-)
     */
    strscpy(ctx->oursecret, kbuf, (count > MAXBYTES ? MAXBYTES : count));
    [...]
    // Update stats
    ctx->rx += count; // our 'receive' is wrt this driver

    ret = count;
    dev_info(dev, " %zd bytes written, returning... (stats: tx=%d, rx=%d)\n",
            count, ctx->tx, ctx->rx);
out_cfu:
    kvfree(kbuf);
out_nomem:
    return ret;
}

We employ the kvmalloc() API to allocate memory for a buffer to hold the user data that we will copy in. The actual copying is done via the copy_from_user() routine, of course. Here, we use it to copy the data passed by the user space app to our kernel buffer, kbuf. We then (via the strscpy() routine) update our driver's context structure's oursecret member to this value, thus updating the secret! (A subsequent read on the driver will now reveal the new secret.) Also, do notice the following:

How we now consistently use the dev_xxx() helpers in place of the usual printk routines. This is recommended for device drivers.
The (now typical) usage of goto to perform optimal error handling.

This covers the meat of the driver.

Our secret driver – cleanup

It's important to realize that we must free any buffers we have allocated. Here, however, as we performed a managed allocation in the init code (devm_kzalloc()), we have the benefit of not needing to worry about cleanup; the kernel handles it. Of course, in the driver's cleanup code path (invoked upon rmmod(8)), we deregister the misc driver with the kernel:

static void __exit miscdrv_rdwr_exit(void)
{
    misc_deregister(&llkd_miscdev);
    pr_info("LLKD misc (rdwr) driver deregistered, bye\n");
}

You will notice that we also, seemingly uselessly, use two global integers, ga and gb, in places in this version of the driver. Indeed, they have no real meaning here; the reason we have them at all becomes clear only in the last two chapters of this book, on kernel synchronization. Please ignore them for now.

On this note, you'll perhaps realize that the way we have arbitrarily accessed global data in this driver can cause concurrency issue (data races!); yes indeed; we shall set aside the deep and crucial coverage of kernel concurrency and synchronization to the book's last two chapters.

Our secret driver – the user space test app

Writing just the kernel component, the device driver, isn't quite enough; you also have to write a user space application that will actually make use of the driver. We will do so here. (Again, you could simply use dd(1) as well.)

In order to use the device driver, the user space app must first, of course, open the device file corresponding to it. (Here, to save space, we don't show the app code in its entirety, just the most relevant portions of it. We expect you to have cloned the book's Git repository and to work on the code.) The code to open the device file is as follows:

// ch1/miscdrv_rdwr/rdwr_test_secret.c
int main(int argc, char **argv)
{
    char opt = 'r';
    int fd, flags = O_RDONLY;
    ssize_t n;
    char *buf = NULL;
    size_t num = 0;
[...]
    if ('w' == opt)
        flags = O_WRONLY;
    fd = open(argv[2], flags, 0);
    if (fd == -1) {
    [...]

The second argument to this app is the device file to open. In order to read or write, the process will require memory:

    if ('w' == opt)
        num = strlen(argv[3])+1;    // IMP! +1 to include the NULL byte!
    else
        num = MAXBYTES;
    buf = malloc(num);
    if (!buf) {
        [...]

Moving along, let's see the block of code to have the app invoke a read or write (depending on the first parameter being r or w) on the (pseudo)device (for conciseness, we don't show the error handling code):

    if ('r' == opt) {
        n = read(fd, buf, num);
        if( n < 0 ) [...]
        printf("%s: read %zd bytes from %s\n", argv[0], n, argv[2]);
        printf("The 'secret' is:\n \"%.*s\"\n", (int)n, buf);
    } else {
        strncpy(buf, argv[3], num);
        n = write(fd, buf, num);
        if( n < 0 ) [ ... ]
        printf("%s: wrote %zd bytes to %s\n", argv[0], n, argv[2]);
    }
    [...]
    free(buf);
    close(fd);
    exit(EXIT_SUCCESS); 
}

(Before you try out this driver, do ensure the previous miscdrv driver's kernel module is unloaded.) Now, ensure that this driver is built and inserted, of course, else it will result in the open(2) system call failing. We have shown a couple of trial runs. First, let's build the user-mode app, insert the driver (not shown in Figure 1.8), and read from our just-created device node:

Figure 1.8 – miscdrv_rdwr: (minimally) testing the read; the original secret is revealed

The user-mode app successfully receives 7 bytes from the driver; it's the (initial) secret value, which it displays. The kernel log reflects the driver initialization, and a few seconds later, you can see (via the dev_xxx() instances of printk we emitted) that the rdwr_test_secret app runs the drivers' code in process context. The opening of the device, the running of the subsequent read, and the close methods are clearly seen. (Notice how the process name is truncated to rdwr_test_secre; this is as the task structure's comm member is the process name truncated to 16 characters.)

In Figure 1.9, we show the complementary act of writing to our device node, changing the secret value; a subsequent read indeed reveals that it has worked:

Figure 1.9 – miscdrv_rdwr: (minimally) testing the write; a new, excellent secret is written

The portion of the kernel log where the write takes place is highlighted in Figure 1.9. It works; I definitely encourage you to try this out yourself, looking up the kernel log as you go along.

Now, it's time to dig a little deeper. The reality is that as a driver author, you have to learn to be really careful regarding security, else all kinds of nasty surprises lie in wait. The next section gives you an understanding of this key area.

Issues and security concerns

An important consideration, for the budding driver author, is security. The trouble is, naive usage of even the very common copy_[from|to]_user() functions within your driver can let a malicious user quite easily – and illegally – overwrite memory to their advantage in both user and kernel spaces. How? The following section explains this in some detail; then, we will even show you a (bit contrived, but nevertheless, working) hack.

Hacking the secret driver

Think about this: we have the copy_to_user() helper routine; the first parameter is the destination to address, which should be a user space virtual address (a UVA), of course. Regular usage will comply with this and provide a legal and valid user space virtual address as the destination address, and all will be well.

But what if we don't? What if we pass another user space address, or, check this out – a kernel virtual address (a KVA) – in its place? The copy_to_user() code will now, running with kernel privileges, overwrite the destination with whatever data is in the source address (the second parameter) for the number of bytes in the third parameter! Indeed, hackers often attempt techniques such as this, to insert code posing as data into a user space buffer and execute it with kernel privilege, leading to a quite deadly privilege escalation (privesc) scenario.

To clearly demonstrate the adverse effects of not carefully designing and implementing a driver, we deliberately introduce errors (bugs, really!) into both the read and write methods of a 'bad' version of our previous driver (although here, we only consider the scenario with respect to the very common copy_[from|to]_user() routines and nothing else).

To get a more hands-on feel for this, we will write a "bad" version of our ch1/miscdrv_rdwr driver. We'll call it (ever so cleverly) ch1/bad_miscdrv. In this version, we deliberately have two buggy code paths built into it:

One within the driver's read method
The other, the more exciting one, as you shall soon see, within the write method.

Let's check both out. We'll begin with the buggy read.

Bad driver – buggy read()

To help you see what's changed in the code, we first perform a diff(1) of this (deliberately) bad driver code with our previous (good) version, yielding the differences, of course (in the following snippet, we curtail the output to only what's most relevant):

// in ch1/bad_miscdrv
$ diff -u ../miscdrv_rdwr/miscdrv_rdwr.c bad_miscdrv.c
[ ... ]
+#include <linux/cred.h>            // access to struct cred
#include "../../convenient.h"
[ ... ]
static ssize_t read_miscdrv_rdwr(struct file *filp, char __user *ubuf,
[ ... ]
+ void *kbuf = NULL;
+ void *new_dest = NULL;
[ ... ]
+#define READ_BUG
+//#undef READ_BUG
+#ifdef READ_BUG
[ ... ]
+ new_dest = ubuf+(512*1024);
+#else
+ new_dest = ubuf;
+#endif
[ ... ]
+ if (copy_to_user(new_dest, ctx->oursecret, secret_len)) {
[ ... ]

So, it should be quite clear: in our 'bad' driver's read method, if the READ_BUG macro is defined, we alter the user space destination pointer to point to an illegal location (512 KB beyond the location we should actually copy the data to!). This demonstrates the point here: we can do arbitrary stuff like this because we are running with kernel privileges. That it will cause issues and bugs is a separate matter.

Let's try it: first, do ensure that you've built and loaded the bad_miscdrv kernel module (you can use our lkm convenience script to do so). Our trial run, issuing a read(2) system call via our ch1/bad_miscdrv/rdwr_test_hackit user-mode app, results in failure (see the following screenshot):

Figure 1.10 – Screenshot showing our bad_miscdrv misc driver performing a "bad" read

Ah, this is interesting; our test application's (rdwr_test_hackit) read(2) system call does indeed fail, with the perror(3) routine indicating the cause of failure as Bad address. But why? Why didn't the driver, running with kernel privileges, actually write to the destination address (here, 0x5597245d46b0 , the wrong one; as we know, it's attempting to write 512 KB ahead of the correct destination address. We deliberately wrote the driver's read method code to do so).

This is because kernel ensures that the copy_[from|to]_user() routines will (ideally) fail when attempting to read or write illegal addresses! Internally, several checks are done: access_ok() is a simple one merely ensuring that I/O is performed within the expected segment (user or kernel). Modern Linux kernels have superior checking; besides the simple access_ok() check, the kernel then wades through – if enabled – the KASAN (Kernel Address Sanitizer, a compiler instrumentation feature; KASAN is indeed very useful, a must-do during development and test!), checks on object sizes (including overflow checks), and only then does it invoke the worker routine that performs the actual copy, raw_copy_[from|to]_user().

Okay, that's good; now, let's move on to the more interesting case, the buggy write, which we shall arrange (in a contrived manner though) to make into an attack! Read on...

Bad driver – buggy write() – a privesc!

What does the malicious hacker really want, their holy grail? A root shell on the system, of course (got root?). With a good deal of contrived code within our driver's write method (thus making this hack not a really good one; it's quite academic), let's go get it! To do so, we modify both the user-mode app as well as the device driver. Let's look at the user-mode app's changes first.

User space test app modifications

We slightly modify the user space application – our process context, in effect. This particular version of the user-mode test app differs from the earlier one in one regard: we now have a macro called HACKIT. If it's defined (it is by default), this process will deliberately write only zeroes into the user space buffer and send that to our bad driver's write method. If the driver has the DANGER_GETROOT_BUG macro defined (it is by default), then it will write the zeroes into the process's UID member, thus making the user-mode process obtain root privileges!

In the traditional Unix/Linux paradigm, if the Real User ID (RUID) and/or Effective User ID (EUID) (they're within the task structure, in struct cred) are set to the special value zero (0), it implies that the process has superuser (root) powers. Nowadays, the POSIX Capabilities model is considered a superior way to work with privileges, as it allows assigning fine-grained permissions – capabilities – on a thread, as opposed to giving a process or thread complete control over the system as root.

Here's a quick diff of the user space test app from the previous version, allowing you to see the changes made to the code (again, we curtail the output to only what's most relevant):

// in ch1/bad_miscdrv
$ diff -u ../miscdrv/rdwr_test.c rdwr_test_hackit.c
[ ... ]
+#define HACKIT
[ ... ]
+#ifndef HACKIT
+     strncpy(buf, argv[3], num);
+#else
+     printf("%s: attempting to get root ...\n", argv[0]);
+     /*
+      * Write only 0's ... our 'bad' driver will write this into
+      * this process's current->cred->uid member, thus making us
+      * root !
+      */
+     memset(buf, 0, num);
 #endif
- } else { // test writing ..
          n = write(fd, buf, num);
[ ... ]
+     printf("%s: wrote %zd bytes to %s\n", argv[0], n, argv[2]);
+#ifdef HACKIT
+     if (getuid() == 0) {
+         printf(" !Pwned! uid==%d\n", getuid());
+         /* the hacker's holy grail: spawn a root shell */
+         execl("/bin/sh", "sh", (char *)NULL);
+     }
+#endif
[ ... ]

This does imply that the (so-called) secret never gets written; that's okay. Now, let's look at the modifications made to the driver.

Device driver modifications

To see how our bad misc driver's write method changes, we will continue looking at the same diff (of our bad versus good drivers) that we did in the Bad driver – buggy read() section. The comments in the code from the following diff operation are quite self-explanatory. Check it out:

// in ch1/bad_miscdrv
$ diff -u ../miscdrv_rdwr/miscdrv_rdwr.c bad_miscdrv.c
[...]           
         // << this is within the driver's write method >>
 static ssize_t write_miscdrv_rdwr(struct file *filp, const char __user *ubuf,
 size_t count, loff_t *off)
 {
        int ret = count;
        struct device *dev = ctx->dev;
+       void *new_dest = NULL;
[ ... ]
+#define DANGER_GETROOT_BUG
+//#undef DANGER_GETROOT_BUG
+#ifdef DANGER_GETROOT_BUG
+     /* Make the destination of the copy_from_user() point to the current
+      * process context's (real) UID; this way, we redirect the driver to
+      * write zero's here. Why? Simple: traditionally, a UID == 0 is what
+      * defines root capability!
+      */
+      new_dest = &current->cred->uid;
+      count = 4; /* change count as we're only updating a 32-bit quantity */
+      pr_info(" [current->cred=%px]\n", (TYPECST)current->cred);
+#else
+      new_dest = kbuf;
+#endif

The key point from the preceding code is that when the DANGER_GETROOT_BUG macro is defined (it is by default), we set the new_dest pointer to the address of the (real) UID member within the credential structure, which is itself within the task structure (referenced by current) for this process context! (If all of this sounds foreign, please read the companion guide Linux Kernel Programming, Chapter 6, Kernel Internals Essentials – Processes and Threads). This way, when we invoke the copy_to_user() routine to perform the write to user space, it's going to actually write zeroes to the process UID member within current->cred. A UID of zero is what (traditionally) defines root. Also, notice how we restrict the write to 4 bytes (as we're just writing a 32-bit quantity).

(By the way, the build on our "bad" driver does issue a warning; here, with it being intentional, we merely ignore it):

Linux-Kernel-Programming-Part-2/ch1/bad_miscdrv/bad_miscdrv.c:229:11: warning: assignment discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
  229 | new_dest = &current->cred->uid;
      |          ^

Here's the copy_from_user() code invocation:

[...]
+       dev_info(dev, "dest addr = " ADDRFMT "\n", (TYPECST)new_dest);
        ret = -EFAULT;
-       if (copy_from_user(kbuf, ubuf, count)) {
+       if (copy_from_user(new_dest, ubuf, count)) {
                dev_warn(dev, "copy_from_user() failed\n");
                goto out_cfu;
        }
[...]

Clearly, the preceding copy_to_user() routine will write the user-supplied buffer, ubuf, into the new_dest destination buffer – which, crucially, we have made point to current->cred->uid – for count bytes.

Let's get root now

Of course, the proof of the pudding is in the eating, yes? So, let's give our hack a spin; here, we assume that you've first unloaded any previous version of our 'misc' drivers, and built and loaded the bad_miscdrv kernel module into memory:

Figure 1.11 – Screenshot showing our bad_miscdrv misc driver performing a "bad" write, resulting in root – a privesc!

Check it out; we indeed got root! Our rdwr_test_hackit app, detecting that we do have root (via a simple getuid(2)system call), then does the logical thing: it execs a root shell (via an execl(3) API), and voilà, we land up in a root shell. We show the kernel log:

$ dmesg 
[ 63.847549] bad_miscdrv:bad_miscdrv_init(): LLKD 'bad' misc driver (major # 10) registered, minor# = 56
[ 63.848452] misc bad_miscdrv: A sample print via the dev_dbg(): (bad) driver initialized
[ 84.186882] bad_miscdrv:open_miscdrv_rdwr(): 000) rdwr_test_hacki :2765 | ...0 /* open_miscdrv_rdwr() */
[ 84.190521] misc bad_miscdrv: opening "bad_miscdrv" now; wrt open file: f_flags = 0x8001
[ 84.191557] bad_miscdrv:write_miscdrv_rdwr(): 000) rdwr_test_hacki :2765 | ...0 /* write_miscdrv_rdwr() */
[ 84.192358] misc bad_miscdrv: rdwr_test_hacki wants to write 4 bytes to (original) ubuf = 0x55648b8f36b0
[ 84.192971] misc bad_miscdrv: [current->cred=ffff9f67765c3b40]
[ 84.193392] misc bad_miscdrv: dest addr = ffff9f67765c3b44 count=4
[ 84.193803] misc bad_miscdrv: 4 bytes written, returning... (stats: tx=0, rx=4)
[ 89.002675] bad_miscdrv:close_miscdrv_rdwr(): 000) [sh]:2765 | ...0 /* close_miscdrv_rdwr() */
[ 89.005992] misc bad_miscdrv: filename: "bad_miscdrv"
$

You can see how it's worked: the original user-mode buffer ubuf kernel virtual address is 0x55648b8f36b0. In the hack, we modify it to the new destination address (kernel virtual address), 0xffff9f67765c3b44, which is (in this case) the kernel virtual address of the UID member of struct cred (within the process's task structure). Not only that, but our driver also modifies the number of bytes to write (count) to 4 (bytes), as we're updating a 32-bit quantity.

Do note: these hacks are just that – hacks. They could certainly cause your system to become unstable (when run on our "debug" kernel, KASAN, in fact, detected a null pointer dereference!).

These demos prove nothing but the fact that you as a kernel and/or driver author must be alert to programming issues, security, and more at all times. With this, we complete this section and indeed the chapter.

Summary

This concludes this chapter on writing a simple misc class character device driver on the Linux OS; so, awesome, you now know the basics of writing a device driver on Linux!

The chapter began with an introduction to device basics, and importantly, the very brief essentials of the modern LDM. You then learned how to write a simple first character device driver, registering with the kernel's misc framework. Along the way, you also understood the connection between the process, the driver, and the kernel VFS. Copying data between user and kernel address spaces is essential; we saw how to do so. A more comprehensive demo misc driver (our 'secret' driver) showed you how to perform I/O – reads and writes – ferrying data between user and kernel space. A key part of this chapter is the last section, where you learned (well, made a start at least) about security and the driver; a "hack" even demonstrated a privesc attack!

As mentioned before, there's much more to this vast topic of writing drivers on Linux; indeed, whole books are devoted to it! Do check out the Further reading section for this chapter to find relevant books and online references.

In the following chapter you will learn a key task for a driver author - how exactly can you efficiently interface your device driver with user space processes; several useful approaches are covered in detail and contrasted. Do ensure you're clear on this chapter's material, work on the exercises given, review the Further reading resources and then dive into the next one.

Questions

Load up the first miscdrv skeleton misc driver kernel module and issue lseek(2) on it; what happens? (Does it succeed? What's the return value from lseek?) If not, okay, how will you fix this?
Write a misc class character driver that behaves as a simple converter program (assume its path name is /dev/convert). For example, writing the temperature in Fahrenheit units, it should return (write to the kernel log) the temperature in Celsius. Thus, doing echo 98.6 > /dev/convert should result in the value 37 C being written to the kernel log. Additionally, do the following:
1. Validate that the data passed to your driver is a numeric value.
2. How will you handle floating-point values? (Tip: refer to the section Floating point not allowed in the kernel in Linux Kernel Programming, Chapter 5, Writing Your First Kernel Module LKMs – Part 2.)
Write a "task display" driver; here, we'd like a user space process to write a thread (or process) PID to it. When you now read from the driver's device node (assume its path name is /dev/task_display), you should receive details regarding the task (which is pulled from its task structure, of course). For example, doing echo 1 > /dev/task_display followed by cat /dev/task_display should have the driver emit task details of PID 1 to the kernel log. Don't forget to add validity checks (check the PID is valid, and so on).
(A bit more advanced:) Write a "proper" LDM-based driver; the misc drivers covered here did register with the kernel's misc framework, but simply, implicitly, used the raw character interface as the bus. The LDM prefers that a driver must register with a kernel framework and a bus driver. Hence, write a "demo" driver that registers itself with the kernel's misc framework and the platform bus. This will involve creating a fake platform device as well.
(Note the following tips:
a) Do refer to Chapter 2, User-Kernel Communication Pathways, particularly the Creating a simple platform device and Platform devices sections.
b) A possible solution to this driver can be found here: solutions_to_assgn/ch12/misc_plat/.)