Kernel services are invoked in the context of the current process, and understanding the process and its layout opens the right path for exploring the kernel in more detail. Our effort in this chapter is centered on comprehending processes and the underlying ecosystem the kernel provides for them. We will explore the following concepts in this chapter:
- Program to process
- Process layout
- Virtual address spaces
- Kernel and user space
- Process APIs
- Process descriptors
- Kernel stack management
- Threads
- Linux thread API
- Data structures
- Namespaces and cgroups
Quintessentially, computing systems are designed, developed, and often tweaked for running user applications efficiently. Every element that goes into a computing platform is intended to enable effective and efficient ways for running applications. In other words, computing systems exist to run diverse application programs. Applications can run either as firmware in dedicated devices or as a "process" in systems driven by system software (operating systems).
At its core, a process is a running instance of a program in memory. The transformation from a program to a process happens when the program (on disk) is fetched into memory for execution.
A program’s binary image carries code (with all its binary instructions) and data (with all global data), which are mapped to distinct regions of memory with appropriate access permissions (read, write, and execute). Apart from code and data, a process is assigned additional memory regions called stack (for allocation of function call frames with auto variables and function arguments) and heap for dynamic allocations at runtime.
Multiple instances of the same program can exist with their respective memory allocations. For instance, for a web browser with multiple open tabs (running simultaneous browsing sessions), each tab is considered a process instance by the kernel, with unique memory allocations.
The following figure represents the layout of processes in memory:

Modern-day computing platforms are expected to handle a plethora of processes efficiently. Operating systems thus must deal with allocating unique memory to all contending processes within the physical memory (often finite) and also ensure their reliable execution. With multiple processes contending and executing simultaneously (multi-tasking), the operating system must ensure that the memory allocation of every process is protected from accidental access by another process.
To address this issue, the kernel provides a level of abstraction between the process and the physical memory called virtual address space. Virtual address space is the process's view of memory; it is how the running program views the memory.
Virtual address space creates an illusion that every process exclusively owns the whole memory while executing. This abstracted view of memory is called virtual memory and is achieved by the kernel's memory manager in coordination with the CPU's MMU. Each process is given a contiguous 32- or 64-bit address space, bound by the architecture and unique to that process. With each process caged into its virtual address space by the MMU, any attempt by a process to access an address region outside its boundaries triggers a hardware fault, making it possible for the memory manager to detect and terminate violating processes, thus ensuring protection.
The following figure depicts the illusion of address space created for every contending process:

Modern operating systems not only prevent one process from accessing the memory of another but also prevent processes from accidentally accessing or manipulating kernel data and services (as the kernel is shared by all the processes).
Operating systems achieve this protection by segmenting the whole memory into two logical halves, the user and kernel space. This bifurcation ensures that all processes that are assigned address spaces are mapped to the user space section of memory and kernel data and services run in kernel space. The kernel achieves this protection in coordination with the hardware. While an application process is executing instructions from its code segment, the CPU is operating in user mode. When a process intends to invoke a kernel service, it needs to switch the CPU into privileged mode (kernel mode), which is achieved through special functions called APIs (application programming interfaces). These APIs enable user processes to switch into the kernel space using special CPU instructions and then execute the required services through system calls. On completion of the requested service, the kernel executes another mode switch, this time back from kernel mode to user mode, using another set of CPU instructions.
Note
System calls are the kernel's interfaces to expose its services to application processes; they are also called kernel entry points. As system calls are implemented in kernel space, the respective handlers are provided through APIs in the user space. API abstraction also makes it easier and convenient to invoke related system calls.
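To make the API-to-system-call relationship concrete, the following user-space sketch (an illustrative example, not from the text) invokes the same kernel service twice: once through the glibc API wrapper and once directly through the generic syscall(2) entry point.

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
        pid_t a = getpid();                    /* convenient user-space API */
        pid_t b = (pid_t)syscall(SYS_getpid);  /* direct system call (kernel entry point) */

        printf("getpid(): %d, syscall(SYS_getpid): %d\n", (int)a, (int)b);
        return 0;
}

Both calls return the same PID; the API merely wraps the mode switch and the system call invocation behind a convenient C function.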
The following figure depicts a virtualized memory view:

When a process requests a kernel service through a system call, the kernel will execute on behalf of the caller process. The kernel is now said to be executing in process context. Similarly, the kernel also responds to interrupts raised by other hardware entities; here, the kernel executes in interrupt context. When in interrupt context, the kernel is not running on behalf of any process.
Right from the time a process is born until it exits, it’s the kernel's process management subsystem that carries out various operations, ranging from process creation, allocating CPU time, and event notifications to destruction of the process upon termination.
Apart from the address space, a process in memory is also assigned a data structure called the process descriptor, which the kernel uses to identify, manage, and schedule the process. The following figure depicts process address spaces with their respective process descriptors in the kernel:

In Linux, a process descriptor is an instance of type struct task_struct, defined in <linux/sched.h>. It is one of the central data structures and contains all the attributes, identification details, and resource allocation entries that a process holds. Looking at struct task_struct is like a peek through a window into what the kernel sees or works with to manage and schedule a process.
Since the task structure contains a wide set of data elements, which are related to the functionality of various kernel subsystems, it would be out of context to discuss the purpose and scope of all the elements in this chapter. We shall consider a few important elements that are related to process management.
Process attributes define all the key and fundamental characteristics of a process. These elements contain the process's state and identifications along with other key values of importance.
A process, right from the time it is spawned until it exits, may exist in various states, referred to as process states--they define the process’s current state:
- TASK_RUNNING (0): The task is either executing or contending for CPU in the scheduler run-queue.
- TASK_INTERRUPTIBLE (1): The task is in an interruptible wait state; it remains in wait until an awaited condition becomes true, such as the availability of mutual exclusion locks, device ready for I/O, lapse of sleep time, or an exclusive wake-up call. While in this wait state, any signals generated for the process are delivered, causing it to wake up before the wait condition is met.
- TASK_KILLABLE: This is similar to TASK_INTERRUPTIBLE, with the exception that interruptions can only occur on fatal signals, which makes it a better alternative to TASK_INTERRUPTIBLE.
- TASK_UNINTERRUPTIBLE (2): The task is in an uninterruptible wait state, similar to TASK_INTERRUPTIBLE, except that signals generated for the sleeping process do not cause wake-up. When the event occurs for which it is waiting, the process transitions to TASK_RUNNING. This process state is rarely used.
- TASK_STOPPED (4): The task has received a STOP signal. It will be back to running on receiving the continue signal (SIGCONT).
- TASK_TRACED (8): A process is said to be in the traced state when it is being probed, most likely by a debugger.
- EXIT_ZOMBIE (32): The process is terminated, but its resources are not yet reclaimed.
- EXIT_DEAD (16): The child is terminated and all the resources held by it are freed, after the parent collects the exit status of the child using wait.
The following figure depicts process states:

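As an illustration of how these states are used in practice, the following kernel-side sketch (illustrative only; the wait queue and the condition variable are assumptions supplied by surrounding code) shows the common pattern of moving a task into TASK_INTERRUPTIBLE, calling schedule(), and returning to TASK_RUNNING once the awaited condition holds:

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
static int demo_condition;      /* set to 1 by another context before waking us */

static void demo_wait_for_event(void)
{
        DEFINE_WAIT(wait);

        prepare_to_wait(&demo_wq, &wait, TASK_INTERRUPTIBLE); /* state -> TASK_INTERRUPTIBLE */
        if (!demo_condition)
                schedule();                                   /* give up the CPU until woken */
        finish_wait(&demo_wq, &wait);                         /* state -> TASK_RUNNING */
}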
This field contains a unique process identifier referred to as PID. PIDs in Linux are of the type pid_t (integer). Though a PID is an integer, the default maximum number of PIDs is 32,768, specified through the /proc/sys/kernel/pid_max interface. The value in this file can be set to any value up to 2^22 (PID_MAX_LIMIT, approximately 4 million).
To manage PIDs, the kernel uses a bitmap. This bitmap allows the kernel to keep track of PIDs in use and assign a unique PID for new processes. Each PID is identified by a bit in the PID bitmap; the value of a PID is determined from the position of its corresponding bit. Bits with value 1 in the bitmap indicate that the corresponding PIDs are in use, and those with value 0 indicate free PIDs. Whenever the kernel needs to assign a unique PID, it looks for the first unset bit and sets it to 1, and conversely to free a PID, it toggles the corresponding bit from 1 to 0.
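The following user-space sketch (a simplified illustration, not the kernel's actual pidmap implementation) mirrors that scheme: the first clear bit identifies a free ID, setting it marks the ID as in use, and clearing it frees the ID again.

#include <limits.h>

#define MAX_IDS 32768
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

static unsigned long bitmap[MAX_IDS / BITS_PER_WORD];

static int alloc_id(void)
{
        for (int id = 0; id < MAX_IDS; id++) {
                unsigned long *word = &bitmap[id / BITS_PER_WORD];
                unsigned long mask = 1UL << (id % BITS_PER_WORD);

                if (!(*word & mask)) {   /* first clear bit => free ID */
                        *word |= mask;   /* mark it as in use */
                        return id;
                }
        }
        return -1;                       /* bitmap exhausted */
}

static void free_id(int id)
{
        bitmap[id / BITS_PER_WORD] &= ~(1UL << (id % BITS_PER_WORD));
}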
This field contains the thread group id. For easy understanding, let's say when a new process is created, its PID and TGID are the same, as the process happens to be the only thread. When the process spawns a new thread, the new child gets a unique PID but inherits the TGID from the parent, as it belongs to the same thread group. The TGID is primarily used to support multi-threaded processes. We will delve into further details in the threads section of this chapter.
This field holds processor-specific state information and is a critical element of the task structure. Later sections of this chapter contain details about the importance of thread_info.
The flags field records various attributes corresponding to a process. Each bit in the field corresponds to various stages in the lifetime of a process. Per-process flags are defined in <linux/sched.h>:
#define PF_EXITING           /* getting shut down */
#define PF_EXITPIDONE        /* pi exit done on shut down */
#define PF_VCPU              /* I'm a virtual CPU */
#define PF_WQ_WORKER         /* I'm a workqueue worker */
#define PF_FORKNOEXEC        /* forked but didn't exec */
#define PF_MCE_PROCESS       /* process policy on mce errors */
#define PF_SUPERPRIV         /* used super-user privileges */
#define PF_DUMPCORE          /* dumped core */
#define PF_SIGNALED          /* killed by a signal */
#define PF_MEMALLOC          /* Allocating memory */
#define PF_NPROC_EXCEEDED    /* set_user noticed that RLIMIT_NPROC was exceeded */
#define PF_USED_MATH         /* if unset the fpu must be initialized before use */
#define PF_USED_ASYNC        /* used async_schedule*(), used by module init */
#define PF_NOFREEZE          /* this thread should not be frozen */
#define PF_FROZEN            /* frozen for system suspend */
#define PF_FSTRANS           /* inside a filesystem transaction */
#define PF_KSWAPD            /* I am kswapd */
#define PF_MEMALLOC_NOIO     /* Allocating memory without IO involved */
#define PF_LESS_THROTTLE     /* Throttle me less: I clean memory */
#define PF_KTHREAD           /* I am a kernel thread */
#define PF_RANDOMIZE         /* randomize virtual address space */
#define PF_SWAPWRITE         /* Allowed to write to swap */
#define PF_NO_SETAFFINITY    /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY         /* Early kill for mce process policy */
#define PF_MUTEX_TESTER      /* Thread belongs to the rt mutex tester */
#define PF_FREEZER_SKIP      /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK      /* this thread called freeze_processes and should not be frozen */
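As a hedged illustration of how these bits are consumed (the helper below is hypothetical and assumes it is compiled as kernel code, where the current macro is available), kernel paths commonly test a flag on the current task before deciding how to proceed:

#include <linux/sched.h>
#include <linux/types.h>

/* Hypothetical helper: true if the current task is a kernel thread
 * (PF_KTHREAD is set for tasks spawned by kthreadd). */
static bool demo_is_kernel_thread(void)
{
        return (current->flags & PF_KTHREAD) != 0;
}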
Every process can be related to a parent process, establishing a parent-child relationship. Similarly, multiple processes spawned by the same process are called siblings. These fields establish how the current process relates to another process.
These are pointers to the parent's task structure. For a normal process, both these pointers refer to the same task_struct; they differ only for multi-threaded processes implemented using posix threads. For such cases, real_parent refers to the parent thread's task structure, and parent refers to the task structure of the process to which SIGCHLD is delivered.
All contending processes must be given fair CPU time, and this calls for scheduling based on time slices and process priorities. These attributes contain necessary information that the scheduler uses when deciding on which process gets priority when contending.
prio helps determine the priority of the process for scheduling. This field holds the static priority of the process within the range 1 to 99 (as specified by sched_setscheduler()) if the process is assigned a real-time scheduling policy. For normal processes, this field holds a dynamic priority derived from the nice value.
Every task belongs to a scheduling entity (group of tasks), as scheduling is done at a per-entity level. se is for all normal processes, rt is for real-time processes, and dl is for deadline processes. We will discuss more on these attributes in the next chapter on scheduling.
This field contains information about the scheduling policy of the process, which helps in determining its priority.
The kernel imposes resource limits to ensure fair allocation of system resources among contending processes. These limits guarantee that a random process does not monopolize ownership of resources. There are 16 different types of resource limits, and the task structure points to an array of type struct rlimit, in which each offset holds the current and maximum values for a specific resource.
/*include/uapi/linux/resource.h*/
struct rlimit {
__kernel_ulong_t rlim_cur;
__kernel_ulong_t rlim_max;
};
These limits are specified in include/uapi/asm-generic/resource.h:
#define RLIMIT_CPU 0 /* CPU time in sec */
#define RLIMIT_FSIZE 1 /* Maximum filesize */
#define RLIMIT_DATA 2 /* max data size */
#define RLIMIT_STACK 3 /* max stack size */
#define RLIMIT_CORE 4 /* max core file size */
#ifndef RLIMIT_RSS
# define RLIMIT_RSS 5 /* max resident set size */
#endif
#ifndef RLIMIT_NPROC
# define RLIMIT_NPROC 6 /* max number of processes */
#endif
#ifndef RLIMIT_NOFILE
# define RLIMIT_NOFILE 7 /* max number of open files */
#endif
#ifndef RLIMIT_MEMLOCK
# define RLIMIT_MEMLOCK 8 /* max locked-in-memory
address space */
#endif
#ifndef RLIMIT_AS
# define RLIMIT_AS 9 /* address space limit */
#endif
#define RLIMIT_LOCKS 10 /* maximum file locks held */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIMIT_MSGQUEUE 12 /* maximum bytes in POSIX mqueues */
#define RLIMIT_NICE 13 /* max nice prio allowed to
raise to 0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
#define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
#define RLIM_NLIMITS 16
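From user space, these limits are queried and adjusted with the getrlimit()/setrlimit() system calls. The sketch below (an illustrative example, not from the text) reads and then lowers the soft limit on open file descriptors for the calling process:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) == -1) {
                perror("getrlimit");
                return 1;
        }
        printf("RLIMIT_NOFILE: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        rl.rlim_cur = 64;                /* lower the soft (current) limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) == -1) {
                perror("setrlimit");
                return 1;
        }
        return 0;
}

The soft limit (rlim_cur) can be raised only up to the hard limit (rlim_max); raising the hard limit requires the appropriate privilege.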
During its lifetime, a process may access various files to get its task done, which results in the process opening, closing, reading, and writing to these files. The system must keep track of these activities; the file descriptor elements of the task structure help the kernel know which files the process holds.
For processes to handle signals, the task structure has various elements that determine how the signals must be handled.
This is of type struct signal_struct, which contains information on all the signals associated with the process.
This is of type struct sighand_struct, which contains all signal handlers associated with the process.
With current-generation computing platforms powered by multi-core hardware capable of running simultaneous applications, multiple processes can concurrently initiate a kernel mode switch to request the same kernel service. To handle such situations, kernel services are designed to be re-entrant, allowing multiple processes to step in and engage the required services. This mandates each requesting process to maintain its own private kernel stack to keep track of the kernel function call sequence, store local data of the kernel functions, and so on.
The kernel stack is directly mapped to physical memory, mandating the arrangement to be physically contiguous. The kernel stack by default is 8 KB for x86-32 and most other 32-bit systems (with an option of a 4 KB kernel stack configurable during kernel build), and 16 KB on an x86-64 system.
When kernel services are invoked in the current process context, they need to validate the process’s prerogative before committing to any relevant operations. To perform such validations, the kernel services must gain access to the task structure of the current process and look through the relevant fields. Similarly, kernel routines might need access to the current task structure to modify various resource structures such as the signal handler tables, to look for pending signals, or to reach the file descriptor table and memory descriptor, among others. To enable accessing the task structure at runtime, the address of the current task structure is loaded into a processor register (the register chosen is architecture specific) and made available through a kernel global macro called current (defined in the architecture-specific kernel header asm/current.h):
/* arch/ia64/include/asm/current.h */
#ifndef _ASM_IA64_CURRENT_H
#define _ASM_IA64_CURRENT_H
/*
* Modified 1998-2000
* David Mosberger-Tang <davidm@hpl.hp.com>, Hewlett-Packard Co
*/
#include <asm/intrinsics.h>
/*
* In kernel mode, thread pointer (r13) is used to point to the current
* task structure.
*/
#define current ((struct task_struct *) ia64_getreg(_IA64_REG_TP))
#endif /* _ASM_IA64_CURRENT_H */
/* arch/powerpc/include/asm/current.h */
#ifndef _ASM_POWERPC_CURRENT_H
#define _ASM_POWERPC_CURRENT_H
#ifdef __KERNEL__
/*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
struct task_struct;
#ifdef __powerpc64__
#include <linux/stddef.h>
#include <asm/paca.h>
static inline struct task_struct *get_current(void)
{
struct task_struct *task;
__asm__ __volatile__("ld %0,%1(13)"
: "=r" (task)
: "i" (offsetof(struct paca_struct, __current)));
return task;
}
#define current get_current()
#else
/*
* We keep `current' in r2 for speed.
*/
register struct task_struct *current asm ("r2");
#endif
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_CURRENT_H */
However, in register-constricted architectures, where there are few registers to spare, reserving a register to hold the address of the current task structure is not viable. On such platforms, the task structure of the current process is directly made available at the top of the kernel stack that it owns. This approach renders a significant advantage with respect to locating the task structure, by just masking the least significant bits of the stack pointer.
With the evolution of the kernel, the task structure grew and became too large to be contained in the kernel stack, which is already restricted in physical memory (8 KB). As a result, the task structure was moved out of the kernel stack, barring a few key fields that define the process's CPU state and other low-level processor-specific information. These fields were then wrapped in a newly created structure called struct thread_info. This structure resides on top of the kernel stack and provides a pointer that refers to the current task structure, which can be used by kernel services.
The following code snippet shows struct thread_info for the x86 architecture (kernel 3.10):
/* linux-3.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
        struct task_struct *task;          /* main task structure */
        struct exec_domain *exec_domain;   /* execution domain */
        __u32 flags;                       /* low level flags */
        __u32 status;                      /* thread synchronous flags */
        __u32 cpu;                         /* current CPU */
        int preempt_count;                 /* 0 => preemptable, <0 => BUG */
        mm_segment_t addr_limit;
        struct restart_block restart_block;
        void __user *sysenter_return;
#ifdef CONFIG_X86_32
        unsigned long previous_esp;        /* ESP of the previous stack in
                                              case of nested (IRQ) stacks */
        __u8 supervisor_stack[0];
#endif
        unsigned int sig_on_uaccess_error:1;
        unsigned int uaccess_err:1;        /* uaccess failed */
};
With thread_info containing process-related information apart from the task structure, the kernel has multiple viewpoints to the current process structure: struct task_struct, an architecture-independent information block, and thread_info, an architecture-specific one. The following figure depicts thread_info and task_struct:

For architectures that engage thread_info, the current macro's implementation is modified to look into the top of the kernel stack to obtain a reference to the current thread_info and, through it, the current task structure. The following code snippet shows the implementation of current for an x86-64 platform:
#ifndef __ASM_GENERIC_CURRENT_H
#define __ASM_GENERIC_CURRENT_H

#include <linux/thread_info.h>

#define get_current() (current_thread_info()->task)
#define current get_current()

#endif /* __ASM_GENERIC_CURRENT_H */

/*
 * how to get the current stack pointer in C
 */
register unsigned long current_stack_pointer asm ("sp");

/*
 * how to get the thread information struct from C
 */
static inline struct thread_info *current_thread_info(void) __attribute_const__;

static inline struct thread_info *current_thread_info(void)
{
        return (struct thread_info *)
                (current_stack_pointer & ~(THREAD_SIZE - 1));
}
As the use of PER_CPU variables has increased in recent times, the process scheduler is tuned to cache crucial current-process-related information in the PER_CPU area. This change enables quick access to current process data compared to looking it up on the kernel stack. The following code snippet shows the implementation of the current macro to fetch the current task data through the PER_CPU variable:
#ifndef _ASM_X86_CURRENT_H
#define _ASM_X86_CURRENT_H

#include <linux/compiler.h>
#include <asm/percpu.h>

#ifndef __ASSEMBLY__
struct task_struct;

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
        return this_cpu_read_stable(current_task);
}

#define current get_current()

#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_CURRENT_H */
The use of PER_CPU data led to a gradual reduction of information in thread_info. With thread_info shrinking in size, kernel developers are considering getting rid of thread_info altogether by moving it into the task structure. As this involves changes to low-level architecture code, it has only been implemented for the x86-64 architecture, with other architectures planned to follow. The following code snippet shows the current state of the thread_info structure with just one element:
/* linux-4.9.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
        unsigned long flags;    /* low level flags */
};
Unlike user mode, the kernel mode stack lives in directly mapped memory. When a process invokes a kernel service, which may internally be deeply nested, chances are that it may overrun into the immediate memory range. The worst part is that the kernel will be oblivious to such occurrences. Kernel programmers usually engage various debug options to track stack usage and detect overruns, but these methods are not handy to prevent stack breaches on production systems. Conventional protection through the use of guard pages is also ruled out here (as it wastes an actual memory page).
Kernel programmers tend to follow coding standards--minimizing the use of local data, avoiding recursion, and avoiding deep nesting among others--to cut down the probability of a stack breach. However, implementation of feature-rich and deeply layered kernel subsystems may pose various design challenges and complications, especially with the storage subsystem where filesystems, storage drivers, and networking code can be stacked up in several layers, resulting in deeply nested function calls.
The Linux kernel community had been pondering over preventing such breaches for quite long, and toward that end, the decision was made to expand the kernel stack to 16 KB (x86-64, since kernel 3.15). Expansion of the kernel stack might prevent some breaches, but at the cost of engaging much of the directly mapped kernel memory for the per-process kernel stack. However, for reliable functioning of the system, the kernel is expected to elegantly handle stack breaches when they show up on production systems.
With the 4.9 release, the kernel came with a new system to set up virtually mapped kernel stacks. Since virtual addresses are currently in use to map even a directly mapped page, principally the kernel stack does not actually require physically contiguous pages. The kernel reserves a separate range of addresses for virtually mapped memory, and addresses from this range are allocated when a call to vmalloc() is made. This range of memory is referred to as the vmalloc range. Primarily this range is used when programs require huge chunks of memory which are virtually contiguous but physically scattered. Using this, the kernel stack can now be allotted as individual pages mapped to the vmalloc range. Virtual mapping also enables protection from overruns, as a no-access guard page can be allocated with a page table entry (without wasting an actual page). Guard pages would prompt the kernel to pop an oops message on memory overrun and initiate a kill against the overrunning process.
Virtually mapped kernel stacks with guard pages are currently available only for the x86-64 architecture (support for other architectures seemingly to follow). This can be enabled by choosing the HAVE_ARCH_VMAP_STACK or CONFIG_VMAP_STACK build-time options.
During kernel boot, a kernel thread called init is spawned, which in turn is configured to initialize the first user-mode process (with the same name). The init (pid 1) process is then configured to carry out various initialization operations specified through configuration files, creating multiple processes. All child processes further created (which may in turn create their own child process(es)) are descendants of the init process. Processes thus created end up in a tree-like structure or a single hierarchy model. The shell, which is one such process, becomes the interface for users to create user processes when programs are called for execution.
fork(), vfork(), exec(), clone(), wait(), and exit() are the core kernel interfaces for the creation and control of new processes. These operations are invoked through corresponding user-mode APIs.
fork() is one of the core "Unix process APIs" available across *nix systems since the inception of legacy Unix releases. Aptly named, it forks a new process from a running process. When fork() succeeds, the new process (referred to as the child) is created by duplicating the caller's address space and task structure. On return from fork(), both the caller (parent) and the new process (child) resume executing instructions from the same code segment, which was duplicated under copy-on-write. fork() is perhaps the only API that enters kernel mode in the context of the caller process, and on success returns to user mode in the context of both the caller and the child (new process).
Most resource entries of the parent's task structure, such as the memory descriptor, file descriptor table, signal descriptors, and scheduling attributes, are inherited by the child, except for a few attributes such as memory locks, pending signals, active timers, and file record locks (for the full list of exceptions, refer to the fork(2) man page). A child process is assigned a unique pid and refers to its parent's pid through the ppid field of its task structure; the child's resource utilization and processor usage entries are reset to zero.
The parent process updates itself about the child’s state using the wait() system call and normally waits for the termination of the child process. If the parent fails to call wait(), a terminated child remains in the zombie state.
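The following user-space sketch (an illustrative example, not from the text) shows the fork-and-reap pattern just described; the parent blocks in waitpid() and collects the child's exit status, so no zombie is left behind:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t pid = fork();

        if (pid < 0) {
                perror("fork");
                exit(EXIT_FAILURE);
        }
        if (pid == 0) {                       /* child: runs in a duplicated address space */
                printf("child pid=%d ppid=%d\n", (int)getpid(), (int)getppid());
                exit(42);
        }

        int status;                           /* parent: block until the child exits */
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
                printf("child %d exited with %d\n", (int)pid, WEXITSTATUS(status));
        return 0;
}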
Duplicating a parent process to create a child requires cloning of the parent's user-mode address space (stack, data, code, and heap segments) and task structure for the child; this would result in execution overhead that leads to non-deterministic process-creation time. To make matters worse, this cloning would be rendered useless if neither the parent nor the child initiated any state-change operations on the cloned resources.
As per COW, when a child is created, it is allocated a unique task structure with all resource entries (including page tables) referring to the parent's task structure, with read-only access for both parent and child. Resources are truly duplicated only when either of the processes initiates a state-change operation, hence the name copy-on-write (write in COW implies a state change). COW does bring effectiveness and optimization to the fore, by deferring the need for duplicating process data until a write occurs, and in cases where only reads happen, it avoids it altogether. This on-demand copying also reduces the number of swap pages needed, cuts down the time spent on swapping, and might help reduce demand paging.
At times creating a child process might not be useful unless it runs a new program altogether: the exec family of calls serves precisely this purpose. exec replaces the existing program in a process with a new executable binary:
#include <unistd.h>

int execve(const char *filename, char *const argv[],
           char *const envp[]);
execve is the system call that executes the program binary file passed as its first argument. The second and third arguments are null-terminated arrays of command-line arguments and environment strings to be passed to the new program. This system call can also be invoked through various glibc (library) wrappers, which are found to be more convenient and flexible:
#include <unistd.h>

extern char **environ;

int execl(const char *path, const char *arg, ...);
int execlp(const char *file, const char *arg, ...);
int execle(const char *path, const char *arg, ..., char *const envp[]);
int execv(const char *path, char *const argv[]);
int execvp(const char *file, char *const argv[]);
int execvpe(const char *file, char *const argv[], char *const envp[]);
Command-line user-interface programs such as the shell use the exec interface to launch user-requested program binaries.
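The sketch below (illustrative; the program launched, ls, is only an example) shows the fork-then-exec pattern a shell follows when asked to run a program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t pid = fork();

        if (pid == 0) {
                char *argv[] = { "ls", "-l", NULL };

                execvp("ls", argv);          /* replaces the child's program image */
                perror("execvp");            /* reached only if exec fails */
                _exit(EXIT_FAILURE);
        }
        waitpid(pid, NULL, 0);               /* parent reaps the launched program */
        return 0;
}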
Unlike fork(), vfork() creates a child process and blocks the parent, which means that the child runs as a single thread and does not allow concurrency; in other words, the parent process is temporarily suspended until the child exits or calls exec(). The child shares the data of the parent.
The flow of execution in a process is referred to as a thread, which implies that every process will at least have one thread of execution. Multi-threaded means the existence of multiple flows of execution contexts in a process. With modern many-core architectures, multiple flows of execution in a process can be truly concurrent, achieving fair multitasking.
Threads are normally enumerated as pure user-level entities within a process that are scheduled for execution; they share the parent's virtual address space and system resources. Each thread maintains its own stack and thread-local storage. Threads are scheduled and managed by the thread library, which uses a structure referred to as a thread object to hold a unique thread identifier, scheduling attributes, and the saved thread context. User-level thread applications are generally lighter on memory, and are the preferred model of concurrency for event-driven applications. On the flip side, such a user-level thread model is not suitable for parallel computing, since the threads are tied to the same processor core to which their parent process is bound.
Linux doesn’t support user-level threads directly; it instead proposes an alternate API to enumerate a special process, called a light weight process (LWP), that can share a set of configured resources such as dynamic memory allocations, global data, open files, signal handlers, and other extensive resources with the parent process. Each LWP is identified by a unique PID and task structure, and is treated by the kernel as an independent execution context. In Linux, the term thread invariably refers to an LWP, since each thread initialized by the thread library (Pthreads) is enumerated as an LWP by the kernel.
clone() is a Linux-specific system call to create a new process; it is considered a generic version of the fork() system call, offering finer control to customize its functionality through the flags argument:
int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);
It provides more than twenty different CLONE_* flags that control various aspects of the clone operation, including whether the parent and child process share resources such as virtual memory, open file descriptors, and signal dispositions. The child is created with the appropriate memory address (passed as the second argument) to be used as the stack (for storing the child's local data). The child process starts its execution with its start function (passed as the first argument to the clone call).
When a process attempts to create a thread through the pthread library, clone() is invoked with the following flags:
/* clone flags for creating threads */
flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD |
        CLONE_SYSVSEM | CLONE_SETTLS | CLONE_PARENT_SETTID |
        CLONE_CHILD_CLEARTID;

clone() can also be used to create a regular child process that is normally spawned using fork() and vfork():
/* clone flags for forking child */
flags = SIGCHLD;

/* clone flags for vfork child */
flags = CLONE_VFORK | CLONE_VM | SIGCHLD;
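To tie these flags together, here is a user-space sketch (illustrative, with an arbitrarily chosen stack size) that uses the glibc clone() wrapper to run a child function on a caller-supplied stack while sharing the parent's address space through CLONE_VM:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_counter;                    /* visible to the child via CLONE_VM */

static int child_func(void *arg)
{
        shared_counter = 42;                  /* writes into the parent's memory too */
        return 0;
}

int main(void)
{
        const size_t stack_size = 1024 * 1024;
        char *stack = malloc(stack_size);

        if (!stack) {
                perror("malloc");
                exit(EXIT_FAILURE);
        }
        /* The stack grows downward on most architectures, so pass its top. */
        pid_t pid = clone(child_func, stack + stack_size,
                          CLONE_VM | SIGCHLD, NULL);
        if (pid == -1) {
                perror("clone");
                exit(EXIT_FAILURE);
        }
        waitpid(pid, NULL, 0);
        printf("child (pid %d) wrote %d into shared memory\n",
               (int)pid, shared_counter);
        free(stack);
        return 0;
}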
To cater to the need for running background operations, the kernel spawns threads (similar to processes). These kernel threads are similar to regular processes in that they are represented by a task structure and assigned a PID. Unlike user processes, they do not have any address space mapped, and they run exclusively in kernel mode, which makes them non-interactive. Various kernel subsystems use kthreads to run periodic and asynchronous operations.
All kernel threads are descendants of kthreadd (pid 2), which is spawned by the kernel (pid 0) during boot. The kthreadd enumerates other kernel threads; it provides interface routines through which other kernel threads can be dynamically spawned at runtime by kernel services. Kernel threads can be viewed from the command line with the ps -ef command--they are shown in [square brackets]:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 22:43 ? 00:00:01 /sbin/init splash
root 2 0 0 22:43 ? 00:00:00 [kthreadd]
root 3 2 0 22:43 ? 00:00:00 [ksoftirqd/0]
root 4 2 0 22:43 ? 00:00:00 [kworker/0:0]
root 5 2 0 22:43 ? 00:00:00 [kworker/0:0H]
root 7 2 0 22:43 ? 00:00:01 [rcu_sched]
root 8 2 0 22:43 ? 00:00:00 [rcu_bh]
root 9 2 0 22:43 ? 00:00:00 [migration/0]
root 10 2 0 22:43 ? 00:00:00 [watchdog/0]
root 11 2 0 22:43 ? 00:00:00 [watchdog/1]
root 12 2 0 22:43 ? 00:00:00 [migration/1]
root 13 2 0 22:43 ? 00:00:00 [ksoftirqd/1]
root 15 2 0 22:43 ? 00:00:00 [kworker/1:0H]
root 16 2 0 22:43 ? 00:00:00 [watchdog/2]
root 17 2 0 22:43 ? 00:00:00 [migration/2]
root 18 2 0 22:43 ? 00:00:00 [ksoftirqd/2]
root 20 2 0 22:43 ? 00:00:00 [kworker/2:0H]
root 21 2 0 22:43 ? 00:00:00 [watchdog/3]
root 22 2 0 22:43 ? 00:00:00 [migration/3]
root 23 2 0 22:43 ? 00:00:00 [ksoftirqd/3]
root 25 2 0 22:43 ? 00:00:00 [kworker/3:0H]
root 26 2 0 22:43 ? 00:00:00 [kdevtmpfs]
/*kthreadd creation code (init/main.c) */
static noinline void __ref rest_init(void)
{
int pid;
rcu_scheduler_starting();
/*
* We need to spawn init first so that it obtains pid 1, however
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
kernel_thread(kernel_init, NULL, CLONE_FS);
numa_default_policy();
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
complete(&kthreadd_done);
/*
* The boot idle thread must execute schedule()
* at least once to get things moving:
*/
init_idle_bootup_task(current);
schedule_preempt_disabled();
/* Call into cpu_idle with preempt disabled */
cpu_startup_entry(CPUHP_ONLINE);
}
The previous code shows the kernel boot routine rest_init() invoking the kernel_thread() routine with appropriate arguments to spawn both the kernel_init thread (which then goes on to start the user-mode init process) and kthreadd.
kthreadd is a perpetually running thread that looks into a list called kthread_create_list for data on new kthreads to be created:
/* kthreadd routine (kthread.c) */
int kthreadd(void *unused)
{
        struct task_struct *tsk = current;

        /* Setup a clean context for our children to inherit. */
        set_task_comm(tsk, "kthreadd");
        ignore_signals(tsk);
        set_cpus_allowed_ptr(tsk, cpu_all_mask);
        set_mems_allowed(node_states[N_MEMORY]);

        current->flags |= PF_NOFREEZE;

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (list_empty(&kthread_create_list))
                        schedule();
                __set_current_state(TASK_RUNNING);

                spin_lock(&kthread_create_lock);
                while (!list_empty(&kthread_create_list)) {
                        struct kthread_create_info *create;

                        create = list_entry(kthread_create_list.next,
                                            struct kthread_create_info, list);
                        list_del_init(&create->list);
                        spin_unlock(&kthread_create_lock);

                        create_kthread(create); /* creates kernel threads with attributes enqueued */

                        spin_lock(&kthread_create_lock);
                }
                spin_unlock(&kthread_create_lock);
        }

        return 0;
}
Kernel threads are created by invoking either kthread_create() or its wrapper kthread_run(), passing appropriate arguments that define the kthread (start routine, argument data for the start routine, and name). The following code snippet shows kthread_create() invoking kthread_create_on_node(), which by default creates threads on the current NUMA node:
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
                                           void *data,
                                           int node,
                                           const char namefmt[], ...);
/**
 * kthread_create - create a kthread on the current node
 * @threadfn: the function to run in the thread
 * @data: data pointer for @threadfn()
 * @namefmt: printf-style format string for the thread name
 * @...: arguments for @namefmt.
 *
 * This macro will create a kthread on the current node, leaving it in
 * the stopped state. This is just a helper for kthread_create_on_node();
 * see the documentation there for more details.
 */
#define kthread_create(threadfn, data, namefmt, arg...) \
        kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)

struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
                                          void *data,
                                          unsigned int cpu,
                                          const char *namefmt);
/**
 * kthread_run - create and wake a thread.
 * @threadfn: the function to run until signal_pending(current).
 * @data: data ptr for @threadfn.
 * @namefmt: printf-style name for the thread.
 *
 * Description: Convenient wrapper for kthread_create() followed by
 * wake_up_process(). Returns the kthread or ERR_PTR(-ENOMEM).
 */
#define kthread_run(threadfn, data, namefmt, ...)                          \
({                                                                         \
        struct task_struct *__k                                            \
                = kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
        if (!IS_ERR(__k))                                                  \
                wake_up_process(__k);                                      \
        __k;                                                               \
})
kthread_create_on_node() instantiates details (received as arguments) of the kthread to be created into a structure of type kthread_create_info and queues it at the tail of kthread_create_list. It then wakes up kthreadd and waits for thread creation to complete:
/* kernel/kthread.c */
static struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
                                                    void *data, int node,
                                                    const char namefmt[],
                                                    va_list args)
{
        DECLARE_COMPLETION_ONSTACK(done);
        struct task_struct *task;
        struct kthread_create_info *create = kmalloc(sizeof(*create),
                                                     GFP_KERNEL);

        if (!create)
                return ERR_PTR(-ENOMEM);
        create->threadfn = threadfn;
        create->data = data;
        create->node = node;
        create->done = &done;

        spin_lock(&kthread_create_lock);
        list_add_tail(&create->list, &kthread_create_list);
        spin_unlock(&kthread_create_lock);

        wake_up_process(kthreadd_task);
        /*
         * Wait for completion in killable state, for I might be chosen by
         * the OOM killer while kthreadd is trying to allocate memory for
         * new kernel thread.
         */
        if (unlikely(wait_for_completion_killable(&done))) {
                /*
                 * If I was SIGKILLed before kthreadd (or new kernel thread)
                 * calls complete(), leave the cleanup of this structure to
                 * that thread.
                 */
                if (xchg(&create->done, NULL))
                        return ERR_PTR(-EINTR);
                /*
                 * kthreadd (or new kernel thread) will call complete()
                 * shortly.
                 */
                wait_for_completion(&done); /* wakeup on completion of thread creation */
        }
        ...
        ...
        ...
}

struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
                                           void *data, int node,
                                           const char namefmt[], ...)
{
        struct task_struct *task;
        va_list args;

        va_start(args, namefmt);
        task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
        va_end(args);

        return task;
}
Recall that kthreadd invokes the create_kthread() routine to start kernel threads as per the data queued into the list. This routine creates the thread and signals completion:
/* kernel/kthread.c */
static void create_kthread(struct kthread_create_info *create)
{
        int pid;

#ifdef CONFIG_NUMA
        current->pref_node_fork = create->node;
#endif
        /* We want our own signal handler (we take no signals by default). */
        pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
        if (pid < 0) {
                /* If user was SIGKILLed, I release the structure. */
                struct completion *done = xchg(&create->done, NULL);

                if (!done) {
                        kfree(create);
                        return;
                }
                create->result = ERR_PTR(pid);
                complete(done); /* signal completion of thread creation */
        }
}
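For kernel code that simply needs a background worker, kthread_run() is the usual entry point. The following module sketch (illustrative; the demo_* names and the one-second loop interval are assumptions, not from the text) spawns a kthread at load time and stops it on unload:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
        while (!kthread_should_stop()) {       /* loop until kthread_stop() is called */
                pr_info("demo worker kthread alive\n");
                msleep(1000);
        }
        return 0;
}

static int __init demo_init(void)
{
        worker = kthread_run(worker_fn, NULL, "demo_worker");
        return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit demo_exit(void)
{
        kthread_stop(worker);                  /* wakes the thread and waits for it to exit */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");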
All of the process/thread creation calls discussed so far invoke different system calls (except create_kthread()) to step into kernel mode. All of those system calls in turn converge into the common kernel function _do_fork(), which is invoked with distinct CLONE_* flags. _do_fork() internally falls back on copy_process() to complete the task. The following figure sums up the call sequence for process creation:
/* kernel/fork.c */

/*
 * Create a kernel thread.
 */
pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
        return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
                        (unsigned long)arg, NULL, NULL, 0);
}

/* sys_fork: create a child process by duplicating caller */
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
        return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
#else
        /* cannot support in nommu mode */
        return -EINVAL;
#endif
}

/* sys_vfork: create vfork child process */
SYSCALL_DEFINE0(vfork)
{
        return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
                        0, NULL, NULL, 0);
}

/* sys_clone: create child process as per clone flags */
#ifdef __ARCH_WANT_SYS_CLONE
#ifdef CONFIG_CLONE_BACKWARDS
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                int __user *, parent_tidptr,
                unsigned long, tls,
                int __user *, child_tidptr)
#elif defined(CONFIG_CLONE_BACKWARDS2)
SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
                int __user *, parent_tidptr,
                int __user *, child_tidptr,
                unsigned long, tls)
#elif defined(CONFIG_CLONE_BACKWARDS3)
SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
                int, stack_size,
                int __user *, parent_tidptr,
                int __user *, child_tidptr,
                unsigned long, tls)
#else
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                int __user *, parent_tidptr,
                int __user *, child_tidptr,
                unsigned long, tls)
#endif
{
        return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}
#endif

During the lifetime of a process, it traverses through many states before it ultimately terminates. Users must have proper mechanisms to be updated with all that happens to a process during its lifetime. Linux provides a set of functions for this purpose.
For processes and threads created by a parent, it might be functionally useful for the parent to know the execution status of the child process/thread. This can be achieved using the wait family of system calls:
#include <sys/types.h>
#include <sys/wait.h>

pid_t wait(int *status);
pid_t waitpid(pid_t pid, int *status, int options);
int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
These system calls update the calling process with the state change events of a child. The following state change events are notified:
- Termination of child
- Stopped by a signal
- Resumed by a signal
In addition to reporting the status, these APIs allow the parent process to reap a terminated child. A process on termination is put into zombie state until the immediate parent engages the wait call to reap it.
Every process must end. Process termination happens either when the process calls exit() or when the main function returns. A process may also be terminated abruptly on receiving a signal or exception that forces it to terminate, such as the KILL command, which sends a signal to kill the process, or when an exception is raised. Upon termination, the process is put into the exit state until the immediate parent reaps it.
exit() calls the sys_exit system call, which internally calls the do_exit routine. do_exit primarily performs the following tasks (do_exit sets many values and makes multiple calls to related kernel routines to complete its task):
- Takes the exit code returned by the child to the parent.
- Sets the PF_EXITING flag, indicating that the process is exiting.
- Cleans up and reclaims the resources held by the process. This includes releasing mm_struct, removal from the queue if it is waiting for an IPC semaphore, release of filesystem data and files, if any, and calling schedule() as the process is no longer executable.
After do_exit, the process remains in zombie state and the process descriptor stays intact for the parent to collect the status, after which the resources are reclaimed by the system.
Users logged into a Linux system have a transparent view of various system entities such as global resources, processes, the kernel, and other users. For instance, a valid user can access the PIDs of all running processes on the system (irrespective of the user to which they belong). Users can observe the presence of other users on the system, and they can run commands to view the state of global system resources such as memory, filesystem mounts, and devices. Such operations are not deemed intrusions or considered security breaches, as it is always guaranteed that one user/process can never intrude into another user/process.
However, such transparency is unwarranted on a few server platforms. For instance, consider cloud service providers offering PaaS (platform as a service). They offer an environment to host and deploy custom client applications. They manage runtime, storage, operating system, middleware, and networking services, leaving customers to manage their applications and data. PaaS services are used by various e-commerce, financial, online gaming, and other related enterprises.
For efficient and effective isolation and resource management for clients, PaaS service providers use various tools. They virtualize the system environment for each client to achieve security, reliability, and robustness. The Linux kernel provides low-level mechanisms in the form of cgroups and namespaces for building various lightweight tools that can virtualize the system environment. Docker is one such framework that builds on cgroups and namespaces.
Namespaces fundamentally are mechanisms to abstract, isolate, and limit the visibility that a group of processes has over various system entities such as process trees, network interfaces, user IDs, and filesystem mounts. Namespaces are categorized into several groups, which we will now see.
Traditionally, mount and unmount operations change the filesystem view as seen by all processes in the system; in other words, there is one global mount namespace seen by all processes. Mount namespaces confine the set of filesystem mount points visible within a process namespace, enabling one process group in a mount namespace to have an exclusive view of the filesystem list compared to another.
These enable isolating the system's host and domain name within a uts namespace. This allows initialization and configuration scripts to be guided based on the respective namespaces.
These isolate processes' use of System V IPC and POSIX message queues, preventing a process in one ipc namespace from accessing the resources of another.
Traditionally, *nix kernels (including Linux) spawn the init process with PID 1 during system boot, which in turn starts other user-mode processes and is considered the root of the process tree (all the other processes start below this process in the tree). The PID namespace allows a process to spin off a new tree of processes under it with its own root process (PID 1 process). PID namespaces isolate process ID numbers, and allow duplication of PID numbers across different PID namespaces, which means that processes in different PID namespaces can have the same process ID. The process IDs within a PID namespace are unique, and are assigned sequentially starting with PID 1.
PID namespaces are used in containers (a lightweight virtualization solution) to migrate a container with its process tree onto a different host system without any changes to PIDs.
This type of namespace provides abstraction and virtualization of network protocol services and interfaces. Each network namespace has its own network device instances that can be configured with individual network addresses. Isolation is enabled for other network services as well: routing tables, port numbers, and so on.
User namespaces allow a process to use unique user and group IDs within and outside a namespace. This means that a process can use privileged user and group IDs (zero) within a user namespace and continue with non-zero user and group IDs outside the namespace.
A cgroup namespace virtualizes the contents of the /proc/self/cgroup file. Processes inside a cgroup namespace are only able to view paths relative to their namespace root.
Cgroups are kernel mechanisms to restrict and measure resource allocations to each process group. Using cgroups, you can allocate resources such as CPU time, network, and memory.
Similar to the process model in Linux, where each process is a child to a parent and relatively descends from the init process, thus forming a single tree-like structure, cgroups are hierarchical, where child cgroups inherit the attributes of the parent. What makes them different is that multiple cgroup hierarchies can exist within a single system, each having distinct resource prerogatives.
Applying cgroups on namespaces results in isolation of processes into containers within a system, where resources are managed distinctly. Each container is a lightweight virtual machine, all of which run as individual entities and are oblivious of other entities within the same system.
The following are the namespace APIs described in the Linux man page for namespaces:
clone(2): The clone(2) system call creates a new process. If the flags argument of the call specifies one or more of the CLONE_NEW* flags listed below, then new namespaces are created for each flag, and the child process is made a member of those namespaces. (This system call also implements a number of features unrelated to namespaces.)

setns(2): The setns(2) system call allows the calling process to join an existing namespace. The namespace to join is specified via a file descriptor that refers to one of the /proc/[pid]/ns files described below.

unshare(2): The unshare(2) system call moves the calling process to a new namespace. If the flags argument of the call specifies one or more of the CLONE_NEW* flags listed below, then new namespaces are created for each flag, and the calling process is made a member of those namespaces. (This system call also implements a number of features unrelated to namespaces.)

Namespace   Constant          Isolates
Cgroup      CLONE_NEWCGROUP   Cgroup root directory
IPC         CLONE_NEWIPC      System V IPC, POSIX message queues
Network     CLONE_NEWNET      Network devices, stacks, ports, etc.
Mount       CLONE_NEWNS       Mount points
PID         CLONE_NEWPID      Process IDs
User        CLONE_NEWUSER     User and group IDs
UTS         CLONE_NEWUTS      Hostname and NIS domain name
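As a small, hedged demonstration of these APIs (this requires CAP_SYS_ADMIN, and the hostname string is just an example), the following program moves itself into a new UTS namespace with unshare(2) and changes the hostname there without affecting the rest of the system:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char name[64];

        if (unshare(CLONE_NEWUTS) == -1) {     /* create and join a new uts namespace */
                perror("unshare");
                return 1;
        }
        if (sethostname("container-demo", strlen("container-demo")) == -1) {
                perror("sethostname");
                return 1;
        }
        gethostname(name, sizeof(name));
        printf("hostname inside the new uts namespace: %s\n", name);
        return 0;                              /* the original namespace keeps its hostname */
}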
We understood one of the principal abstractions of Linux called the process, and the whole ecosystem that facilitates this abstraction. The challenge now remains in running the scores of processes by providing fair CPU time. With many-core systems imposing a multitude of processes with diverse policies and priorities, the need for deterministic scheduling is paramount.
In our next chapter, we will delve into process scheduling, another critical aspect of process management, and comprehend how the Linux scheduler is designed to handle this diversity.