Where It All Starts From – The Virtual Filesystem
Even with astronomical advances in software development, the Linux kernel remains one of the most complex pieces of code. Developers, programmers, and would-be kernel hackers constantly look to dive into kernel code and push for new features, whereas hobbyists and enthusiasts try to understand and unravel those mysteries.
Naturally, a lot has been written on Linux and its internal workings, from general administration to kernel programming. Over the decades, hundreds of books have been published, which cover a diverse range of important operating system topics, such as process creation, threading, memory management, virtualization, filesystem implementations, and CPU scheduling. This book that you’ve picked up (thank you!) will focus on the storage stack in Linux and its multilayered organization.
We’ll start by introducing the Virtual Filesystem in the Linux kernel and its pivotal role in allowing end user programs to access data on filesystems. Since we intend to cover the entire storage stack in this book, from top to bottom, getting a deeper understanding of the Virtual Filesystem is extremely important, as it is the starting point of an I/O request in the kernel. We’ll introduce the concept of user space and kernel space, understand system calls, and see how the Everything is a file philosophy in Linux is tied to the Virtual Filesystem.
In this chapter, we’re going to cover the following main topics:
- Understanding storage in a modern-day data center
- Defining system calls
- Explaining the need for a Virtual Filesystem
- Describing the Virtual Filesystem
- Explaining the Everything is a file philosophy
Technical requirements
Before going any further, I think it is important to acknowledge that certain technical topics may be more challenging for beginners to comprehend than others. Since the goal here is to understand the inner workings of the Linux kernel and its major subsystems, it will be helpful to have a decent foundational understanding of operating system concepts in general and Linux in particular. Above all, it is important to approach these topics with patience, curiosity, and a willingness to learn.
The commands and examples presented in this chapter are distribution-agnostic and can be run on any Linux operating system, such as Debian, Ubuntu, Red Hat, or Fedora. There are a few references to the kernel source code; if you want to browse it, you can download the source from https://www.kernel.org. The operating system packages relevant to this chapter can be installed as follows:
- For Ubuntu/Debian:
sudo apt install strace
sudo apt install bcc
- For Fedora/CentOS/Red Hat-based systems:
sudo yum install strace
sudo yum install bcc-tools
Understanding storage in a modern-day data center
Compute, storage, and networking are the basic building blocks of any infrastructure. How well your applications perform often depends on the combined performance of these three layers. The workloads running in a modern data center vary from streaming services to machine learning applications. With the meteoric rise and adoption of cloud computing platforms, all of these building blocks are now abstracted from the end user. Adding more hardware resources to an application as it becomes resource-hungry is the new normal. Troubleshooting performance issues is often skipped in favor of migrating applications to better hardware platforms.
Of the three building blocks, compute, storage, and networking, storage is most often considered the bottleneck. For applications such as databases, the performance of the underlying storage is of prime importance. When the infrastructure hosts mission-critical and time-sensitive applications such as Online Transaction Processing (OLTP), the performance of storage frequently comes under scrutiny. The smallest of delays in servicing I/O requests can impact the overall response of the application.
The most common metric used to measure storage performance is latency. The response times of storage devices are usually measured in milliseconds. Compare that with your average processor or memory, where response times are measured in nanoseconds, and you'll see how the performance of the storage layer can impact the overall working of your system. This results in a state of incongruity between what applications require and what the underlying storage can actually deliver. For the last few years, most of the advancements in modern-day storage drives have been geared toward capacity. Performance, however, has not improved at the same rate. The performance of storage pales in comparison to that of the compute layer, which is why it is often termed the three-legged dog of the data center.
Having made a point about the choice of a storage medium, it’s pertinent to note that no matter how powerful it is, the hardware will always have limitations in its functionality. It’s equally important for the application and operating system to tune themselves according to the hardware. Fine-tuning your application, operating system, and filesystem parameters can give a major boost to the overall performance. To utilize the underlying hardware to its full potential, all layers of the I/O hierarchy need to function efficiently.
Interacting with storage in Linux
The Linux kernel makes a clear distinction between user space and kernel space. All hardware resources, such as the CPU, memory, and storage, lie in the kernel space. Any user space application that wants to access resources in kernel space has to generate a system call, as shown in Figure 1.1:

Figure 1.1 – The interaction between user space and kernel space
User space refers to all the applications and processes that live outside of the kernel. The kernel space includes programs such as device drivers, which have unrestricted access to the underlying hardware. The user space can be considered a form of sandboxing to restrict the end user programs from modifying critical kernel functions.
This concept of user and kernel space is deeply rooted in the design of modern processors. A traditional x86 CPU uses the concept of protection domains, called rings, to share and limit access to hardware resources. Processors offer four rings or modes, which are numbered from 0 to 3. Modern-day processors are designed to operate in two of these modes, ring 0 and ring 3. The user space applications are handled in ring 3, which has limited access to kernel resources. The kernel occupies ring 0. This is where the kernel code executes and interacts with the underlying hardware resources.
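To illustrate the boundary, consider the following hypothetical sketch (my own illustration, not from the chapter's examples). Asking the kernel to act on our behalf through a system call works fine, but directly dereferencing a kernel-space address from ring 3 is blocked by the processor and surfaces as a segmentation fault. The address used here is an arbitrary, illustrative kernel-space value on x86_64:

#include <stdio.h>
#include <signal.h>
#include <unistd.h>

static void on_segv(int sig)
{
    /* The kernel's page fault handler turns the illegal access into SIGSEGV. */
    const char msg[] = "SIGSEGV: ring 3 cannot touch kernel memory directly\n";
    (void)sig;
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    _exit(0);
}

int main(void)
{
    /* Crossing into ring 0 the supported way: the write() system call. */
    const char hello[] = "hello from user space\n";
    write(STDOUT_FILENO, hello, sizeof(hello) - 1);

    signal(SIGSEGV, on_segv);

    /* An arbitrary kernel-space address, for illustration only. */
    volatile unsigned long *kaddr = (volatile unsigned long *)0xffffffff81000000UL;
    unsigned long value = *kaddr;             /* faults: user mode has no access */

    printf("unexpectedly read %lx\n", value); /* never reached */
    return 0;
}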
When processes need to read from or write to a file, they need to interact with the filesystem structures on top of the physical disk. Every filesystem uses different methods to organize data on the physical disk. The request from the process doesn’t directly reach the filesystem or physical disk. In order for the I/O request of the process to be served by the physical disk, it has to traverse through the entire storage hierarchy in the kernel. The first layer in that hierarchy is known as the Virtual Filesystem. The following figure, Figure 1.2, highlights the major components of the Virtual Filesystem:

Figure 1.2 – The Virtual Filesystem (VFS) layer in the kernel
The storage stack in Linux consists of a multitude of cohesive layers, all of which ensure that the access to physical storage media is abstracted through a unified interface. As we move forward, we’re going to build upon this structure and add more layers. We’ll try to dig deep into each of them and see how they all work in harmony.
This chapter will focus solely on the Virtual Filesystem and its various features. In the coming chapters, we’re going to explain and uncover some under-the-hood workings of the more frequently used filesystems in Linux. However, bearing in mind the number of times the word filesystem is going to be used here, I think it’s prudent to briefly categorize the different filesystem types, just to avoid any confusion:
- Block filesystems: Block- or disk-based filesystems are the most common way to store user data. As a regular operating system user, these are the filesystems you mostly interact with. Filesystems such as Extended filesystem version 2/3/4 (Ext 2/3/4), Extent filesystem (XFS), Btrfs, FAT, and NTFS are all categorized as disk-based or block filesystems. These filesystems speak in terms of blocks. The block size is a property of the filesystem, and it can only be set when creating a filesystem on a device; a short sketch after this list shows how to query it from a program. The block size indicates what size the filesystem will use when reading or writing data. We can refer to it as the logical unit of storage allocation and retrieval for a filesystem. A device that can be accessed in terms of blocks is, therefore, called a block device. Any storage device attached to a computer, whether it is a hard drive or an external USB drive, can be classified as a block device. Traditionally, block filesystems are mounted on a single host and do not allow sharing between multiple hosts.
- Clustered filesystems: Clustered filesystems are also block filesystems and use block-based access methods to read and write data. The difference is that they allow a single filesystem to be mounted and used simultaneously by multiple hosts. Clustered filesystems are based on the concept of shared storage, meaning that multiple hosts can concurrently access the same block device. Common clustered filesystems used in Linux are Red Hat’s Global File System 2 (GFS2) and Oracle Clustered File System (OCFS).
- Network filesystems (NFS): NFS is a protocol that allows for remote file sharing. Unlike regular block filesystems, NFS is based on the concept of sharing data between multiple hosts. NFS works with the concept of a client and a server. The backend storage is provided by an NFS server. The host systems on which the NFS filesystem is mounted are called clients. The connectivity between the client and server is achieved using conventional Ethernet. All NFS clients share a single copy of the file on the NFS server. NFS doesn’t offer the same performance as block filesystems, but it is still used in enterprise environments, mostly to store long-term backups and share common data.
- Pseudo filesystems: Pseudo filesystems exist in the kernel and generate their content dynamically. They are not used to store data persistently and do not behave like regular disk-based filesystems such as Ext4 or XFS. The main purpose of a pseudo filesystem is to allow user space programs to interact with the kernel. Directories such as /proc (procfs) and /sys (sysfs) fall under this category. These directories contain virtual or temporary files, which include information about the different kernel subsystems. These pseudo filesystems are also a part of the Virtual Filesystem landscape, as we'll see in the Everything is a file section.
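As mentioned in the Block filesystems entry, the block size is fixed when a filesystem is created. The following is a minimal sketch (not from the book's examples) of how a user space program can query the block size reported by a mounted filesystem, using the POSIX statvfs() interface; the path defaults to / and can be overridden on the command line:

#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char *argv[])
{
    const char *path = argc > 1 ? argv[1] : "/";   /* any path on the filesystem */
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }

    /* f_bsize is the filesystem block size; f_frsize is the fragment size
     * used for the capacity fields. On Ext4/XFS these are typically 4096. */
    printf("%s: block size %lu bytes, fragment size %lu bytes\n",
           path, vfs.f_bsize, vfs.f_frsize);
    return 0;
}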
Now that we have a basic idea about user space, kernel space, and the different types of filesystems, let’s explain how an application can request resources in kernel space through system calls.
Understanding system calls
While looking at the figure explaining the interaction between applications and the Virtual Filesystem, you may have noticed the intermediary layer between user space programs and the Virtual Filesystem; that layer is known as the system call interface. To request some service from the kernel, user space programs invoke the system call interface. These system calls provide the means for end user applications to access the resources in the kernel space, such as the processor, memory, and storage. The system call interface serves three main purposes:
- Ensuring security: System calls prevent user space applications from directly modifying resources in the kernel space
- Abstraction: Applications do not need to concern themselves with the underlying hardware specifications
- Portability: User programs can be run correctly on all kernels that implement the same set of interfaces
There’s often some confusion about the differences between system calls and an application programming interface (API). An API is a set of programming interfaces used by a program. These interfaces define a method of communication between two components. An API is implemented in user space and outlines how to acquire a particular service. A system call is a much lower-level mechanism that uses interrupts to make an explicit request to the kernel. The system call interface is provided by the standard C library in Linux.
If the system call generated by the calling process succeeds, a file descriptor is returned. A file descriptor is an integer number that is used to access files. For example, when a file is opened using the open() system call, a file descriptor is returned to the calling process. Once a file has been opened, programs use the file descriptor to perform operations on the file. All read, write, and other operations are performed using the file descriptor.
Every process always has a minimum of three files opened – standard input, standard output, and standard error – represented by the file descriptors 0, 1, and 2, respectively. The next file opened will be assigned the file descriptor value of 3. If we do a simple file listing through ls and run it under strace, the openat system call will return a value of 3, which is the file descriptor representing the first file opened – /etc/ld.so.cache, in this case. After that, this file descriptor value of 3 is used by the fstat and close calls to perform further operations:
strace ls /etc/hosts

root@linuxbox:~# strace ls /etc/hosts
execve("/bin/ls", ["ls", "/etc/hosts"], 0x7ffdee289b48 /* 22 vars */) = 0
brk(NULL) = 0x562b97fc6000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=140454, ...}) = 0
mmap(NULL, 140454, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fbaa2519000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
[The rest of the code is skipped for brevity.]
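The same numbering can be demonstrated with a few lines of C. The sketch below (an illustration; /etc/hosts is simply an example of a readable file) opens a file, prints the descriptor it receives – typically 3, since 0, 1, and 2 are already taken – and then reuses that descriptor for fstat() and close():

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    /* Any readable file will do; /etc/hosts is just an example. */
    int fd = open("/etc/hosts", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    fstat(fd, &st);                      /* further operations use the descriptor */
    printf("open() returned fd %d, file size %lld bytes\n",
           fd, (long long)st.st_size);

    close(fd);                           /* the kernel can now reuse this descriptor */
    return 0;
}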
On x86 systems, there are around 330 system calls. This number could be different for other architectures. Each system call is represented by a unique integer number. You can list the available system calls on your system using the ausyscall command. This will list the system calls and their corresponding integer values:
ausyscall --dump

root@linuxbox:~# ausyscall --dump
Using x86_64 syscall table:
0	read
1	write
2	open
3	close
4	stat
5	fstat
6	lstat
7	poll
8	lseek
9	mmap
10	mprotect
[The rest of the code is skipped for brevity.]
root@linuxbox:~# ausyscall --dump|wc -l
334
root@linuxbox:~#
The following table lists some common system calls:
System call | Description
open(), close() | Open and close files
creat() | Create a file
 | Change the
mount(), umount() | Mount and unmount filesystems
lseek() | Change the pointer position in a file
read(), write() | Read and write in a file
stat(), fstat() | Get a file status
statfs() | Get filesystem statistics
execve() | Execute the program referred to by pathname
access() | Checks whether the calling process can access the file pathname
mmap() | Creates a new mapping in the virtual address space of the calling process
Table 1.1 – Some common system calls
So, what role do the system calls play in interacting with filesystems? As we’ll see in the succeeding section, when a user space process generates a system call to access resources in the kernel space, the first component it interacts with is the Virtual Filesystem. This system call is first handled by the corresponding system call handler in the kernel, and after validating the operation requested, the handler makes a call to the appropriate function in the VFS layer. The VFS layer passes the request on to the appropriate filesystem driver module, which performs the actual operations on the file.
We need to understand the why here – why would the process interact with the Virtual Filesystem and not the actual filesystem on the disk? In the upcoming section, we’ll try to figure this out.
To summarize, the system call interface in Linux implements generic methods that can be used by applications in user space to access resources in the kernel space.
Explaining the need for a Virtual Filesystem
A standard filesystem is a set of data structures that determine how user data is organized on a disk. End users are able to interact with this standard filesystem through regular file access methods and perform common tasks. Every operating system (Linux or non-Linux) provides at least one such filesystem, and naturally, each of them claims to be better, faster, and more secure than the others. A great majority of modern Linux distributions use XFS or Ext4 as the default filesystem. These filesystems have several features and are considered stable and reliable for daily usage.
However, the support for filesystems in Linux is not limited to these two. One of the great benefits of using Linux is that it offers support for multiple filesystems, many of which can be considered perfectly acceptable alternatives to Ext4 and XFS. Because of this, Linux can peacefully coexist with other operating systems. Some of the more commonly used filesystems include the predecessors of Ext4, namely Ext2 and Ext3, as well as Btrfs, ReiserFS, OpenZFS, FAT, and NTFS. When using multiple partitions, users can choose from a long list of available filesystems and create a different one on every disk partition as per their needs.
The smallest addressable unit of a physical hard drive is a sector. For filesystems, the smallest writable unit is called a block. A block can be considered a group of consecutive sectors. All operations by a filesystem are performed in terms of blocks. There is no singular way in which these blocks are addressed and organized by different filesystems. Each filesystem may use a different set of data structures to allocate and store data on these blocks. The presence of a different filesystem on each storage partition can be difficult to manage. Given the wide range of supported filesystems in Linux, imagine if applications needed to understand the distinct details of every filesystem. In order to be compatible with a filesystem, the application would need to implement a unique access method for each filesystem it uses. This would make the design of an application almost impractical.
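To see the two units side by side, here is a minimal sketch (an illustration, not from the book) that queries the logical sector size of a block device with the BLKSSZGET ioctl; the filesystem block size, by contrast, is chosen at mkfs time and can be read with the statvfs() call shown earlier. The device path /dev/sda is an assumption, and the program needs permission to open the device:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char *argv[])
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sda";   /* assumed device name */
    int fd = open(dev, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    int logical_sector = 0;
    if (ioctl(fd, BLKSSZGET, &logical_sector) == 0)      /* property of the device */
        printf("%s: logical sector size %d bytes\n", dev, logical_sector);
    else
        perror("ioctl(BLKSSZGET)");

    close(fd);
    return 0;
}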
Abstraction interfaces play a critical role in the Linux kernel. In Linux, regardless of the filesystem being used, the end users or applications can interact with the filesystem using uniform access methods. All this is achieved through the Virtual Filesystem layer, which hides the filesystem implementations under an all-inclusive interface.
Describing the VFS
To ensure that applications do not face any such obstacles (as mentioned earlier) when working with different filesystems, the Linux kernel implements a layer between end user applications and the filesystem on which data is being stored. This layer is known as the Virtual Filesystem (VFS). The VFS is not a standard filesystem, such as Ext4 or XFS (there is no mkfs.vfs command!). For this reason, some prefer the term Virtual Filesystem Switch.
Think of the magic wardrobe from The Chronicles of Narnia. The wardrobe is actually a portal to the magical world of Narnia. Once you step through the wardrobe, you can explore the new world and interact with its inhabitants. The wardrobe facilitates accessing the magical world. In a similar way, the VFS provides a doorway to different filesystems.
The VFS defines a generic interface that allows multiple filesystems to coexist in Linux. It’s worth mentioning again that with the VFS, we’re not talking about a standard block-based filesystem. We’re talking about an abstraction layer that provides a link between the end user application and the actual block filesystems. Through the standardization implemented in the VFS, applications can perform read and write operations, without worrying about the underlying filesystem.
As shown in Figure 1.3, the VFS is interposed between the user space programs and actual filesystems:

Figure 1.3 – The VFS acts as a bridge between user space programs and filesystems
For the VFS to provide services to both parties, the following has to apply:
- All end user applications need to define their filesystem operations in terms of the standard interface provided by the VFS
- Every filesystem needs to provide an implementation of the common interface provided by the VFS
We explained that applications in user space need to generate system calls when they want to access resources in the kernel space. Through the abstraction provided by the VFS, system calls such as read() and write() function properly, regardless of the filesystem in use. These system calls work across filesystem boundaries. We don't need a special mechanism to move data to a different or non-native filesystem. For instance, we can easily move data from an Ext4 filesystem to XFS, and vice versa. At a very high level, when a process issues the read() or write() system call to read or write a file, the VFS will search for the filesystem driver to use and forward these system calls to that driver.
Implementing a common filesystem interface through the VFS
The primary goal of the VFS is to represent a diverse set of filesystems in the kernel with minimum overhead. When a process requests a read or write operation on a file, the kernel substitutes this with the filesystem-specific function on which the file resides. In order to achieve this, every filesystem must adapt itself in terms of the VFS.
Let’s go through the following example for a better understanding.
Consider the example of the cp (copy) command in Linux. Let's suppose we're trying to copy a file from an Ext4 to an XFS filesystem. How does this copy operation complete? How does the cp command interact with the two filesystems? Have a look at Figure 1.4:

Figure 1.4 – The VFS ensures interoperability between different filesystems
First off, the cp command doesn't care about the filesystems being used. We've defined the VFS as the layer that implements abstraction, so the cp command doesn't need to concern itself with the filesystem details. It will interact with the VFS layer through the standard system call interface. Specifically, it will issue the open() and read() system calls to open and read the file to be copied. An open file is represented by the file data structure in the kernel (as we'll learn in Chapter 2, Explaining the Data Structures in a VFS).
When cp generates these generic system calls, the kernel redirects the calls, through a pointer, to the appropriate function of the filesystem on which the file resides. To copy the file to the XFS filesystem, the write() system call is passed to the VFS. This will again be redirected to the particular function of the XFS filesystem that implements this feature. Through system calls issued to the VFS, the cp process can perform a copy operation using the read() method of Ext4 and the write() method of XFS. Just like a switch, the VFS will switch the common file access methods between their designated filesystem implementations.
The read, write, or any other function for that matter does not have a default definition in the kernel – hence the name virtual. The interpretation of these operations depends upon the underlying filesystem. Just like user programs that take advantage of this abstraction offered by the VFS, filesystems also reap the benefits of this approach. Common access methods for files do not need to be reimplemented by filesystems.
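To make this flow concrete, here is a heavily simplified, hypothetical sketch of what a cp-like copy boils down to in user space. Note how the program only ever issues generic open(), read(), and write() system calls; the VFS takes care of routing each call to the filesystem that actually holds the file:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <source> <destination>\n", argv[0]);
        return 1;
    }

    int in = open(argv[1], O_RDONLY);                            /* e.g., a file on Ext4 */
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644); /* e.g., a file on XFS */
    if (in < 0 || out < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    ssize_t n;
    while ((n = read(in, buf, sizeof(buf))) > 0)   /* VFS dispatches to the source filesystem */
        write(out, buf, n);                        /* VFS dispatches to the destination filesystem */

    close(in);
    close(out);
    return 0;
}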
That was pretty neat, right? But what if we want to copy something from Ext4 to a non-native filesystem? Filesystems such as Ext4, XFS, and Btrfs were specifically designed for Linux. What if one of the filesystems involved in this operation is FAT or NTFS?
Admittedly, the design of the VFS is biased toward filesystems that come from the Linux tribe. To an end user, there is a clear distinction between a file and a directory. In the Linux philosophy, everything is a file, including directories. Filesystems native to Linux, such as Ext4 and XFS, were designed keeping these nuances in mind. Because of the differences in the implementation, non-native filesystems such as FAT and NTFS do not support all of the VFS operations. The VFS in Linux uses structures such as inodes, superblocks, and directory entries to represent a generic view of a filesystem. Non-native Linux filesystems do not speak in terms of these structures. So how does Linux accommodate these filesystems? Take the example of the FAT filesystem. The FAT filesystem comes from a different world and doesn’t use these structures to represent files and directories. It doesn’t treat directories as files. So, how does the VFS interact with the FAT filesystem?
All filesystem-related operations in the kernel are firmly integrated with the VFS data structures. To accommodate non-native filesystems on Linux, the kernel constructs the corresponding data structures dynamically. For instance, to satisfy the common file model for filesystems such as FAT, files corresponding to directories are created in memory on the fly. These files are virtual and only exist in memory. This is an important concept to understand. On native filesystems, structures such as inodes and superblocks are not only present in memory but also stored on the physical medium itself. Non-native filesystems, on the other hand, only have to construct such structures in memory.
Peeking at the source code
If we take a look at the kernel source code, the different functions provided by the VFS are present in the fs directory. All source files ending in .c contain implementations of the different VFS methods. The subdirectories contain specific filesystem implementations, as shown in Figure 1.5:

Figure 1.5 – The source for kernel 5.19.9
You'll notice source files such as open.c and read_write.c, which contain the functions invoked when a user space process generates the open(), read(), and write() system calls. These files contain a lot of code, and since we won't create any new code here, this is merely a poking exercise. Nevertheless, there are a few important pieces of code in these files that highlight what we explained earlier. Let's take a quick peek at the read and write functions.
The SYSCALL_DEFINE3 macro is the standard way to define a system call that takes three parameters, and the name of the system call is passed as one of its arguments.
For the write system call, this definition looks as follows. Note that one of the parameters is the file descriptor:

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	return ksys_write(fd, buf, count);
}
Similarly, this is the definition for the read system call:

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
	return ksys_read(fd, buf, count);
}
Both call the ksys_write() and ksys_read() functions. Let's see the code for these two functions:

ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;
	******* Skipped *******
	ret = vfs_read(f.file, buf, count, ppos);
	******* Skipped *******
	return ret;
}

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;
	******* Skipped *******
	ret = vfs_write(f.file, buf, count, ppos);
	******* Skipped *******
	return ret;
}
The presence of the vfs_read() and vfs_write() functions indicates that we're transitioning to the VFS. These functions look up the file_operations structure for the underlying filesystem and invoke the appropriate read() and write() methods:

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
	******* Skipped *******
	if (file->f_op->read)
		ret = file->f_op->read(file, buf, count, pos);
	else if (file->f_op->read_iter)
		ret = new_sync_read(file, buf, count, pos);
	******* Skipped *******
	return ret;
}

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
	******* Skipped *******
	if (file->f_op->write)
		ret = file->f_op->write(file, buf, count, pos);
	else if (file->f_op->write_iter)
		ret = new_sync_write(file, buf, count, pos);
	******* Skipped *******
	return ret;
}
Each filesystem defines its own file_operations structure of pointers for the operations it supports. There are multiple definitions of the file_operations structure in the kernel source code, unique to each filesystem. The operations defined in this structure describe how read or write functions will be performed:

root@linuxbox:/linux-5.19.9/fs# grep -R "struct file_operations" * | wc -l
453
root@linuxbox:/linux-5.19.9/fs# grep -R "struct file_operations" *
9p/vfs_dir.c:const struct file_operations v9fs_dir_operations = {
9p/vfs_dir.c:const struct file_operations v9fs_dir_operations_dotl = {
9p/v9fs_vfs.h:extern const struct file_operations v9fs_file_operations;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_file_operations_dotl;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_dir_operations;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_dir_operations_dotl;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_cached_file_operations;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_cached_file_operations_dotl;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_mmap_file_operations;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_mmap_file_operations_dotl;
[The rest of the code is skipped for brevity.]
As you can see, the file_operations structure is used for a wide range of file types, including regular files, directories, device files, and network sockets. In general, any type of file that can be opened and manipulated using standard file I/O operations can be covered by this structure.
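As an illustration only (not taken from any real filesystem), the following kernel-style sketch shows the general shape of such a definition for a hypothetical filesystem, myfs. Real instances, such as ext4_file_operations in fs/ext4/file.c, follow the same pattern with many more callbacks filled in:

/* Hypothetical, simplified example of wiring filesystem methods into the VFS. */
#include <linux/fs.h>
#include <linux/uio.h>
#include <linux/module.h>

static ssize_t myfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	/* filesystem-specific read path would go here */
	return 0;
}

static ssize_t myfs_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	/* filesystem-specific write path would go here */
	return 0;
}

const struct file_operations myfs_file_operations = {
	.owner      = THIS_MODULE,
	.read_iter  = myfs_read_iter,   /* reached from vfs_read() via new_sync_read() */
	.write_iter = myfs_write_iter,  /* reached from vfs_write() via new_sync_write() */
	.open       = generic_file_open,
	.llseek     = generic_file_llseek,
};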
Tracing VFS functions
There are quite a few tracing mechanisms available in Linux that can offer a glance at how things work under the hood. One of them is the BPF Compiler Collection (BCC) tools. These tools offer a wide range of scripts that can record events for different subsystems in the kernel. You can install these tools for your operating system by following the instructions in the Technical requirements section. For now, we're just going to use one of the programs from this toolkit, called funccount. As the name suggests, funccount counts the number of function calls:

root@linuxbox:~# funccount --help
usage: funccount [-h] [-p PID] [-i INTERVAL] [-d DURATION] [-T] [-r] [-D] [-c CPU] pattern
Count functions, tracepoints, and USDT probes
Just to test and verify our understanding of what we stated earlier, we're going to run a simple copy process in the background and use the funccount program to trace the VFS functions that are invoked as a result of the cp command. As we're going to count the VFS calls for the cp process only, we need to use the -p flag to specify a process ID. The vfs_* parameter will trace all the VFS functions for the process. You'll see that the vfs_read() and vfs_write() functions are invoked by the cp process. The COUNT column specifies the number of times each function was called:

funccount -p process_ID 'vfs_*'

[root@linuxbox ~]# nohup cp myfile /tmp/myfile &
[1] 1228433
[root@linuxbox ~]# nohup: ignoring input and appending output to 'nohup.out'
[root@linuxbox ~]#
[root@linuxbox ~]# funccount -p 1228433 "vfs_*"
Tracing 66 functions for "b'vfs_*'"... Hit Ctrl-C to end.
^C
FUNC COUNT
b'vfs_read' 28015
b'vfs_write' 28510
Detaching...
[root@linuxbox ~]#
Let's run this again and see which system calls are used when doing a simple copy operation. As expected, the most frequently used system calls when doing cp are read and write. This time, we count the system call tracepoints for the process:

funccount 't:syscalls:sys_enter_*' -p process_ID
Let's summarize what we covered in this section. Linux offers support for a wide range of filesystems, and the VFS layer in the kernel ensures that this can be achieved without any hassle. The VFS provides a standardized way for end user processes to interact with different filesystems. This standardization is achieved by implementing a common file model. The VFS defines several virtual functions for common file operations. As a result of this approach, applications can universally perform regular file operations. When a process generates a system call, the VFS redirects the call to the appropriate function of the filesystem.
Explaining the Everything is a file philosophy
In Linux, all of the following are considered files:
- Directories
- Disk drives and their partitions
- Sockets
- Pipes
- CD-ROM
The phrase everything is a file implies that all the preceding entities in Linux are represented by file descriptors, abstracted over the VFS. You could also say that everything has a file descriptor, but let’s not indulge in that debate.
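A short sketch can make this tangible. In the hypothetical example below, a terminal, a character device, and a pipe are all just file descriptors, and the very same write() call works on each of them:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int dev = open("/dev/null", O_WRONLY);       /* a character device */
    int pipefd[2];
    pipe(pipefd);                                /* an anonymous pipe */

    const char msg[] = "same call, different objects\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);  /* the terminal (fd 1) */
    write(dev, msg, sizeof(msg) - 1);            /* the device */
    write(pipefd[1], msg, sizeof(msg) - 1);      /* the pipe */

    printf("descriptors: stdout=%d, /dev/null=%d, pipe write end=%d\n",
           STDOUT_FILENO, dev, pipefd[1]);

    close(dev);
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}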
The everything is a file ideology that characterizes the architecture of a Linux system is also implemented courtesy of the VFS. Earlier, we defined pseudo filesystems as filesystems that generate their content on the fly. These filesystems are also referred to as VFSes and play a major role in implementing this concept.
You can retrieve the list of filesystems currently registered with the kernel through the procfs pseudo filesystem. When viewing this list, note nodev in the first column against some filesystems. nodev indicates that this is a pseudo filesystem and is not backed by a block device. Filesystems such as Ext2, 3, and 4 are created on a block device; hence, they do not have the nodev entry in the first column:

cat /proc/filesystems

[root@linuxbox ~]# cat /proc/filesystems
nodev	sysfs
nodev	tmpfs
nodev	bdev
nodev	proc
nodev	cgroup
nodev	cgroup2
nodev	cpuset
nodev	devtmpfs
nodev	configfs
nodev	debugfs
nodev	tracefs
nodev	securityfs
nodev	sockfs
nodev	bpf
nodev	pipefs
nodev	ramfs
[The rest of the code is skipped for brevity.]
You can also use the mount command to find out about the currently mounted pseudo filesystems in your system:

mount | grep -v sd | grep -ivE ":/|mapper"

[root@linuxbox ~]# mount | grep -v sd | grep -ivE ":/|mapper"
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=1993552k,nr_inodes=498388,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
[The rest of the code is skipped for brevity.]
Let's take a tour of the /proc directory. You'll see a long list of numbered directories; these numbers represent the IDs of all the processes currently running on your system:

[root@linuxbox ~]# ls /proc/
1 1116 1228072 1235 1534 196 216 30 54 6 631 668 810 ioports scsi
10 1121 1228220 1243 1535 197 217 32 55 600 632 670 9 irq self
1038 1125 1228371 1264 1536 198 218 345 56 602 633 673 905 kallsyms slabinfo
1039 1127 1228376 13 1537 199 219 347 570 603 634 675 91 kcore softirqs
1040 1197 1228378 14 1538 2 22 348 573 605 635 677 947 keys stat
1041 12 1228379 1442 16 20 220 37 574 607 636 679 acpi key-users swaps
1042 1205 1228385 1443 1604 200 221 38 576 609 637 681 buddyinfo kmsg sys
1043 1213 1228386 1444 1611 201 222 39 577 610 638 684 bus kpagecgroup sysrq-
[The rest of the code is skipped for brevity.]
The procfs filesystem offers us a glimpse into the running state of the kernel. The content in /proc is generated when we want to view this information; it is not persistently present on your disk drive. This all happens in memory. As you can see from the ls command, the size of /proc on disk is zero bytes:

[root@linuxbox ~]# ls -ld /proc/
dr-xr-xr-x 292 root root 0 Sep 20 00:41 /proc/
[root@linuxbox ~]#
/proc provides an on-the-spot view of the processes running on the system. Consider the /proc/cpuinfo file, which displays the processor-related information for your system. If we check this file, it will be shown as empty:

[root@linuxbox ~]# ls -l /proc/cpuinfo
-r--r--r-- 1 root root 0 Nov 5 02:02 /proc/cpuinfo
[root@linuxbox ~]#
[root@linuxbox ~]# file /proc/cpuinfo
/proc/cpuinfo: empty
[root@linuxbox ~]#
However, when the file contents are viewed through cat, they show a lot of information:

[root@linuxbox ~]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
stepping : 1
microcode : 0xb00003e
cpu MHz : 2099.998
cache size : 40960 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
[The rest of the code is skipped for brevity.]
Linux abstracts all entities such as processes, directories, network sockets, and storage devices into the VFS. Through the VFS, we can retrieve information from the kernel. Most Linux distributions offer a variety of tools for monitoring the consumption of storage, compute, and memory resources. All these tools gather stats for various metrics through the data available in procfs. For instance, the mpstat command, which provides stats about all the processors in a system, retrieves data from the /proc/stat file. It then presents this data in a human-readable format for a better understanding:

[root@linuxbox ~]# cat /proc/stat
cpu 5441359 345061 4902126 1576734730 46546 1375926 942018 0 0 0
cpu0 1276258 81287 1176897 394542528 13159 255659 280236 0 0 0
cpu1 1455759 126524 1299970 394192241 13392 314865 178446 0 0 0
cpu2 1445048 126489 1319450 394145153 12496 318550 186289 0 0 0
cpu3 1264293 10760 1105807 393854806 7498 486850 297045 0 0 0
[The rest of the code is skipped for brevity.]
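As a rough illustration of what such tools do behind the scenes (this is not how mpstat is actually implemented), the following sketch opens /proc/stat like any ordinary file and prints the per-CPU counter lines; the content is generated by the kernel at the moment it is read:

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/stat", "r");
    if (!fp) {
        perror("fopen");
        return 1;
    }

    char line[512];
    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, "cpu", 3) == 0)   /* aggregate and per-CPU counters */
            fputs(line, stdout);
    }

    fclose(fp);
    return 0;
}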
If we use the strace utility on the mpstat command, it will show that under the hood, mpstat uses the /proc/stat file to display processor stats:

strace mpstat 2>&1 |grep "/proc/stat"

[root@linuxbox ~]# strace mpstat 2>&1 |grep "/proc/stat"
openat(AT_FDCWD, "/proc/stat", O_RDONLY) = 3
[root@linuxbox ~]#
Similarly, popular commands such as top, ps, and free gather memory-related information from the /proc/meminfo file:

[root@linuxbox ~]# strace free -h 2>&1 |grep meminfo
openat(AT_FDCWD, "/proc/meminfo", O_RDONLY) = 3
[root@linuxbox ~]#
Similar to /proc, another commonly used pseudo filesystem is sysfs, which is mounted at /sys. The sysfs filesystem mostly contains information about hardware devices on your system. For example, to find information about the disk drive in your system, such as its model, you can issue the following command:

cat /sys/block/sda/device/model

[root@linuxbox ~]# cat /sys/block/sda/device/model
SAMSUNG MZMTE512
[root@linuxbox ~]#
Even the LEDs on a keyboard have a corresponding file in /sys:

[root@linuxbox ~]# ls /sys/class/leds
ath9k-phy0  input4::capslock  input4::numlock  input4::scrolllock
[root@linuxbox ~]#
The everything is a file philosophy is one of the defining features of the Linux kernel. It signifies that everything in a system, including regular text files, directories, and devices, can be abstracted over the VFS layer in the kernel. As a result, all these entities can be represented as file-like objects through the VFS layer. There are several pseudo filesystems in Linux that contain information about the different kernel subsystems. The content of these pseudo filesystems is only present in memory and generated dynamically.
Summary
The Linux storage stack is a complex design and consists of multiple layers, all of which work in coordination. Like other hardware resources, storage lies in the kernel space. When a user space program wants to access any of these resources, it has to invoke a system call. The system call interface in Linux allows user space programs to access resources in the kernel space. When a user space program wants to access something on the disk, the first component it interacts with is the VFS subsystem. The VFS provides an abstraction of filesystem-related interfaces and is responsible for accommodating multiple filesystems in the kernel. Through its common filesystem interface, the VFS intercepts the generic system calls (such as read() and write()) from user space programs and redirects them to the appropriate interfaces in the filesystem layer. Because of this approach, user space programs do not need to worry about the underlying filesystems being used, and they can uniformly perform filesystem operations.
This chapter served as an introduction to the Virtual Filesystem, a major Linux kernel subsystem, and its primary functions in the kernel. The VFS provides a common interface for all filesystems through data structures such as inodes, superblocks, and directory entries. In the next chapter, we will take a look at these data structures and explain how they all help the VFS manage multiple filesystems.