From Source to Binary
In programming, everything starts with source code. Source code, sometimes called the code base, usually consists of a number of text files, each containing textual instructions written in a programming language.
A CPU cannot execute textual instructions. These instructions must first be compiled (or translated) into machine-level instructions that the CPU can execute, which eventually results in a running program.
In this chapter, we go through the steps needed to get a final product out of C source code. We cover the subject in depth, split into five sections: the compilation pipeline, the preprocessor, the compiler, the assembler, and the linker.
Compilation pipeline
Compiling some C files usually takes a few seconds, but in that brief time the source code passes through a pipeline of four distinct components, each performing a specific task: the preprocessor, the compiler, the assembler, and the linker.
Each component in this pipeline accepts a certain input from the previous component and produces a certain output for the next component in the pipeline. This process continues through the pipeline until a product is generated by the last component.
Source code can be turned into a product if, and only if, it passes through all the required components successfully. Even a small failure in one of the components leads to a compilation or linkage failure, which is reported to you through the relevant error messages.
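As a rough illustration of the four components, the sketch below is a single small source file, with the gcc invocations that stop the pipeline after each stage listed in the leading comment. The file names average.c and main.o and the function itself are hypothetical; -E, -S, and -c are standard gcc options, but the exact contents of each output depend on your platform and toolchain.

```c
/* average.c: a hypothetical one-file example for walking through the pipeline.
 *
 * Stopping after each component with the gcc driver:
 *   gcc -E average.c -o average.i   preprocessor only: emits the translation unit
 *   gcc -S average.c -o average.s   up to the compiler: emits assembly code
 *   gcc -c average.c -o average.o   up to the assembler: emits a relocatable object file
 *
 * The last component, the linker, needs a main() from another object file
 * (main.o is assumed here):
 *   gcc average.o main.o -o average.out
 */
#include <stddef.h>

#define EMPTY_AVERAGE 0.0  /* replaced textually by the preprocessor */

/* Returns the arithmetic mean of `count` values, or EMPTY_AVERAGE if count is 0. */
double average(const int *values, size_t count) {
    if (count == 0) {
        return EMPTY_AVERAGE;
    }
    long sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += values[i];
    }
    return (double)sum / (double)count;
}
```

Because the file has no main function, it can pass through the first three components on its own, but it cannot be linked into an executable without another object file.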
For certain intermediate products such as relocatable object files, it is enough that a single source file goes through the first three components...
Preprocessor
In Chapter 1, Essential Features, at the very start of this book, we briefly introduced the C preprocessor. Specifically, we talked about macros, conditional compilation, and header guards.
As discussed there, preprocessing is an essential feature of the C language, and one that is not easily found in other programming languages. In the simplest terms, preprocessing allows you to modify your source code before it is sent for compilation. It also allows you to divide your source code, especially the declarations, into header files that you can later include in multiple source files to reuse those declarations.
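The following is a minimal sketch of the three features mentioned above: a header guard, a macro, and conditional compilation. All names (SHAPES_H, SQUARE, USE_DOUBLE_SIDE, and the functions) are hypothetical, and the header part is shown inline only to keep the example in one file.

```c
/* In a real project the guarded part below would live in its own header,
 * for example a hypothetical shapes.h, included with #include "shapes.h". */
#ifndef SHAPES_H          /* header guard: skip the body if already included */
#define SHAPES_H

#define SQUARE(x) ((x) * (x))   /* macro: expanded textually before compilation */

int rect_area(int width, int height);   /* declaration reusable by many source files */

#endif /* SHAPES_H */

/* Conditional compilation: only one branch reaches the compiler.
 * Building with -DUSE_DOUBLE_SIDE (a hypothetical macro name) selects the first. */
#ifdef USE_DOUBLE_SIDE
#define SIDE_SCALE 2
#else
#define SIDE_SCALE 1
#endif

/* Returns the area of a width-by-height rectangle. */
int rect_area(int width, int height) {
    return width * height;
}

/* Returns the area of a square whose side is scaled by SIDE_SCALE. */
int scaled_square_area(int side) {
    return SQUARE(side * SIDE_SCALE);
}
```

Running gcc -E on a file like this prints the translation unit after preprocessing, with the macros expanded and the unselected branch removed.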
It is vital to remember that if you have a syntax error in your source code, the preprocessor will not find the error as it does not know anything about the C syntax. Instead, it will just perform some easy tasks, which typically...
Compiler
As we discussed in the previous sections, the compiler accepts the translation unit prepared by the preprocessor and generates the corresponding assembly instructions. When multiple C sources are compiled into their equivalent assembly code, the existing tools in the platform, such as the assembler and the linker, manage the rest by making relocatable object files out of the generated assembly code and finally linking them together (and possibly with other object files) to form a library or an executable file.
We mentioned as and ld as two of the many tools available in Unix for C development. These tools are mainly used to create platform-compatible object files, and they necessarily exist outside of gcc or any other compiler. By this we mean that they are not developed as part of gcc (chosen here as an example) and that they should be available on any platform even without gcc installed....
Assembler
As we explained before, a platform has to have an assembler in order to produce object files that contain correct machine-level instructions. In a Unix-like operating system, the assembler can be invoked by using the as utility program. In the rest of this section, we are going to discuss what can be put in an object file by the assembler.
An important point is that if you install two different Unix-like operating systems on the same architecture, the installed assemblers might not be the same. This means that, even though the machine-level instructions are the same because the hardware is the same, the produced object files can be different!
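To make the platform dependence concrete, here is a small sketch that uses predefined compiler macros to report which object file format a toolchain typically produces on the platform where the file is compiled. The function names are hypothetical, and the mapping covers only a few common cases. Even where the container format is shared, as with ELF on both Linux and FreeBSD, other details of the produced files can still differ.

```c
#include <stdio.h>

/* Returns the object file format a toolchain on this platform typically
 * produces. This is an illustrative mapping for a few common systems,
 * not an exhaustive or authoritative table. */
const char *typical_object_format(void) {
#if defined(__APPLE__)
    return "Mach-O";
#elif defined(_WIN32)
    return "PE/COFF";
#elif defined(__linux__) || defined(__FreeBSD__)
    return "ELF";
#else
    return "unknown";
#endif
}

/* Prints the format name for the platform this file was compiled on. */
void print_object_format(void) {
    printf("Typical object file format here: %s\n", typical_object_format());
}
```

The branch is chosen by the preprocessor at build time, so the same source file yields a different result on each platform.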
If you compile a program and produce the corresponding object file on Linux for an AMD64 architecture, the result could differ from what you would get by compiling the same program on a different operating system, such as FreeBSD or macOS, on the same hardware. This implies that while the object files cannot...
Linker
The first big step in building a C project is compiling all the source files to their corresponding relocatable object files. This step is necessary for preparing the final products, but it is not enough on its own: one more step is still needed. Before going through the details of this step, we need to have a quick look at the possible products (sometimes referred to as artifacts) in a C project.
A C/C++ project can lead to the following products:
- A number of executable files that usually have the .out extension in most Unix-like operating systems. These files usually have the .exe extension in Microsoft Windows.
- A number of static libraries that usually have the .a extension in most Unix-like operating systems. These files have the .lib extension in Microsoft Windows.
- A number of dynamic libraries or shared object files that usually have the .so extension in most Unix-like operating systems. These files have the .dylib extension in macOS, and...
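As a hedged sketch of how one source file can end up in each kind of product, the comment at the top of the file below lists typical gcc and ar invocations on a Unix-like system. The file names geometry.c and main.c and the functions are hypothetical, and the exact commands may vary between toolchains.

```c
/* geometry.c: a hypothetical source file used to illustrate the products.
 * The commands assume gcc and ar on a Unix-like system.
 *
 * Relocatable object file:
 *   gcc -c geometry.c -o geometry.o
 *
 * Static library built from that object file:
 *   ar rcs libgeometry.a geometry.o
 *
 * Shared object file (dynamic library):
 *   gcc -shared -fPIC geometry.c -o libgeometry.so
 *
 * Executable, given a hypothetical main.c that calls the functions below:
 *   gcc main.c geometry.o -o geometry.out
 */

/* Returns the perimeter of a width-by-height rectangle. */
int rect_perimeter(int width, int height) {
    return 2 * (width + height);
}

/* Returns 1 if the rectangle is a square, 0 otherwise. */
int is_square(int width, int height) {
    return width == height;
}
```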
Summary
In this chapter, we covered the fundamental steps and components required to build a C project. Without knowing how to build a project, it is pointless to just write code. In this chapter:
- We went through the C compilation pipeline and its various steps. We discussed each step and described the inputs and the outputs.
- We defined the term platform and explained how different assemblers can lead to different machine-level instructions for the same C program.
- We continued to discuss each step, and the component driving that step, in greater detail.
- As part of the compiler component, we explained what the compiler frontends and backends are, and how GCC and LLVM use this separation to support many languages.
- As part of our discussion regarding the assembler component, we saw that object files are platform-dependent, and they should have an exact file format.
- As part of the linker component, we discussed what a linker does and how it uses symbols to find the missing...