Mastering Assembly Programming

Intel Architecture

-What languages do you usually use?
-C and Assembly. In fact, I love programming in Assembly.
-Hmmm... I would not have publicly admitted that...

When speaking about the Assembly language, people usually imagine a sort of unknown and dangerous beast, which obeys only the weirdest representatives of the programming community, or a gun that may only be used for shooting your own leg. Just as any prejudice, this one is rooted in ignorance and the primal fear of the unknown. The purpose of this book is not only to help you overcome this prejudice, but also to show how the Assembly language may become a powerful tool, a sharp lancet, that will help you perform certain tasks, even sophisticated ones, with elegance and relative simplicity, avoiding the unnecessary complications which are, sometimes, implied by high-level languages.

First of all, what is the Assembly language? To put it simply and precisely, we may safely define the Assembly language as symbolic or human readable machine code as each Assembly instruction translates into a single machine instruction (with a few exceptions). To be even more precise, there is no such thing as a single Assembly language as, instead, there are numerous Assembly languages--one per platform, where a platform is a programmable device. Almost any programmable device with a certain instruction set may have its own Assembly language, but this is not always so. Exceptions are devices such as, for example, NAND flash chips, which have their own command set, but have no means for fetching instructions from memory and executing them without implicitly being told to do so.

In order to be able to effectively use the Assembly language, one has to have a precise understanding of the underlying platform, as programming in the Assembly language means "talking" directly to the device. The deeper such understanding is, the more efficient is Assembly programming; however, we are not going to look at this in great detail, as this is beyond the scope of the book. One book would not be enough to cover each and every aspect of the specific architecture. Since we are going to concentrate on the Intel architecture during the course of this book, let's try to obtain at least a general understanding of Intel's x86/AMD64 architectures, and try to enrich it and make it deeper.

This chapter, in particular, covers processor registers and the functionality thereof and briefly describes memory organization (for example, segmentation and paging).

General purpose registers: Despite the fact that some of them have special meanings under certain circumstances, these registers, as the name of the group states, may be used for any purpose.
Floating point registers: These registers are used for floating point operations.
Segment registers: These registers are hardly accessed by applications (the most common case is setting up structured exception handlers on Windows); however, it is important to cover them here so we may better understand the way the CPU percives RAM. The part of the chapter that discusses segment registers also addresses a few memory organization aspects, such as segmentation and paging.
Control registers: This is a tiny group of registers of registers of high importance, as they control the behavior of the processor as well as enable or disable certain features.
Debug registers: Although registers of this group are mainly used by debuggers, they add some interesting abilities to our code, for example the ability to set hardware breakpoints when tracing execution of a program.
EFlags register: This is also known as the status register on some platforms. This one provides us with the information regarding the result of the latest arithmetic logic unit (ALU) operation performed, as well as some settings of the CPU itself.

Processor registers

Each programmable device, and Intel processors are not an exception, has a set of general purpose registers--memory cells located physically on the die, thus providing low latency access. They are used for temporary storage of data that a processor operates on or data that is frequently accessed (if the amount of general purpose registers allows this). The amount and bit size of registers on an Intel CPU vary in accordance with the current mode of operation. An Intel CPU has at least two modes:

Real mode: This is the good old DOS mode. When the processor is powered up, it starts in the real mode, which has certain limitations, such as the size of the address bus, which is only 20 bits, and the segmented memory space.
Protected mode: This was first introduced in 80286. This mode provides access to larger amount of memory, as it uses different memory segmentation mechanisms. Paging, introduced in 80386, allows even easier memory addressing virtualization.

Since about 2003, we also have the so-called long mode--64-bit registers/addressing (although, not all 64 bits are used for addressing yet), flat memory model, and RIP-based addressing (addressing relative to the instruction pointer register). In this book, we will work with 32-bit protected (there is such a thing as the 16-bit protected mode, but that is out of scope) and Long, which is a 64-bit mode of operation. The long mode may be considered a 64-bit extension of the protected mode, which evolved from 16-bit to 32-bit. It is important to know that registers accessible in the earlier mode are also accessible in the newer mode, meaning that the registers that were accessible in the real mode are also accessible in the protected mode, and that registers accessible in the protected mode would also be accessible in the long mode (if long mode is supported by the processor). There are a few exceptions regarding the bit width of certain registers and we will look at this soon in this chapter. However, since 16-bit modes (real and 16-bit protected modes) are no longer used by application developers (with minor possible exceptions), in this book, we will work on protected and long modes only.

General purpose registers

Depending on the mode of the operation (protected or long), there are 8 to 16 available general purpose registers in modern Intel processors. Each register is divided into subregisters, allowing access to data with a bit width lower than the width of the register.

The following table shows general purpose registers (further referred to as GPR):

Table 1: x86/x86_64 registers

All R* registers are only available in the long mode. Registers SIL, DIL, BPL, and SPL are only available in the long mode. Registers AH, BH, CH, and DH cannot be used in instructions that are not valid outside the long mode.

For convenience, we will refer to the registers by their 32-bit names (such as EAX, EBX, and so on) when we do not need to explicitly refer to a register of a certain bit width. The preceding table shows all general purpose registers available on the Intel platform. Some of them are only available in the long mode (all 64-bit registers, R* registers, and a few of the 8-bit registers) and certain combinations are not allowed. However, despite the fact that we can use those registers for any purpose, some of them do have a special meaning in certain circumstances.

Accumulators

The EAX register is also known as an accumulator and is used with multiplication and division operations, both as implied and target operands. It is important to mention that the result of a binary multiplication is twice the size of the operands and the result of a binary division consists of two parts (quotient and remainder), each of which has the same bit width as the operands. Since the x86 architecture began with 16-bit registers and for the sake of backward compatibility, the EDX register is used for storing partial results when the values of the operands are larger than could fit into 8 bits. For example, if we want to multiply two bytes, 0x50 and 0x04, we would expect the result to be 0x140, which cannot be stored in a single byte. However, since the operands were 8 bits in size, the result is stored into the AX register, which is 16 bits. But if we want to multiply 0x150 by 0x104, the result would need 17 bits to be stored (0x150 * 0x104 = 0x15540) and, as we have mentioned already, the first x86 registers were only 16 bits. This is the reason for using an additional register; in the case of the Intel architecture, this register is EDX (to be more precise, only the DX part would be used in this specific case). As a verbal explanation may sometimes be too generalized, it would be better to simply demonstrate the rule.

Operand size	Source 1	Source 2	Destination
8 bits (byte)	AL	8-bit register or 8-bit memory	AX
16 bits (word)	AX	16-bit register or 16-bit memory	DX:AX
32 bits (double word)	EAX	32-bit register or 32-bit memory	EDX:EAX
64 bits (quad word)	RAX	64-bit register or 64-bit memory	RDX:RAX

Division implies a slightly different rule. To be more precise, this is the inverted multiplication rule, meaning that the result of the operation is half the bit width of the dividend, which in turn means that the largest dividend in the long mode may be 128-bit wide. The smallest dividend value remains the same as in the smallest value of the source operand in the case of multiplication--8 bits.

Operand size	Dividend	Divisor	Quotient	Remainder
8/16 bits	AX	8-bit register or 8-bit memory	AL	AH
16/32 bits	DX:AX	16-bit memory or 16-bit register	AX	DX
32/64 bits	EDX:EAX	32-bit register or 32-bit memory	EAX	EDX
64/128 bits	RDX:RAX	64-bit register or 64-bit memory	RAX	RDX

Counter

ECX register - also known as counter register. This register is used in loops as a loop iteration counter. It is first loaded with a number of iterations, and then decremented each time the loop instruction is executed until the value stored in ECX becomes zero, which instructs the processor to break out of the loop. We can compare this to the do{...}while() clause in C:

int ecx = 10; 
do 
{ 
    // do your stuff 
    ecx--; 
}while(ecx > 0);

Another common usage of this register, actually the usage of its least significant part, CL, is in bitwise shift operations, where it contains the number of bits in which the source operand should be shifted. Consider the following code, for example:

mov eax, 0x12345
mov cl, 5 
shl eax, cl

This would result in the register EAX being shifted 5 bits to the left (having the value of 0x2468a0 as a result).

Stack pointer

An ESP register is the stack pointer. This register, together with the SS register (the SS register is explained a bit later in this chapter), describes the stack area of a thread, where SS contains the descriptor of the stack segment and ESP is the index that points to the current position within the stack.

Source and destination indices

ESI and EDI registers serve as source and destination index registers in string operations, where ESI contains the source address and EDI, obviously, the destination address. We will talk about these registers a bit more in Chapter 3, Intel Instruction Set Architecture (ISA).

Base pointer

EBP. This register is called the base pointer as its most common use is to point to the base of a stack frame during function calls. However, unlike the previously discussed registers, you may use any other register for this purpose if needed.

Another register worth mentioning here is EBX, which, in the good old days of 16-bit modes (when it was still just a BX register), was one of the few registers that we could use as a base for addressing. Unlike EBP, EBX was (in the case of the XLAT instruction, which by default uses DS:EBX, still is) intended to point to a data segment.

Instruction pointer

There is one more special register that cannot be used for data storage--EIP (IP in the real mode or RIP in the long mode). This is the instruction pointer and contains the address of the instruction after the instruction currently being executed. All instructions are implicitly fetched from the code segment by the CPU; thus the full address of the instruction following the one being executed should be described as CS:IP. Also, there is no regular way to modify its content directly. It is not impossible, but we can't just use a mov instruction in order to load a value into EIP.

All the other registers have no special meaning from the processor's perspective and may be used for any purpose.

Floating point registers

The CPU itself has no means for floating point arithmetic operations. In 1980, Intel announced the Intel 8087 - the floating point coprocessor for the 8086 line. 8087 remained as a separate installable device until 1989, when Intel came up with the 80486 (i486) processor, which had an integrated 8087 circuit. However, when talking about floating point registers and floating point instructions, we still refer to 8087 as a floating-point unit (FPU) or, sometimes, still as a floating-point coprocessor (however, the latter is becoming more and more rare).

8087 has eight registers, 80 bits each, arranged in a stack fashion, meaning that operands are pushed onto this stack from the memory and results are popped from the topmost register to the memory. These registers are named ST0 to ST7 (ST--stack) and the most used one, that is, the ST0 register, may be referred to as simply ST.

The floating-point coprocessor supports several data types:

80-bit extended-precision real
64-bit double-precision real
32-bit single-precision real
18-digit decimal integer
64-bit binary integer
32-bit binary integer
16-bit binary integer

The floating-point coprocessor will be discussed in more detail in Chapter 3, Intel Instruction Set Architecture (ISA).

XMM registers

The 128-bit XMM registers are part of the SSE extension (where SSE is short for Streaming SIMD Extension, and SIMD, in turn, stands for single instruction multiple data). There are eight XMM registers available in non -64-bit modes and 16 XMM registers in long mode, which allow simultaneous operations on:

16 bytes
eight words
four double words
two quad words
four floats
two doubles

We will pay much more attention to these registers and the technology behind them in Chapter 5, Parallel Data Processing.

Segment registers and memory organization

Memory organization is one of the most important aspects of CPU design. The first thing to note is that when we say "memory organization", we do not mean its physical layout on memory chips/boards. For us, it is much more important how the CPU sees memory and how it communicates with it (on a higher level, of course, as we are not going to dive into the hardware aspects of the architecture).

However, as the book is dedicated to application programming, rather than operating system development, we will further consider the most relevant aspects of memory organization and access in this section.

Real mode

Segment registers are a rather interesting topic, as they are the ones that tell the processor which memory areas may be accessed and how exactly they may be accessed. In real mode, segment registers used to contain a 16-bit segment address. The difference between a normal address and segment address is that the latter is shifted 4 bits to the right when stored in the segment register. For example, if a certain segment register was loaded with the 0x1234 value, it, in fact, was pointing to the address 0x12340; therefore, pointers in real mode were rather offsets into segments pointed to by segment registers. As an example, let's take the DI register (as we are talking about a 16-bit real mode now), which is used with the DS (data segment) register automatically, and load it with, let's say, 0x4321 when the DS register is loaded with the 0x1234 value. Then the 20-bit address would be 0x12340 + 0x4321 = 0x16661. Thus, it was possible to address at most 1 MB of memory in real mode.

There are in total six segment registers:

CS: This register contains the base address of the currently used code segment.
DS: This register contains the base address of the currently used data segment.
SS: This register contains the base address of the currently used stack segment.
ES: This is the extra data segment for the programmer's use.
FS and GS: These were introduced with the Intel 80386 processor. These two segment registers have no specific hardware-defined function and are for the programmer's use. It is important to know that they do have specific tasks in Windows and Linux, but those tasks are operating system dependent only and have no connection to hardware specifications.

The CS register is used together with the IP register (the instructions pointer, also known as the program counter on other platforms), where the IP (or EIP in protected mode and RIP in long mode) points to the offset of the instruction in the code segment following the instruction currently being executed.

DS and ES are implied when using SI and DI registers, respectively, unless another segment register is implicitly specified in the instruction. For example, the lodsb instruction, although, it is written with no operands, loads a byte from the address specified by DS:SI into the AL register and the stosb instruction (which has no visible operands either) stores a byte from the AL register at the address specified by ES:DI. Using SI/DI registers with other segments would require explicitly mentioning those segments with the relevant segment register. Consider the following code, for example:

mov ax, [si] 
mov [es:di], ax

The preceding code loads a double word from the location pointed by DS:SI and stores it to another location pointed by ES:DI.

The interesting thing about segment registers and segments at all is that they may peacefully overlap. Consider a situation where you want to copy a portion of code to either another place in the code segment or into a temporary buffer (for example, for decryptor). In such a case, both CS and DS registers may either point to the same location or the DS register may point somewhere into the code segment.

Protected mode - segmentation

While it was all fine and simple in real mode, things become a bit more complicated when it comes to protected mode. Unfortunately, memory segmentation is still intact, but the segment register no longer contain addresses. Instead, they are loaded with the so-called selectors, which are, in turn, the indices into the descriptor table multiplied by 8 (shifted 3 bits to the left). The two least significant bits designate the requested privilege level (0 for kernel space to 3 for user land). The third bit (at index 2) is the TI bit (table indicator), which indicates whether the descriptor being referred is in a global descriptor table (0) or in a local descriptor table (1). The memory descriptor is a tiny 8-byte structure, which describes the range of physical memory, its access rights, and some additional attributes:

Table 2: Memory descriptor structure

Descriptors are stored in at least two tables:

GDT: Global descriptor table (used by the operating system)
LDT: Local descriptor table (per task descriptor table)

As we may conclude, the organization of memory in protected mode is not that different from that in real mode after all.

There are other types of descriptors--interrupt descriptors (stored in the interrupt description table (IDT)) and system descriptors; however, since these are in use in kernel space only, we will not discuss them, as that falls out of the scope of this book.

Protected mode - paging

Paging is a more convenient memory management scheme introduced in 80386 and has been a bit enhanced since then. The idea behind paging is memory virtualization--this is the mechanism that makes it possible for different processes to have the same memory layout. In fact, the addresses we use in pointers (if we are writing in C, C++, or any other high-level language that compiles into native code) are virtual and do not correspond to physical addresses. The translation of a virtual address into a physical address is implemented in hardware and is performed by the CPU (however, some operating system interventions are possible).

By default, a 32-bit CPU uses a two-level translation scheme for the derivation of a physical address from the supplied virtual one.

The following table explains how a virtual address is used in order to find a physical address:

Address bits	Meaning
0 - 11	Offset into a 4 KB page
12 - 21	Index of the page entry in the table of 1024 pages
22 - 31	Index of the page table entry in a 1024-entries page directory

Table 3: Virtual address to physical address translation

Most, if not all, modern processors based on the Intel architecture also support Page Size Extension (PSE), which makes it possible to use the so-called large pages of 4 MB. In this case, the translation of a virtual address into a physical address is a bit different, as there is no page table any more. The following table shows the meaning of bits in a 32-bit virtual address:

Address bits	Meaning
0 - 21	Offset into a 4 MB page
22 - 31	Index of the corresponding entry in a 1024-entries page directory

Table 4: Virtual address to physical address translation with PSE enabled

Furthermore, the Physical Address Extension (PAE) was introduced, which significantly changes the scheme and allows access to a much bigger range of memory. In protected mode, PAE adds a page directory pointer table of four entries and the virtual to physical address conversion would be as per the following table:

Address bits	Meaning
0 - 11	Offset into a 4 KB page
12 - 20	Index of a page entry in the table of 512 pages
21 - 29	Index of a page table entry in a 512-entries page directory
30 - 31	Index of a page directory entry in a four-entries page directory pointer table

Table 5: Virtual to physical address translation with PAE enabled (no PSE)

Enabling PSE in addition to PAE forces each entry in the page directory to point directly to a 2 MB page instead of an entry in a page table.

Long mode - paging

The only address virtualization allowed in long mode is paging with PAE enabled; however, it adds one more table--the page map level 4 table as the root entry; therefore, the conversion of a virtual address to a physical address uses the bits of a virtual address in the way described in the following table:

Address bits	Meaning
0 - 11	Offset into a 4 KB page
12 - 20	Index of a page entry in the table of 512 pages
21 - 29	Index of a page table entry in the page directory
30 - 38	Index of a page directory entry in the page directory pointer table
39 - 47	Index of a page directory pointer table in the page-map level 4 table

Table 6: Virtual to physical address translation in long mode

It is, however, important to mention that despite the fact that it is a 64-bit architecture, the MMU only uses the first 48 bits of the virtual address (also called the linear address).

The whole process of address resolution is performed by the memory management unit (MMU) in the CPU itself, and the programmer is only responsible for actually building these tables and enabling PAE/PSE. However, this topic is much wider than may be covered in a single chapter and falls a bit out of the scope of this book.

Control registers

Processors based on the Intel architecture have a set of control registers that are used for configuration of the processor at run time (such as switching between execution modes). These registers are 32-bit wide on x86 and 64-bit wide on AMD64 (long mode).

There are six control registers and one Extended Feature Enable Register (EFER):

CR0: This register contains various control flags that modify the basic operation of the processor.
CR1: This register is reserved for future use.
CR2: This register contains the Page Fault Linear Address when a page fault occurs.
CR3: This register is used when virtual addressing is enabled (paging) and contains the physical address of the page directory, page directory pointer table, or page map level 4 table, depending on the current mode of operation.
CR4: This register is used in the protected mode for controlling different options of the processor.
CR8: This register is new and is only available in long mode. It is used for prioritization of external interrupts.
EFER: This register is one of the several model-specific registers. It is used for enabling/disabling SYSCALL/SYSRET instructions, entering/exiting long mode, and a few other features. Other model-specific registers are of no interest for us.

However, these registers are not accessible in ring3 (user land).

Debug registers

In addition to control registers, processors also have a set of so-called debug registers, which are mostly used by debuggers for setting the so-called hardware breakpoints. These registers are in fact a very powerful tool when it comes to control over other threads or even processes.

Debug address registers DR0 - DR3

Debug registers 0 to 3 (DR0, DR1, DR2, and DR3) are used to store virtual (linear) addresses of the so-called hardware breakpoints.

Debug control register (DR7)

DR7 defines how the breakpoints set in Debug Address Registers should be interpreted by the processor and whether they should be interpreted at all.

The bits layout of this register is shown in the following table:

Table 3: DR7 bit layout

L* bits, when set to 1, enable breakpoint at the address which is specified in the corresponding Debug Address Register locally--within a task. These bits are reset by the processor on each task switch. G* bits, on the contrary, enable breakpoints globally--for all tasks, meaning that these bits are not reset by the processor.

The R/W* bits specify breakpoint conditions, as follows:

00: Break on instruction execution
01: Break when the specified address is accessed for writing only
10: Undefined
11: Break on either read or write access or when an instruction at the specified address is executed

The LEN* bits specify the size of a breakpoint in bytes, thus, allowing coverage of more than one instruction or more than one byte of data:

00: Breakpoint is 1-byte long
01: Breakpoint is 2-bytes long
10: Breakpoint is 8-bytes long (long mode only)
11: Breakpoint is 4-bytes long

Debug status register (DR6)

When an enabled breakpoint is triggered, the corresponding bit of the four low-order bits in DR6 is set to 1 before entering the debug handler, thus, providing the handler with information about the triggered breakpoint (bit 0 corresponds to the breakpoint in DR0, bit 1 to the breakpoint in DR1, and so on).

The EFlags register

It would have been impossible to write programs in any language for a given platform if the processor had no means to report its status and/or the status of the last operation. More than that, the processor itself needs this information from time to time. Try to imagine a processor unable to conditionally control the execution flow of a program--sounds like a nightmare, doesn't it?

The most common way for a program to obtain information on the last operation or on a certain configuration of an Intel-based processor is through the EFlags register (E stands for extended). This register is referred to as Flags in real mode, EFlags in protected mode, or RFlags in long mode.

Let's take a look at the meaning of the individual bits (also referred to as flags) of this register and its usage.

Bit #0 - carry flag

The carry flag (CF) is mostly used for the detection of carry/borrow in arithmetic operations and is set if the bit width result of the last such operation (such as addition and subtraction) exceeds the width of the ALU. For example, the addition of two 8-bit values, 255 and 1, would result in 256, which requires at least nine bits to be stored. In such a case, bit eight (the ninth bit) is placed into the CF, thus, letting us and the processor know that the last operation ended with carry.

Bit #2 - parity flag

The parity flag (PF) is set to 1 in case the number of 1s in the least significant byte is even; otherwise, the flag is set to zero.

Bit #4 - adjust flag

The adjust flag (AF) signals when a carry or borrow occurred in the four least significant bits (lower nibble) and is primarily used with binary coded decimal (BCD) arithmetics.

Bit #6 - zero flag

The zero flag (ZF) is set when the result of an arithmetic or bitwise operation is 0. This includes operations that do not store the result (for example, comparison and bit test).

Bit #7 - sign flag

The sign flag (SF) is set when the last mathematical operation resulted in a negative number; in other words, when the most significant bit of the result was set.

Bit #8 - trap flag

When set, the trap flag (TF) causes a single step interrupt after every executed instruction.

Bit #9 - interrupt enable flag

The interrup enable flag (IF) defines whether processor will or will not react to incoming interrupts. This flag is only accessible in real mode or at the Ring 0 protection level in other modes.

Bit #10 - direction flag

The direction flag (DF) controls the direction of string operations. An operation is performed from the lower address to the higher address if the flag is reset (is 0) or from the higher address to the lower address if the flag is set (is 1).

Bit #11 - overflow flag

The overflow flag (OF) is sometimes perceived as two's complement form of the carry flag, which is not really the case. OF is set when the result of the operation is either too small or too big a number to fit into the destination operand. For example, consider the addition of two 8-bit positive values, 0x74 and 0x7f. The resulting value of such an addition is 0xf3, which is still 8-bit, which is fine for unsigned numbers, but since we added two values that we considered to be signed, there has to be the sign bit and there are no more bits to store the 9-bit signed result. The same would happen if we try to add two negative 8-bit values, 0x82 and 0x81. The meaning of the addition of two negative numbers is in the fact subtraction of a positive number from a negative number, which should result in an even smaller number. Thus, 0x82 + 0x81 would result in 0x103, where the ninth bit, 1, is the sign bit, but it cannot be stored in an 8-bit operand. The same applies to larger operands (16, 32, and 64-bit).

Remaining bits

The remaining 20 bits of the EFlags register are not that important for us while in user-land except, probably, the ID bit (bit #21). The ID flag indicates whether we can or cannot use the CPUID instruction.

Bits 32 - 63 of the RFlags register in long mode would be all 0s.

Rui Craveiro Jan 24, 2024

This book is really complete, but it is better as a reference material than a learning tool. The author failed to take the reader's hand and spend much more time walking him through the fundamentals instead of providing all the broad details the book provides. I suspended reading it for a more pedagogical alternative, but I will get back to it once I already master the basics.

Feefo Verified review