Windows Malware Analysis Essentials

Chapter 1. Down the Rabbit Hole

Before we start analyzing malware, we need to establish a baseline, which involves reviewing some fundamental tenets of computer science. Malware analysis is essentially an in-depth investigation of a malicious software program, usually in some binary form procured through collection channels, repositories, or infected systems, or even your own Frankenstein creations in a lab. In this book, we focus on Windows OS malware and the myriad methods and the inventory required for its analysis. Much like the time and space tradeoff in computer algorithms (and the infinite monkeys with typewriters paradigm), the analyst must be aware that, given enough time, any sample can be analyzed thoroughly; due to practical constraints, however, they must be selective in their approach so that they can leverage existing solutions to the fullest without compromising on the required details. If churning out anti-virus signatures for immediate dispersal to client systems is the priority, then the goal is to find the most distinguishing characteristic or feature in the sample. If network forensics is the order of the day, then in-depth packet traces and packet analyses must be carried out. If the malware is memory-resident, then memory forensics has to be dealt with. Likewise, when unpacking an armored sample, fixing the imports/exports table to get a running executable might not be the best use of your time: if the imports are functional in memory and the details are available, investigating the modus operandi (MO) should be the primary focus rather than memory carving, particularly if time is a factor. Perfectionism in any process has its benefits and liabilities. Malware analysis is both a science and an art. I believe it is more like a craft wherein the tools get the work done if you know how to use them creatively, like a sculptor who uses a set of mundane chisels to remove stone chips and etch a figure of fantasy out of the stone.
As any artist worth his salt would say, he is still learning his craft.

The primary topics of interest for this primer are as follows:

Number systems
Base conversion
Signed numbers and complements
Boolean logic and bit masks
Malware analysis tools
Entropy

The motivation behind these topics is simple: if these fundamentals are not clear, reading hex dumps and deciphering assembly code will be a pain in the neck. It is vital that you know these topics like the back of your hand. More importantly, I believe that understanding the concepts behind them may help you understand computers as a whole more intimately in order to deal with more complex problems later on. There is no silver bullet for malware analysis methodologies as quite a lot of problems that surface are related to computing boundaries and are NP-complete, much like an irreversible chemical process or an intractable problem. You will be using debuggers, disassemblers, monitoring software, visualization, data science, machine learning, regular expressions (automata), automation, virtualization, system administration, the software development tool chain and system APIs, and so on. Thus, you have a set of tools that enable you to peek into the coexisting layers and a set of techniques that enable you to use these tools to an optimum level. Also, you have to wear many hats—things like forensics, penetration testing, reverse engineering, and exploit research blur the line when it comes to malware technologies that are in vogue, and you have to keep up. The rest comes with experience and tons of practice (10,000 hours to mastery according to Outliers by Malcolm Gladwell). There is no shortcut to hard work, and shortcuts can be dangerous, which ironically is learned from experience many times. The primer will be quick, and it will be assumed that you have a solid understanding of the topics discussed before you read the following chapters, particularly x86/x64 assembly and disassembly. From here, you will proceed to x86/x64 assembly programming and analysis, static and dynamic malware analysis, virtualization, and analysis of various malware vectors.

Number systems

A number system is a notational scheme that uses symbols to represent quantities.

A point to ponder: We know that a quantity can be either finite or infinite. In the real world, many things around us are quantifiable and can be accounted for. Trees in a garden are finite. The population of a country is finite. In contrast, sand particles are seemingly infinite (by being intractable and ubiquitous). Star systems are seemingly infinite (by observation). The prime numbers are infinite (by proof, going back to Euclid). It is also understood that tangible and intangible things exist in nature in both finite and infinite states. A finite entity can be made infinite just by endless replication. An infinite and intangible entity can be harnessed as a finite unit by giving it a name and identity. Can you think of some examples in this universe (for example, is this one of many universes or is it the only one and infinitely expanding)?

In my experience, there is a lot of confusion regarding number systems, even with some experienced IT folk. Quantities and the representation of these quantities such as symbols/notations are primarily separate entities. A notation system and what it represents are completely different things, although because of ubiquity and visibility, the meanings are exchanged and we take it for granted that they are both one and the same, and that creates the confusion. We normally count using our fingers because it seems natural to us. We have five digits per hand and they can be utilized to count up to 10 units. So, we developed a decimal counting system. Note that the numbers 0 to 9 constitute the whole symbol set for whole numbers. While defining a symbol set, although we use the symbols that are designed through the centuries that have passed and have their place, it is not mandatory to define numbers only in that particular set. Nothing prevents us from developing our own symbol set to notate quantities.

An example symbol set = {NULL, ALPHA, TWIN, TRIPOD, TABLE}, and we can substitute pictures in the above set, which directly map to {0, 1, 2, 3, 4}. Can you think of your own symbol set?

The largest number in the symbol set (9) is one less than that of the base (10). Also, zero was a relatively late addition, with many cultures not realizing that null or nothing can also be symbolized. Using zero in a number system is the crux to developing a position-based encoding scheme. You can only occupy something where there is a void that acts as a container, so to speak. So, you can think of 0 as a container for symbols and as a placeholder for new positions. In order to count 10 objects, we reuse the first two symbols from the symbol set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. This is key to understanding number systems in vogue. What happens is that each number position/column is taken as an accumulator. You start at your right and move towards the left. The value in the first column changes in place as long as the symbol set has a symbol for it.

When the maximum value is reached, a new column toward the left is taken and the counting continues with the initial symbols in the symbol set. Thus, think of each column as a bucket that is larger than the one to its right. Further, one unit in each new column represents one more than the largest quantity that the columns to its right can hold together (in decimal, 9 + 1 = 10 and 99 + 1 = 100). Here, the number of columns used, or the number of symbol places, determines the range of quantities that can be represented. We can only use the symbols in the symbol set. If we had an infinite set of symbols, one for each quantity, we would not have to reuse symbols to represent larger quantities, but that would be very unwieldy, as we humans don't have a very good memory span.

To reiterate, think of the columns as containers. Once you run out of symbols for a particular column, you place the first symbol greater than zero in the next column to the left. Thereafter, you reset the exhausted column to zero and start increasing its symbol again until it reaches the maximum in the set. You then repeat the process until the specific quantity is represented. Study the following image to gain more understanding visually:

You can also look at the number system notation as a histogram-based notation that uses symbols instead of rectangles, wherein the quantity is represented by the total count in a compact representation. The histogram is essentially a statistical tool to find the percentage of an entity or a group of entities and other control points such as features of the entities in a population that contains the entities.

Think of it as a frequency count of a particular entity. Here, each entity refers to the base power group that each digit towards the left represents.

So, the notation can be taken as a summation of weights, where each position holds a frequency count of how many of that position's relative quantity are present. Instead of drawing 15 lines to denote 15 objects, we use the symbols 1 and 5 to denote one group of 10 and 5 more objects. The 5 takes the place of the 0 in 10, which acts as a container or placeholder, to give the combined notation of 15.

For a larger number such as 476, you can read the notation as a count of how many 100s, how many 10s, and how many units from the core symbol set are present. So, 400 = 4 * 100, meaning there are 4 hundreds; 70 = 7 * 10, meaning there are 7 tens; and there are 6 units more. You add these up because each represents a part of the total: 400 + 70 + 6 = 476.
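The breakdown of 476 into hundreds, tens, and units can be sketched in a few lines of Python. This is only an illustration; the function name positional_terms is made up for this example and is not part of any library.

```python
def positional_terms(number, base=10):
    """Split a non-negative integer into (digit, weight) pairs, most significant first."""
    if number < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    if number == 0:
        return [(0, 1)]
    terms = []
    weight = 1
    while number > 0:
        # The remainder is the digit for the current column; the quotient carries on.
        number, digit = divmod(number, base)
        terms.append((digit, weight))
        weight *= base
    return terms[::-1]


terms = positional_terms(476)
print(terms)                                          # [(4, 100), (7, 10), (6, 1)]
print(sum(digit * weight for digit, weight in terms))  # 476
```

Passing a different base, such as positional_terms(5, 3), shows the same column logic at work with a smaller symbol set.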

Can you repeat the process with a new base? Why don't you try base 3? The solution will be given later in this chapter, but you must try it yourself first.

Have you ever wondered why base 8 (octal) does not have the numbers 8 and above in its notation? Use the same rules that you have just read to reason why this notation system works the way it does. It follows the same symbol-based, position-relative notation. You can also think of this as weights being attached to the positions, with the weights growing toward the left. Finally, as each column signifies a count of the next larger quantity, you essentially sum up all the position values according to their weights.
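The weighted sum described here can be written compactly. The symbols below are notation chosen for this sketch: N is the quantity, b is the base, n is the number of columns, and d_i is the digit in column i, counted from the right starting at 0.

```latex
N = \sum_{i=0}^{n-1} d_i \, b^{i}, \qquad 0 \le d_i \le b - 1
```

For example, in decimal, 476 = 4 * 10^2 + 7 * 10^1 + 6 * 10^0, and in octal, 17 (base 8) = 1 * 8^1 + 7 * 8^0 = 15 in decimal. The constraint on d_i is also why octal has no symbol for 8: the largest digit is always one less than the base.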

We are accustomed to using the above formula as an automated response for counting numbers without ever giving much thought to the reasoning behind this notational system. It's taken for granted that you never question it.

The hexadecimal base notation also works in the same way. The reasoning behind using 16 as the base lies in the fact that arranging 2 symbols {0, 1} over a length of 4 gives us 2^4 = 16 different patterns. Since 4 bits work well as a basic block for grouping bit sequences as per engineering conventions, the nibble is the smallest block unit in computing, while the bit is the smallest individual unit. The minimum nibble value is 0000 and the largest is 1111. These 16 unique patterns are represented using the numbers 0 to 9 and the letters A to F.

You can replace the letters A to F with any shape, picture, pattern, Greek letter, or visual glyph. It's just that the letters of the alphabet are already a part of our communication framework, so it makes sense to reuse them. So, the convention of grouping 4 bits into a pattern, and writing one symbol for each such pattern, provides a much more convenient way to look at binary information. Since our visual acuity is much sharper when we form groups and patterns, this system works for engineers and analysts who need to work with binary data (in a domain-agnostic manner) on a regular basis. Note that, as per convention, hexadecimal values are prefixed with 0x or suffixed with h (or H) to denote hexadecimal notation.

The hexadecimal symbol set = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F}

Permutations are also the foundational mathematics behind data type representation. So, we have taken 4 bits to form a nibble and 8 bits to form a byte. What is a byte? Taken simply, it is a series of symbols from the set {0, 1} to a length of 8, where each particular sequence represents a quantity. It could also be used to represent a packed data type where each bit position denotes a toggle of a value as on or off, similar to an array of switches. Since we work with programming languages, the binary data types are of primary interest, as they work like an index into a table where the entire range from the minimum to the maximum is already worked out as part of a finite series. Thus, the bit pattern 00001111 gives the value 15 out of a total of 2^8 = 256 values. Why 2^8? This is because when you need to count the unique patterns that a symbol set can form over a specific length, you raise the total number of symbols to the power of the length. The largest unsigned value is one less than that count (255 for a byte), since counting starts at 0. However, you also have to take into account whether symbols may be reused and whether order matters, which is what separates the different kinds of permutations from combinations. As a rule, if symbols can be reused at every position and order matters, you take powers, as in the case of permutations with repetition. However, if using a symbol removes it from the next position's symbol set, you need to take factorials, as in the case of permutations without repetition. Combinations, in contrast, ignore order altogether. These all fall into a branch of mathematics called combinatorics. Likewise, do you see the logic behind primitive data types such as int, short, char, and float? When using custom data types, such as structs and classes, you are effectively setting up a binary data structure to denote a specific data type that could be a combination of primitive data types or user-defined ones. Since the symbol set is the same for both primitive and user-defined data types, it is the length of the data structure assigned per data type that gives meaning to the structure.
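As a rough illustration of the counting argument above, the following Python sketch computes how many patterns a given bit length yields and views a byte as an array of switches. The helper names pattern_count and bit_switches are made up for this example.

```python
def pattern_count(bit_length, symbols=2):
    """Number of distinct patterns: the symbol count raised to the power of the length."""
    return symbols ** bit_length


def bit_switches(value, width=8):
    """View an unsigned value as a list of on/off switches, most significant bit first."""
    if not 0 <= value < pattern_count(width):
        raise ValueError("value does not fit in the given width")
    return [bool((value >> position) & 1) for position in range(width - 1, -1, -1)]


for name, bits in [("nibble", 4), ("byte", 8), ("16-bit", 16), ("32-bit", 32)]:
    count = pattern_count(bits)
    print(f"{name}: {count} patterns, unsigned range 0 to {count - 1}")

print(0b00001111)                # 15
print(bit_switches(0b00001111))  # four False values followed by four True values
```

The same bit pattern is read either as the quantity 15 or as eight independent toggles; only the interpretation changes, not the bits.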

For a simple exercise, find the unique ways in which you can arrange the letters {A, B, C} to a length of 3 where each symbol can be reused, that is, each position can use any symbol from the set. Thereafter, find the unique ways in which you can arrange the symbols when each symbol is used exactly once, in any order. You will find that you get 27 patterns from the first exercise and 6 patterns from the second. Now, build a formula or try to model this pattern. You get base^(pattern length) for the first and factorial(base) for the second. This is how binary notation is used to encode quantities, which are denoted by symbols (which can also be mapped to a scheme), which in turn are based on the principles of human language; therefore, all information can be encoded in this manner.
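If you want to check your answers to the exercise, one possible sketch uses Python's standard itertools module to enumerate both cases. The variable names are chosen only for this illustration.

```python
from itertools import permutations, product
from math import factorial

symbols = ["A", "B", "C"]

# Exercise 1: every position may use any symbol (reuse allowed).
with_reuse = ["".join(p) for p in product(symbols, repeat=3)]

# Exercise 2: each symbol is used exactly once, in any order.
without_reuse = ["".join(p) for p in permutations(symbols)]

print(len(with_reuse), len(symbols) ** 3)           # 27 27
print(len(without_reuse), factorial(len(symbols)))  # 6 6
print(without_reuse)
```

The enumerated counts match the two formulas: 3^3 for arrangements with reuse and 3! for arrangements without reuse.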

Computers do not even understand the symbol 1 (ASCII 0x31) and the symbol 0 (ASCII 0x30). They only work with voltage levels and logic gates, as well as combinational and sequential circuits such as D flip-flops for memory. This complex dance is orchestrated by a clock that, much like a conductor, sets things in motion: a regular pulse of n cycles/s aids in encoding schemes in much the same way that rhythm in music brings predictability and stability, which greatly simplifies encoding/decoding and transmission. The clock works in tandem with the microprocessor, which provides built-in instructions that can be used to implement the algorithm given to it. The primary purpose of using various notation systems is that doing so makes it more manageable for us to work with circuit-based logic and provides an interface that looks familiar so that humans can communicate with the machine. It's all just various layers of abstraction.

The following table shows how the base 3 notation scheme can be worked out:

Can you write a program to display the notation of a number in bases from 2 to 16? A maximum of base 36 is possible by using the letters of the alphabet and the numbers 0 to 9 as the symbol set, after which new symbols or combinations of existing symbols have to be used to map a new symbol set. I think that this would be a great exercise in programming fundamentals.
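Try the exercise yourself first. For comparison, here is one minimal Python sketch of such a program. The names SYMBOLS and to_base are hypothetical, and the choice of 0 to 9 followed by A to Z as the symbol set is an assumption that follows the common convention.

```python
import string

# 36 symbols: the digits 0-9 followed by the letters A-Z.
SYMBOLS = string.digits + string.ascii_uppercase


def to_base(number, base):
    """Return the notation of a non-negative integer in the given base (2 to 36)."""
    if not 2 <= base <= len(SYMBOLS):
        raise ValueError("base must be between 2 and 36")
    if number < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    if number == 0:
        return SYMBOLS[0]
    digits = []
    while number > 0:
        # Each remainder is the symbol for the current column, right to left.
        number, remainder = divmod(number, base)
        digits.append(SYMBOLS[remainder])
    return "".join(reversed(digits))


for base in range(2, 17):
    print(f"base {base:2d}: {to_base(255, base)}")
```

Counting from 0 to 9 in base 3 with [to_base(n, 3) for n in range(10)] gives 0, 1, 2, 10, 11, 12, 20, 21, 22, 100, which shows a new column being taken each time the symbols {0, 1, 2} run out.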

Base conversion

You have seen how the positional notation system is used to represent quantities. How do you work with the myriad bases that are developed from this scheme? Converting decimal to binary and binary to hexadecimal or vice versa in any combination must be a workable task in order to successfully make use of the logic framework and communicate with the system.

Binary to hexadecimal (and vice versa)

This is the simplest base conversion method once you get the hang of it. Each hexadecimal digit maps directly to a specific 4-bit binary pattern. Dividing any binary pattern into groups of 4 bits, starting from the right, gives us the corresponding hexadecimal form. If the leftmost group has fewer than 4 bits, it is left padded with 0 (for instance, 11 0011 0101 gets left padded to 0011 0011 0101 in order to get 3 nibbles) so that the total length is a multiple of 4. The same applies to larger patterns whose length is not a multiple of 4. Remember that each character in the hexadecimal representation is a nibble. Hence, larger composite data types are grouped according to the data type length. A WORD has 2 bytes, and a DWORD has 4 bytes. These terms relate to data types or, for our purposes, the number of bits used to collectively represent a unit of data, which determines both the total count of patterns and the placeholders for each individual pattern. These directly map to a value in the data type range; for instance, a pattern length of 16 bits is conventionally called a WORD, which gives a total of 2^16 = 65,536 patterns, and the value 2 can be represented in 16 bits as 0000 0000 0000 0010, which directly maps to the value 2 in a range of 0 to 65,535. The processor word is normally the most fundamental data unit used in the processor architecture. In IA-32, the natural or processor word is taken as a 32-bit unit, and other data types are derived from it. It can also conventionally mean the type of an integer implemented in the architecture. Refer to https://en.wikipedia.org/wiki/Word_(computer_architecture) for a more general overview. Similarly, for any hexadecimal number, just map each of its characters to the corresponding 4-bit binary pattern and concatenate them in order to get the resulting binary sequence.

1111 1101 <-> FDh [byte data type]
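The nibble-by-nibble mapping can be sketched in Python as follows. The function names bin_to_hex and hex_to_bin are made up for this illustration, and the input is assumed to be a string of 0s and 1s, optionally separated by spaces.

```python
HEX_SYMBOLS = "0123456789ABCDEF"


def bin_to_hex(bits):
    """Convert a binary string to hexadecimal by grouping it into nibbles."""
    bits = bits.replace(" ", "")
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("expected a non-empty string of 0s and 1s")
    # Left-pad with zeros so that the length is a multiple of 4.
    padded_length = -(-len(bits) // 4) * 4
    bits = bits.zfill(padded_length)
    nibbles = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(HEX_SYMBOLS[int(nibble, 2)] for nibble in nibbles)


def hex_to_bin(hex_digits):
    """Convert a hexadecimal string to binary, one 4-bit pattern per character."""
    return " ".join(format(int(char, 16), "04b") for char in hex_digits)


print(bin_to_hex("1111 1101"))     # FD
print(bin_to_hex("11 0011 0101"))  # 335
print(hex_to_bin("FD"))            # 1111 1101
```

No arithmetic on the whole number is needed in either direction; each nibble is looked up independently, which is why this conversion is easy to do by eye.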

Decimal to binary (and vice versa)

Binary to decimal is achieved by adding up the weights (powers of 2) of each bit position that is set.
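A minimal Python sketch of this weight-summing approach follows. The function name bin_to_dec is hypothetical; Python's built-in int(bits, 2) does the same job and is used here only as a cross-check.

```python
def bin_to_dec(bits):
    """Sum the weight 2**position for every bit that is set, counting from the right."""
    bits = bits.replace(" ", "")
    total = 0
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
        elif bit != "0":
            raise ValueError("expected only 0s and 1s")
    return total


print(bin_to_dec("1001"))       # 9
print(bin_to_dec("0000 1111"))  # 15
print(bin_to_dec("1001") == int("1001", 2))  # True
```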

Decimal to binary requires you to divide the number by 2 and record the remainder (1 if there is a remainder, 0 if there is none) after every step, and then repeat the division on the quotient until the quotient reaches 0. Essentially, you push the remainder of each step onto a stack data structure and read them back in reverse to get the resulting binary value.

For instance, to convert 9 decimal to binary, notice the modulus or remainder:

9/2 = 4 with remainder 1
4/2 = 2 with remainder 0
2/2 = 1 with remainder 0
1/2 = 0 with remainder 1

Reading in reverse, that is, bottom to top, we get 1001, which, if you multiply each place by its power of 2, yields 1 * 2^3 + 0 * 2^2 + 0 * 2^1 + 1 * 2^0 = 8 + 1 = 9. Mapping 1001 to hexadecimal still gives you 0x9; only quantities above 9 use letters as symbols.
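The repeated-division steps for 9 can be reproduced with a short Python sketch that also keeps a trace of each step. The function name dec_to_bin is made up for this example.

```python
def dec_to_bin(number):
    """Return (binary string, list of (dividend, quotient, remainder) steps)."""
    if number < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    if number == 0:
        return "0", []
    steps = []
    remainders = []  # used like a stack: the last remainder becomes the first bit
    while number > 0:
        quotient, remainder = divmod(number, 2)
        steps.append((number, quotient, remainder))
        remainders.append(str(remainder))
        number = quotient
    return "".join(reversed(remainders)), steps


bits, steps = dec_to_bin(9)
for dividend, quotient, remainder in steps:
    print(f"{dividend}/2 = {quotient} with remainder {remainder}")
print(bits)  # 1001
```

The loop stops when the quotient reaches 0, and reversing the recorded remainders gives the bits from most significant to least significant.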

These two methods, dividing by the base until the quotient reaches 0 while recording the remainders, and adding up the digits multiplied by integer powers of the base, are the most prevalent in computing and work with every base that subscribes to this positional notation system.

Try converting decimal values to hexadecimal and back. (Hint: To go from decimal to hexadecimal, divide by 16 and record the remainders. To go back, multiply the decimal value of each hexadecimal character (its nibble) by the corresponding power of 16 and add the results.)

Octal base conversion

Octal is a legacy form and is not used much nowadays in our current technological setup. However, you now know how to deal with it. The simple way to break a binary pattern into its octal representation is to group the bits into groups of three, starting from the right, and write the digit for each pattern. Why 3, you ask? It is octal, so 8 is the base of the notation. Taking a binary pattern of length 3 and setting every bit position to 1 gives 111, which is 7 in decimal. This is the maximum value that will be represented by the symbol set (remember how the positional/placeholder-based notation works). Thus, binary patterns of length 3 are enough to realize the entire symbol set of the octal base. Hence, you start by grouping bits into groups of length 3.
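The grouping into threes can be sketched in Python in the same spirit as the nibble grouping used for hexadecimal. The function name bin_to_oct is hypothetical, and the input is assumed to be a string of 0s and 1s.

```python
def bin_to_oct(bits):
    """Convert a binary string to octal by grouping bits in threes from the right."""
    bits = bits.replace(" ", "")
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("expected a non-empty string of 0s and 1s")
    # Left-pad with zeros so that the length is a multiple of 3.
    padded_length = -(-len(bits) // 3) * 3
    bits = bits.zfill(padded_length)
    groups = [bits[i:i + 3] for i in range(0, len(bits), 3)]
    # Each 3-bit group has a value from 0 to 7, which is exactly one octal digit.
    return "".join(str(int(group, 2)) for group in groups)


print(bin_to_oct("11111101"))  # 375
print(bin_to_oct("111"))       # 7
```

Here 11111101 is padded to 011 111 101, which reads as 3, 7, and 5, so the byte FDh is 375 in octal.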