Python for Secret Agents


About this book

Python is an easy-to-learn and extensible programming language that allows secret agents to work with a wide variety of data in a number of ways. It gives beginners a simple way to start programming, but Python's standard library also provides numerous packages that allow Python-using secret agents to easily utilize very sophisticated information processing.

This book will guide new field agent trainees through putting together a Python-based toolset to gather, analyze, and communicate data. It starts by covering the basics and then moves on to sections such as file exchange, image processing, geocoding, simple trigonometry, and more sensitive statistical processing. You will then learn how to use polynomials to encode and decode data in different representations. Furthermore, this book shows you how to add tools to a Python environment, work with images, and parse HTML web pages to extract meaningful data. The idea of adding packages to Python is central to how an agent will leverage these tools for data processing.

Publication date:
August 2014
Publisher
Packt
Pages
216
ISBN
9781783980420

 

Chapter 1. Our Espionage Toolkit

The job of espionage is to gather and analyze data. This requires us to use computers and software tools.

The ordinary desktop tools (word processor, spreadsheet, and so on) will not measure up for the kinds of jobs we need to tackle. For serious data gathering, analysis, and dissemination, we need more powerful tools. As we look at automating our data gathering, we can't easily use a desktop tool that requires manual pointing and clicking. We want a tool that can be left to run autonomously working for us without anyone sitting at a desk.

One of the most powerful data analysis tools we can use is Python. We're going to step through a series of examples of real data collection and analysis using Python. This chapter will cover the following topics:

  • Firstly, we're going to download and install the latest and greatest Python release.

  • We're going to supplement Python with the easy_install (or pip) tools to help us gather additional software tools in the later chapters.

  • We'll look at Python's internal help system briefly.

  • We'll look at how Python works with numbers. After all, a secret agent's job of collecting data is going to involve numbers.

  • We'll spend some time on the first steps of writing Python applications. We'll call our applications scripts as well as modules.

  • We'll break file input and output into several sections. We will have a quick overview as well as an in-depth look at ZIP archive files. In the later chapters, we'll look at more kinds of files.

  • Our big mission will be to apply our skills for recovering a lost password for a ZIP file. This won't be easy, but we should have covered enough of the basics to be successful.

This will give us enough Python skills so that we can advance to more complex missions in the next chapters.

 

Getting the tools of the trade – Python 3.3


The first step toward using Python is getting the Python language onto our computer. If your computer uses Mac OS X or Linux, you may already have Python available. At the time of this writing, it's Python 2.7, not 3.3. In this case, we will need to install Python 3.3 in addition to the Python 2.7 we already have.

Windows agents generally don't have any version of Python, and need to install Python 3.3.

Tip

Python 3 is not Python 2.7 plus a few features. Python 3 is a distinct language. We don't cover Python 2.7 in this book. The examples won't work in Python 2. Really.

Python downloads are available at http://www.python.org/download/.

Locate the proper version for your computer. There are many pre-built binaries for Windows, Linux, and Mac OS X. Linux agents should focus on the binary appropriate to their distribution. Each download and install will be a bit different; we can't cover all the details.

There are several implementations of Python. We'll focus on CPython. For some missions, you might want to look at Jython, which is Python implemented using the Java Virtual Machine. If you're working with other .NET products, you might need Iron Python. In some cases, you might be interested in PyPy, which is Python implemented in Python. (And, yes, it seems circular and contradictory. It's really interesting, but well outside our focus.)

Python isn't the only tool. It's the starting point. We'll be downloading additional tools in the later chapters. It seems like half of our job as a secret agent is locating the tools required to crack open a particularly complex problem. The other half is actually getting the information.

Windows secrets

Download the Windows installer for Python 3.3 (or higher) for your version of Windows. When you run the installer, you'll be asked a number of questions about where to install it and what to install.

It's essential that Python be installed in a directory with a simple name. Under Windows, the common choice is C:\Python33. Using the Windows directories with spaces in their name (Program Files, My Documents, or Documents and Settings) can cause confusion.

Be sure to install the Tcl/Tk components too. This will assure that you have all the elements required to support IDLE. IDLE is a handy text editor that comes with Python. For Windows agents, this is generally bundled into the installation kit. All you need to do is be sure it has a check mark in the installation wizard.

With Windows, the python.exe program doesn't have a version number as part of its name. This is atypical.

Mac OS X secrets

In Mac OS X, there is already a Python installation, usually Python 2.7. This must be left intact.

Download the Mac OS X installer for Python 3.3 (or higher). When you run this, you will be adding a second version of Python. This means that many add-on modules and tools must be associated with the proper version of Python. This requires a little bit of care. It's not good to use a vaguely named tool like easy_install. It's important to use the more specific easy_install-3.3, which identifies the version of Python you're working with.

The program named python is usually going to be an alias for python2.7. This, too, must be left intact. We'll always explicitly use python3 (also known as python3.3) for this book. You can confirm this by using the shell command.

Note that there are several versions of Tcl/Tk available for Mac OS X. The Python website directs you to a specific version. At the time of this writing, this version was ActiveTCL 8.5.14 from ActiveState software. You'll need to install this, too. This software allows us to use IDLE.

Visit http://www.activestate.com/activetcl/downloads for the proper version.

 

Getting more tools – a text editor


To create Python applications, we're going to need a proper programmer's text editor. A word processor won't do because the files created by a word processor are too complex. We need simple text. Our emphasis is on simplicity. Python 3 works in Unicode without bolding or italicizing the content. (Python 2 didn't work as well with Unicode. This is one of the reasons to leave it behind.)

If you've worked with text editors or integrated development environments (IDEs), you may already have a favorite. Feel free to use it. Some of the popular IDEs have Python support.

Python is called a dynamic language. It's not always simple to determine what names or keywords are legal in a given context. The Python compiler doesn't perform a lot of static checking. An IDE can't easily prompt with a short list of all legal alternatives. Some IDEs do take a stab at having logical suggestions, but they're not necessarily complete.

If you haven't worked with a programmer's editor (or an IDE), your first mission is to locate a text editor you can work with. Python includes an editor named IDLE that is easy to use. It's a good place to start.

The Active State Komodo Edit might be suitable (http://komodoide.com/komodo-edit/). It's a lightweight version of a commercial product. It's got some very clever ways to handle the dynamic language aspects of Python.

There are many other code editors. Your first training mission is to locate something you can work with. You're on your own. We have faith in you.

Getting other developer tools

Most GNU/Linux agents have various C compilers and other developer tools available. Many Linux distributions are already configured to support developers, so the tools are already there.

Mac OS X agents will usually need Xcode. Get it from https://developer.apple.com/xcode/downloads/. Every Mac OS X agent should have this.

When installing this, be sure to also install the command line developer tools. This is another big download above and beyond the basic Xcode download.

Windows agents will generally find that pre-built binaries are available for most packages of interest. If, in a rare case, that pre-built code isn't available, tools such as Cygwin may be necessary. See http://www.cygwin.com.

Getting a tool to get more Python components

In order to effectively and simply download additional Python modules, we often use a tool to get tools. There are two popular ways to add Python modules: PIP and the easy_install script.

To install easy_install, go to https://pypi.python.org/pypi/setuptools/3.6.

The setuptools package will include the easy_install script, which we'll use to add modules to Python.

If you've got multiple versions of Python installed, be sure to download and then install the correct easy_install version for Python 3.3. This means that you'll generally be using the easy_install-3.3 script to add new software tools.

To install PIP, go to https://pypi.python.org/pypi/pip/1.5.6.

We'll be adding the Pillow package in Chapter 3, Encoding Secret Messages with Steganography. We'll also be adding the Beautiful Soup package in Chapter 4, Drops, Hideouts, Meetups, and Lairs.

The Python 3.4 distribution should include the PIP tool. You don't need to download it separately.

 

Confirming our tools


To be sure we have a working Python tool, it's best to check things from the command prompt. We're going to do a lot of our work using the command prompt. It involves the least overhead and is the most direct connection to Python.

The Python 3.3 program shows a startup message that looks like this:

MacBookPro-SLott:Secret Agent's Python slott$ python3
Python 3.3.4 (v3.3.4:7ff62415e426, Feb  9 2014, 00:29:34)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

We've shown the operating system's prompt (MacBookPro-SLott:Secret Agent's Python slott$), the command we entered (python3), and the response from Python.

Python provides three lines of introduction followed by its own >>> prompt. The first line shows that it's Python 3.3.4. The second line shows the tools used to build Python (GCC 4.2.1). The third line provides some hints about things we might do next.

We've interacted with Python. Things are working.
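The same confirmation can be made from inside Python itself. The following is a minimal sketch, not part of the original toolkit; the function name is_python3 is our own invention. It relies only on the standard sys module.

```python
import sys

def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple is Python 3 or later."""
    return version_info[0] >= 3

# sys.version is the same text shown on the first startup lines.
print(sys.version)
print("Python 3?", is_python3())
```

If the second line of output says False, we've started the wrong Python, most likely the Python 2.7 that came with the operating system.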

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Feel free to enter copyright, credits, and license at the >>> prompt. They may be boring, but they serve as confirmation that things are working.

It's important to note that these objects (copyright, credits, and license) are not commands or verbs in the Python language. They're global variables that were created as a convenience for first-time Python agents. When evaluated, they display blocks of text.

There are two other startup objects we'll use: exit and help. These provide little messages that remind us to use the help() and exit() functions.

How do we stop?

We can always enter exit to get a reminder on how to exit from interactive Python, as follows:

>>> exit

Use exit() or Ctrl + D (that is EOF (end-of-file)) to exit.

Windows agents will see that they must use Ctrl + Z and Return to exit.

Python is a programming language that also has an interactive prompt of >>>. To confirm that Python is working, we're responding to that prompt, using a feature called the Read-Execute-Print Loop (REPL).

In the long run, we'll be writing scripts to process our data. Our data might be an image or a secret message. The end of a script file will exit from Python. This is the same as pressing Ctrl + D (or Ctrl + Z and Return) to send the EOF sequence.

Using the help() system

Python has a help mode, which is started with the help() function. The help() function provides help on a specific topic. Almost anything you see in Python can be the subject of help.

For pieces of Python syntax, such as the + operator, you'll need to use a string; that is, you should enclose the syntax in quotes. For example, help("+") will provide detailed help on operator precedence.

For other objects (such as numbers, strings, functions, classes, and modules) you can simply ask for help on the object itself; quotes aren't used. Python will locate the class of the object and provide help on the class.

For example, help(3) will provide a lot of detailed, technical help on integers, as shown in the following snippet:

>>> help(3)
Help on int object:
class int(object)
|  int(x=0) -> integer
|  int(x, base=10) -> integer
|
etc.

When using the help() function from the command line, the output will be presented in pages. At the end of the first page of output, we see a new kind of non-Python prompt. This is usually :, but on Windows it may be -- More --.

Python normally prompts us with >>> or .... A non-Python prompt must come from one of the help viewers.

Mac OS and GNU/Linux secrets

In POSIX-compatible OSes, we'll be interacting with a program named less; it will prompt with : for all but the last page of your document. For the last page, it will prompt with (END).

This program is very sophisticated; you can read more about it on Wikipedia at http://en.wikipedia.org/wiki/Less_(Unix).

The four most important commands are as follows:

  • q: This command is used to quit the less help viewer

  • h: This command is used to get help on all the commands that are available

  • ˽: This command is used to enter a space to see the next page

  • b: This command is used to go back one page

Windows secrets

In Windows, we'll usually interact with a program named more; it will prompt you with -- More --. You can read more about it on Wikipedia from http://en.wikipedia.org/wiki/More_(command).

The three important commands here are: q, h, and ˽.

Using the help mode

When we enter help() with no object or string value, we wind up in help mode. This uses Python's help> prompt to make it very clear that we're getting help, not entering Python statements. To go back to ordinary Python programming mode, enter quit.

The prompt then changes back to >>> to confirm that we can go back to entering code.

Your next training mission is to experiment with the help() function and help mode before we can move on.

 

Background briefing – math and numbers


We'll review basics of Python programming before we start any of our more serious missions. If you already know a little Python, this should be a review. If you don't know any Python, this is just an overview and many details will be omitted.

If you've never done any programming before, this briefing may be a bit too brief. You might want to get a more in-depth tutorial. If you're completely new to programming, you might want to look over this page for additional tutorials: https://wiki.python.org/moin/BeginnersGuide/NonProgrammers. For more help to start with expert Python programming, go to http://www.packtpub.com/expert-python-programming/book.

The usual culprits

Python provides the usual mix of arithmetic and comparison operators. However, there are some important wrinkles and features. Rather than assuming you're aware of them, we'll review the details.

The conventional arithmetic operators are: +, -, *, /, //, %, and **. There are two variations on division: an exact division (/) and an integer division (//). You must choose whether you want an exact, floating-point result, or an integer result:

>>> 355/113
3.1415929203539825
>>> 355//113
3
>>> 355.0/113.0
3.1415929203539825
>>> 355.0//113.0
3.0

The exact division (/) produces a float result from two integers. The integer division produces an integer result. When we use float values, we expect exact division to produce float. Even with two floating-point values, the integer division produces a rounded-down floating-point result.

We have this extra division operator to avoid having to use wordy constructs such as int(a/b) or math.floor(a/b).
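One detail is worth checking before treating these constructs as interchangeable: they agree for positive values, but int() truncates toward zero while // and math.floor() round down. Here is a small sketch with values we picked for illustration:

```python
import math

a, b = -7, 2

truncated = int(a / b)        # int() drops the fraction, rounding toward zero
floored = math.floor(a / b)   # floor() rounds toward negative infinity
integer_div = a // b          # // also rounds toward negative infinity

# divmod() returns the // quotient and the % remainder together.
quotient, remainder = divmod(a, b)

print(truncated, floored, integer_div, quotient, remainder)
```

For negative numbers, a // b matches math.floor(a / b), not int(a / b).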

Beyond conventional arithmetic, there are some additional bit fiddling operators that are available: &, |, ^, >>, <<, and ~. These operators work on integers (and sets). These are emphatically not Boolean operators; they don't work on the narrow domain of True and False. They work on the individual bits of an integer.

We'll use binary values with the 0b prefix to show what the operators do, as shown in the following code. We'll look at details of this 0b prefix later.

>>> bin(0b0101 & 0b0110)
'0b100'
>>> bin(0b0101 ^ 0b0110)
'0b11'
>>> bin(0b0101 | 0b0110)
'0b111'
>>> bin(~0b0101)
'-0b110'

The & operator does bitwise AND. The ^ operator does bitwise exclusive OR (XOR). The | operator does inclusive OR. The ~ operator is the complement of the bits. The result has many 1 bits and is shown as a negative number.
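The negative result of ~ can be puzzling. As a sketch of what's going on (the variable names are ours), the complement of x is always -x-1, and we can use & with a mask to see only the low-order bits:

```python
x = 0b0101

complement = ~x                 # every bit flipped; Python shows this as a negative number
identity_holds = (~x == -x - 1)

# Keep only the low four bits of the complement by masking with 0b1111.
low_four_bits = ~x & 0b1111

print(complement, identity_holds, bin(low_four_bits))
```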

The << and >> operators are for doing left and right shifts of the bits, as shown in the following code:

>>> bin( 0b110 << 4 )
'0b1100000'
>>> bin( 0b1100000 >> 3 )
'0b1100'

It may not be obvious, but shifting a value left by x bits is like multiplying it by 2**x, except it may operate faster. Similarly, shifting right by b bits amounts to integer division by 2**b.
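We can confirm the relationship between shifting and powers of two with a quick check. This is only an illustrative sketch that reuses the values from the previous example:

```python
value = 0b110

shifted_left = value << 4
multiplied = value * 2**4

shifted_right = 0b1100000 >> 3
divided = 0b1100000 // 2**3

# Right shifts round down, just like //, even for negative numbers.
negative_shift = -7 >> 1
negative_div = -7 // 2

print(shifted_left, multiplied, shifted_right, divided, negative_shift, negative_div)
```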

We also have all of the usual comparison operators: <, <=, >, >=, ==, and !=.

In Python, we can combine comparison operators without including the and operator:

>>> 7 <= 11 < 17
True
>>> 7 <= 11 and 11 < 17
True

This simplification really does implement our conventional mathematical understanding of how comparisons can be written. We don't need to say 7 <= 11 and 11 < 17.
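A common use of a chained comparison is a range check. The helper below is a hypothetical example of ours, not something from the standard library:

```python
def in_range(low, value, high):
    """True when low <= value < high, written as one chained comparison."""
    return low <= value < high

chained = 7 <= 11 < 17
spelled_out = 7 <= 11 and 11 < 17

print(chained, spelled_out, in_range(7, 11, 17), in_range(7, 17, 17))
```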

There's another comparison operator that's used in some specialized situations: is. The is operator will appear, for now, to be the same as ==. Try it. 3 is 3 and 3 == 3 seem to do the same thing. Later, when we start using the None object, we'll see the most common use for the is operator. For more advanced Python programming, there's a need to distinguish between two references to the same object (is) and two objects which claim to have the same value (==).
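Small integers can hide the difference between is and ==. Lists make it easier to see. The following sketch uses names of our own choosing:

```python
first = [1, 2, 3]
second = [1, 2, 3]   # a separate list that happens to hold equal values
alias = first        # a second name for the very same list

same_value = (first == second)
same_object = (first is second)
alias_is_first = (alias is first)

# The most common use of is: checking for the None object.
result = None
nothing_found = (result is None)

print(same_value, same_object, alias_is_first, nothing_found)
```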

The ivory tower of numbers

Python gives us a variety of numbers, plus the ability to easily add new kinds of numbers. We'll focus on the built-in numbers here. Adding new kinds of numbers is the sort of thing that takes up whole chapters in more advanced books.

Python ranks the numbers into a kind of tower. At the top are numbers with the fewest features. Each subclass extends that number with more and more features. We'll look at the tower from the bottom up, starting with the integers that have the most features, and moving towards the complex numbers that have the fewest features. The following sections cover the various kinds of numbers we'll need to use.

Integer numbers

We can write integer values in base 10, 16, 8, or 2. Base 10 numbers don't need a prefix; the other bases use a simple two-character prefix, as shown in the following snippet:

48813
0xbead 
0b1011111010101101
0o137255

We also have functions that will convert numbers into handy strings in different bases. We can use the hex(), oct(), and bin() functions to see a value in base 16, 8, or 2.
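As a quick sketch, here are those three functions applied to the value from the previous snippet. We've also included the built-in int() function, which goes the other way and turns a string of digits in a given base back into a number:

```python
value = 48813

as_hex = hex(value)
as_oct = oct(value)
as_bin = bin(value)
print(as_hex, as_oct, as_bin)

# int() accepts a base as its second argument.
from_hex = int('bead', 16)
from_bin = int('1011111010101101', 2)
# Base 0 asks int() to work out the base from the prefix.
from_prefixed = int('0o137255', 0)
```

All four literals shown earlier are the same number written four ways.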

The question of integer size is common. Python integers don't have a maximum size. They're not artificially limited to 32 or 64 bits. Try this:

>>> 2**256
115792089237316195423570985008687907853269984665640564039457584007913129639936

Large numbers work. They may be a bit slow, but they work perfectly fine.

Rational numbers

Rational numbers are not commonly used. They must be imported from the standard library. We must import the fractions.Fraction class definition. It looks like this:

>>> from fractions import Fraction

Once we have the Fraction class defined, we can use it to create numbers. Let's say we were sent out to track down a missing device. Details of the device are strictly need-to-know. Since we're new agents, all that HQ will release to us is the overall size in square inches.

Here's an exact calculation of the area of a device we found. It is measured as 4⅞" multiplied by 2¼":

>>> length=4+Fraction("7/8")
>>> width=2+Fraction("1/4")
>>> length*width
Fraction(351, 32)

Okay, the area is 351/32, which is—what?—in real inches and fractions.

We can use Python's divmod() function to work this out. The divmod() function gives us a quotient and a remainder, as shown in the following code:

>>> divmod(351,32)
(10, 31)

The quotient and remainder tell us the area is 10 31/32 square inches, just under 11. The measurements were about 5 × 2, so the value seems to fit within our rough approximation. We can transmit that as the proper result. If we found the right device, we'll be instructed on what to do with it. Otherwise, we might have blown the assignment.
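Rather than retyping 351 and 32, we can pull the pieces out of the Fraction object. This is a minimal sketch; the names whole and leftover are ours, while numerator and denominator are attributes of every Fraction:

```python
from fractions import Fraction

length = 4 + Fraction(7, 8)
width = 2 + Fraction(1, 4)
area = length * width

# Split the improper fraction into a whole number and a proper fraction.
whole, leftover = divmod(area.numerator, area.denominator)
fraction_part = Fraction(leftover, area.denominator)

print(whole, fraction_part)
print(float(area))
```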

Floating-point numbers

We can write floating-point values in common or scientific notation as follows:

3.1415926
6.22E12

The presence of the decimal point distinguishes an integer from a float.

These are ordinary double-precision floating-point numbers. It's important to remember that floating-point values are only approximations. They usually have a 64-bit implementation.
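The classic demonstration of the approximation is adding 0.1 and 0.2. The sketch below is illustrative only; the tolerance of 1e-9 is an arbitrary choice of ours, not a recommended standard:

```python
import sys

total = 0.1 + 0.2

exactly_equal = (total == 0.3)
close_enough = abs(total - 0.3) < 1e-9   # compare with a tolerance instead

print(total, exactly_equal, close_enough)

# Number of bits in the mantissa of a float on this platform.
print(sys.float_info.mant_dig)
```

Neither 0.1 nor 0.2 has an exact binary representation, so the sum is very slightly off from 0.3.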

If you're using CPython, they're explicitly based on the C compiler that was shown in the sys.version startup message. We can also get information from the platform package as shown in the following code snippet:

>>> import platform
>>> platform.python_build()
('v3.3.4:7ff62415e426', 'Feb  9 2014 00:29:34')
>>> platform.python_compiler()
'GCC 4.2.1 (Apple Inc. build 5666) (dot 3)'

This tells us which compiler was used. That, in turn, can tell us what floating-point libraries were used. This may help determine which underlying mathematical libraries are in use.

Decimal numbers

We need to be careful with money. Words to live by: the accountants watching over spies are a tight-fisted bunch.

What's important is that floating-point numbers are an approximation. We can't rely on approximations when working with money. For currency, we need exact decimal values; nothing else will do. Decimal numbers can be used with the help of an extension module. We'll import the decimal.Decimal class definition to work with currency. It looks like this:

>>> from decimal import Decimal

The informant we bribed to locate the device wants to be paid 50,000 Greek Drachma for the information on the missing device. When we submit our expenses, we'll need to include everything, including the cab fare (23.50 dollars) and the expensive lunch we had to buy her (12,900 GRD).

Why wouldn't the informant accept Dollars or Euros? We don't want to know, we just want their information. Recently, Greek Drachma were trading at 247.616 per dollar.

What's the exact budget for the information? In drachma and dollars?

First, we will define the currency conversion rate, exact to the mil (1/1000 of a dollar):

>>> conversion=Decimal("247.616")
>>> conversion
Decimal('247.616')

The tab for our lunch, converted from drachma to dollars, is calculated as follows:

>>> lunch=Decimal("12900")
>>> lunch/conversion
Decimal('52.09679503747738433703799431')

What? How is that mess going to satisfy the accountants?

All those digits are a consequence of exact division: we get a lot of decimal places of precision; not all of them are really relevant. We need to formalize the idea of rounding off the value so that the government accountants will be happy. The nearest penny will do. In the Decimal class, we'll use the quantize() method. The term quantize refers to rounding up, rounding down, and truncating a given value. The decimal module offers a number of quantizing rules. The default rule is ROUND_HALF_EVEN: round to the nearest value; in the case of a tie, prefer the even value. The code looks as follows:

>>> penny=Decimal('.00')
>>> (lunch/conversion).quantize(penny)
Decimal('52.10')

That's much better. How much was the bribe we needed to pay?

>>> bribe=50000
>>> (bribe/conversion).quantize(penny)
Decimal('201.93')

Notice that the division involved an integer and a decimal. Python's definition of decimal will quietly create a new decimal number from the integer so that the math will be done using decimal objects.

The cab driver charged us US Dollars. We don't need to do much of a conversion. So, we will add this amount to the final amount, as shown in the following code:

>>> cab=Decimal('23.50')

That gets us to the whole calculation: lunch plus bribe, converted, plus cab.

>>> ((lunch+bribe)/conversion).quantize(penny)+cab
Decimal('277.52')

Wait. We seem to be off by a penny. Why didn't we get 277.53 dollars as an answer?

Rounding. Each individual amount (52.10 and 201.93) had a fraction of a penny rounded up. (The more detailed values were 52.097 and 201.926.) When we computed the sum of the drachma before converting, the total (254.022 dollars) was rounded just once, and it rounded down. It didn't include the two separately rounded-up fractions of a penny.

We have a very fine degree of control over this. There are a number of rounding schemes, and there are a number of ways to define when and how to round. Also, some algebra may be required to see how it all fits together.
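To make the choices concrete, here's a sketch that applies three of the decimal module's rounding rules to a value sitting exactly on a tie, and then repeats the round-each versus round-once comparison from above. The sample amount 2.665 is ours; the rule names are constants from the decimal module:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP, ROUND_DOWN

penny = Decimal('.00')
amount = Decimal('2.665')   # exactly halfway between 2.66 and 2.67

half_even = amount.quantize(penny, rounding=ROUND_HALF_EVEN)  # tie goes to the even digit
half_up = amount.quantize(penny, rounding=ROUND_HALF_UP)      # tie goes up
truncated = amount.quantize(penny, rounding=ROUND_DOWN)       # extra digits dropped
print(half_even, half_up, truncated)

# When we round matters as much as how we round.
conversion = Decimal('247.616')
lunch, bribe = Decimal('12900'), Decimal('50000')
round_each = (lunch/conversion).quantize(penny) + (bribe/conversion).quantize(penny)
round_once = ((lunch + bribe)/conversion).quantize(penny)
print(round_each, round_once)
```

The two totals differ by the penny we were missing.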

Complex numbers

We also have complex numbers in Python. They're written with two parts: a real and an imaginary value, as shown in the following code:

>>> 2+3j
(2+3j)

If we mix complex values with most other kinds of numbers, the results will be complex. The exception is decimal numbers. But why would we be mixing engineering data and currency? If any mission involves scientific and engineering data, we have a way to deal with the complex values.
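Here's a short sketch of that mixing behavior, using values of our own. Integers, floats, and fractions combine with complex values; a Decimal refuses and raises a TypeError:

```python
from decimal import Decimal
from fractions import Fraction

z = 2 + 3j

int_mix = 2 + z
float_mix = 1.5 * z
fraction_mix = Fraction(1, 2) + z
print(int_mix, float_mix, fraction_mix)

# Decimal won't mix with complex.
try:
    Decimal('1.5') + z
    decimal_mix_failed = False
except TypeError:
    decimal_mix_failed = True
print("Decimal and complex refused to mix:", decimal_mix_failed)
```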

Outside the numbers

Python includes a variety of data types, which aren't numbers. In the Handling text and strings section, we'll look at Python strings. We'll look at collections in Chapter 2, Acquiring Intelligence Data.

Boolean values, True and False, form their own little domain. We can extract a Boolean value from most objects using the bool() function. Here are some examples:

>>> bool(5)
True
>>> bool(0)
False
>>> bool('')
False
>>> bool(None)
False
>>> bool('word')
True

The general pattern is that most objects have a value True and a few exceptional objects have a value False. Empty collections, 0, and None have a value False. Boolean values have their own special operators: and, or, and not. These have an additional feature. Here's an example:

>>> True and 0
0
>>> False and 0
False

When we evaluate True and 0, both sides of the and operator are evaluated; the right-hand value was the result. But when we evaluated False and 0, only the left-hand side of and was evaluated. Since it was already False, there was no reason to evaluate the right-hand side.

The and and or operators are short-circuit operators. If the left side of and is False, that's sufficient and the right-hand side is ignored. If the left-hand side of or is True, that's sufficient and the right-hand side is ignored.
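We can watch the short-circuit happen by using a function that records each time it's called. This is a sketch; the noisy() function and the calls list are hypothetical helpers of ours:

```python
calls = []

def noisy(value):
    """Record that we were evaluated, then hand the value back."""
    calls.append(value)
    return value

result_and = False and noisy(1)   # left side is False; noisy(1) never runs
result_or = True or noisy(2)      # left side is True; noisy(2) never runs
result_both = True and noisy(0)   # left side is True; the right side decides

# A common idiom: fall back to a default when a value is empty.
codename = '' or 'unknown'

print(calls, result_and, result_or, result_both, codename)
```

Only noisy(0) was ever evaluated, so calls contains a single entry.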

Python's rules for evaluation follow mathematical practice closely. Arithmetic operations have the highest priority. Comparison operators have a lower priority than arithmetic operations. The logical operators have a very low priority. This means that a+2 > b/3 or c==15 will be done in phases: first the arithmetic, then the comparison, and finally the logic.

Arithmetic operators follow the conventional mathematical rules. ** has a higher priority than *, /, //, or %. The + and - operators come next. When we write 2*3+4, the 2*3 operation must be performed first. The bit fiddling is even lower in priority. When you have a sequence of operations of the same priority (a+b+c), the computations are performed from left to right. Of course, if there's any doubt, it's sensible to use parentheses.
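A few quick experiments can confirm these priorities. The expressions and sample values below are our own; each one is paired with a fully parenthesized version that shows how Python groups it:

```python
multiply_first = (2*3+4 == (2*3)+4)
power_first = (2*3**2 == 2*(3**2))
shift_is_lower = (1 << 2 + 1 == 1 << (2 + 1))   # + happens before <<
left_to_right = (10 - 4 - 3 == (10 - 4) - 3)

# Arithmetic, then comparison, then logic.
a, b, c = 1, 9, 15
phased = a+2 > b/3 or c == 15
explicit = ((a+2) > (b/3)) or (c == 15)

print(multiply_first, power_first, shift_is_lower, left_to_right, phased, explicit)
```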

Assigning values to variables

We've been using the REPL feature of our Python toolset. In the long run, this isn't ideal. We'll be much happier writing scripts. The point behind using a computer for intelligence gathering is to automate data collection. Our scripts will require assignment to variables. They will also require explicit output and input.

We've shown the simple, obvious assignment statement in several examples previously. Note that we don't declare variables in Python. We simply assign values to variables. If the variable doesn't exist, it gets created. If the variable does exist, the previous value is replaced.

Let's look at some more sophisticated technology for creating and changing variables. We have multiple assignment statements. The following code will assign values to several variables at once:

>>> length, width = 2+Fraction(1,4), 4+Fraction(7,8)
>>> length
Fraction(9, 4)
>>> width
Fraction(39, 8)
>>> length >= width
False

We've set two variables, length and width. However, we also made a small mistake. The length isn't the larger value; we've switched the values of length and width. We can swap them very simply using a multiple assignment statement as follows:

>>> length, width = width, length
>>> length
Fraction(39, 8)
>>> width
Fraction(9, 4)

This works because the right-hand side is computed in its entirety. In this case, it's really simple. Then all of the values are broken down and assigned to the named variables. Clearly, the number of values on the right has to match the number of variables on the left or this won't work.
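Here's a sketch of what "won't work" looks like in practice. When the counts don't match, Python raises a ValueError, which we catch here so the script can carry on:

```python
from fractions import Fraction

length, width = 2 + Fraction(1, 4), 4 + Fraction(7, 8)
length, width = width, length     # the whole right side is built before assigning

try:
    length, width = 1, 2, 3       # three values, only two variables
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print("Assignment failed:", error)

print(length, width)
```

Because the failed assignment never happens, length and width keep their swapped values.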

We also have augmented assignment statements. These couple an arithmetic operator with the assignment statement. The following code uses +=, assignment augmented with addition, to compute a sum from various bits and pieces:

>>> total= 0
>>> total += (lunch/conversion).quantize(penny)
>>> total += (bribe/conversion).quantize(penny)
>>> total += cab
>>> total
Decimal('277.53')

We don't have to write total = total +.... Instead, we can simply write total += .... It's a nice clarification of what our intent is.

All of the arithmetic operators are available as augmented assignment statements. We might have a hard time finding a use for %= or **=, but the statements are part of the language.

The idea of a nice clarification should lead to some additional thinking. For example, the variable named conversion is a perfectly opaque name. Secrecy for data is one thing: we'll look at ways to encrypt data. Obscurity through shabby processing of that data often leads to a nightmarish mess. Maybe we should have called it something that defines more clearly what it means. We'll revisit this problem of obscurity in some examples later on.

Writing scripts and seeing output

Most of our missions will involve gathering and analyzing data. We won't be creating a very sophisticated User Interface (UI). Python has tools for building websites and complex graphical user interfaces (GUIs). The complexity of those topics leads to entire books to cover GUI and web development.

We don't want to type each individual Python statement at the >>> prompt. That makes it easy to learn Python, but our goal is to create programs. In GNU/Linux parlance, our Python application programs can be called scripts. This is because Python programs fit the definition for a scripting language.

For our purposes, we'll focus on scripts that use the command-line interface (CLI). Everything we'll write will run in a simple terminal window. The advantage of this approach is speed and simplicity. We can add graphic user interfaces later. Or we can expand the essential core of a small script into a web service, once it works.

What is an application or a script? A script is simply a plain text file. We can use any text editor to create this file. A word processor is rarely a good idea, since word processors aren't good at producing plain text files.

If we're not working from the >>> REPL prompt, we'll need to explicitly display the output. We'll display output from a script using the print() function.

Here's a simple script we can use to produce a receipt for bribing (encouraging) our informant.

from decimal import Decimal

PENNY= Decimal('.00')

grd_usd= Decimal('247.616')
lunch_grd= Decimal('12900')
bribe_grd= 50000
cab_usd= Decimal('23.50')

lunch_usd= (lunch_grd/grd_usd).quantize(PENNY)
bribe_usd= (bribe_grd/grd_usd).quantize(PENNY)

print( "Lunch", lunch_grd, "GRD", lunch_usd, "USD" )
print( "Bribe", bribe_grd, "GRD", bribe_usd, "USD" )
print( "Cab", cab_usd, "USD" )
print( "Total", lunch_usd+bribe_usd+cab_usd, "USD" )

Let's break this script down so that we can follow it. Reading a script is a lot like putting a tail on an informant. We want to see where the script goes and what it does.

First, we imported the Decimal definition. This is essential for working with currency. We defined a value, PENNY, that we'll use to round off currency calculations to the nearest penny. We used a name in all caps to make this variable distinctive. It's not an ordinary variable; we should never see it on the left-hand side of an assignment statement again in the script.

We created the currency conversion factor, and named it grd_usd. That's a name that seems more meaningful than conversion in this context. Note that we also added a small suffix to our amount names. We used names such as lunch_grd, bribe_grd, and cab_usd to emphasize which currency is being used. This can help prevent head-scrambling problems.

Given the grd_usd conversion factor, we created two more variables, lunch_usd and bribe_usd, with the amounts converted to dollars and rounded to the nearest penny. If the accountants want to fiddle with the conversion factor—perhaps they can use a different bank than us spies—they can tweak the number and prepare a different receipt.

The final step was to use the print() function to write the receipt. We printed the three items we spent money on, showing the amounts in GRD and USD. We also computed the total. This will help the accountants to properly reimburse us for the mission.

We'll describe the output as primitive but acceptable. After all, they're only accountants. We'll look into pretty formatting separately.

Gathering user input

The simplest way to gather input is to copy and paste it into the script. That's what we did previously. We pasted the Greek Drachma conversion into the script: grd_usd= Decimal('247.616'). We could annotate this with a comment to help the accountants make any changes.

Additional comments come at the end of the line, after a # sign. They look like this:

grd_usd= Decimal('247.616') # Conversion from Mihalis Bank 5/15/14

This extra text is part of the application, but it doesn't actually do anything. It's a note to ourselves, our accountants, our handler, or the person who takes over our assignments when we disappear.

This kind of data line is easy to edit. But sometimes the people we work with want more flexibility. In that case, we can gather this value as input from a person. For this, we'll use the input() function.

We often break user input down into two steps like this:

        entry= input("GRD conversion: ")
        grd_usd= Decimal(entry)

The first line will write a prompt and wait for the user to enter the amount. The amount will be a string of characters, assigned to the variable entry. Python can't use the characters directly in arithmetic statements, so we need to explicitly convert them to a useful numeric type.

The second line will try to convert the user's input to a useful Decimal object. We have to emphasize the try part of this. If the user doesn't enter a string that represents a valid Decimal number, there will be a major crisis. Try it.

The crisis will look like this:

>>> entry= input("GRD conversion: ")
GRD conversion: 123.%$6
>>> grd_usd= Decimal(entry)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
decimal.InvalidOperation: [<class 'decimal.ConversionSyntax'>]

Rather than enter a good number, we entered 123.%$6.

The bletch starting with Traceback indicates that Python raised an exception. A crisis in Python always results in an exception being raised. Python defines a variety of exceptions to make it possible for us to write scripts that deal with these kinds of crises.

Once we've seen how to deal with crises, we can look at string data and some simple clean-up steps that can make the user's life a little easier. We can't fix their mistakes, but we can handle a few common problems that stem from trying to type numbers on a keyboard.

Handling exceptions

An exception such as decimal.InvalidOperation is raised when the Decimal class can't parse the given string to create a valid Decimal object. What can we do with this exception?

We can ignore it. In that case, our application program crashes. It stops running and the agents using it are unhappy. Not really the best approach.

Here's the basic technique for catching an exception:

    import decimal
    from decimal import Decimal

    entry= input("GRD conversion: ")
    try:
        grd_usd= Decimal(entry)
    except decimal.InvalidOperation:
        print("Invalid: ", entry)

We've wrapped the Decimal() conversion and assignment in a try: statement. If every statement in the try: block works, the grd_usd variable will be set. If, on the other hand, a decimal.InvalidOperation exception is raised inside the try: block, the except clause will be processed. This writes a message and does not set the grd_usd variable.

We can handle an exception in a variety of ways. The most common kind of exception handling will clean up in the event of some failure. For example, a script that attempts to create a file might delete the useless file if an exception was raised. The problem hasn't been solved: the program still has to stop. But it can stop in a clean, pleasant way instead of a messy way.

We can also handle an exception by computing an alternate answer. We might be gathering information from a variety of web services. If one doesn't respond in time, we'll get a timeout exception. In this case, we may try an alternate web service.

In another common exception-handling case, we may reset the state of the computation so that an action can be tried again. In this case, we'll wrap the exception handler in a loop that can repeatedly ask the user for input until they provide a valid number.

These choices aren't exclusive and some handlers can perform combinations of the previous exception handlers. We'll look at the third choice, trying again, in detail.
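The second choice, computing an alternate answer, can be sketched in a few lines. This is only an illustration under our own assumptions: the function name rate_or_default and the fallback rate of 250 are hypothetical and not part of the receipt script.

```python
from decimal import Decimal, InvalidOperation

def rate_or_default(entry, default):
    """Convert entry to a Decimal; fall back to default if it can't be parsed."""
    try:
        return Decimal(entry)
    except InvalidOperation:
        # Alternate answer: use the fallback instead of stopping the script.
        return default

good = rate_or_default("247.616", Decimal("250"))
fallback = rate_or_default("123.%$6", Decimal("250"))
print(good, fallback)
```

Whether a silent fallback is acceptable depends on the mission; for currency conversions, asking again is usually the safer choice.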

Looping and trying again

Here's a common recipe for getting input from the user:

import decimal
from decimal import Decimal

grd_usd= None
while grd_usd is None:
    entry= input("GRD conversion: ")
    try:
        grd_usd= Decimal(entry)
    except decimal.InvalidOperation:
        print("Invalid: ", entry)
print( grd_usd, "GRD = 1 USD" )

We'll add a tail to this and follow it around for a bit. The goal is to get a valid decimal value for our currency conversion, grd_usd. We'll initialize that variable as Python's special None object.

The while statement makes a formal declaration of our intent. We're going to execute the body of the while statement while the grd_usd variable remains set to None. Note that we're using the is operator to compare grd_usd to None. We're emphasizing a detail here: there's only one None object in Python and we're using that single instance. It's technically possible to tweak the definition of ==; we can't tweak the definition of is.

At the end of the while statement, grd_usd is None must be False; we can say grd_usd is not None. When we look at the body of the statement, we can see that only one statement sets grd_usd, so we're assured that it must be a valid Decimal object.

Within the body of the while statement, we've used our exception-handling recipe. First, we prompt and get some input, setting the entry variable. Then, inside the try statement, we attempt to convert the string to a Decimal value. If that conversion works, then grd_usd will have that Decimal object assigned. The object will not be None and the loop will terminate. Victory!

If the conversion of entry to a Decimal value fails, the exception will be raised. We'll print a message, and leave grd_usd alone. It will still have a value of None. The loop will continue until a valid value is entered.

Python has other kinds of loops; we'll get to them later in this chapter.

 

Handling text and strings


We've glossed over Python's use of string objects. Expressions such as Decimal('247.616') and input("GRD conversion: ") involve string literal values. Python gives us several ways to put strings into our programs; there's a lot of flexibility available.

Here are some examples of strings:

>>> "short"
'short'
>>> 'short'
'short'
>>> """A multiple line,
... very long string."""
'A multiple line,\nvery long string.'
>>> '''another multiple line
... very long string.'''
'another multiple line\nvery long string.'

We've used quotes (") and apostrophes (') to create short strings. These must be complete within a single line of programming. We used triple quotes and triple apostrophes to create long strings. These strings can stretch over multiple lines of a program.

Note that Python echoes the strings back to us with a \n character to show the line break. This is called a character escape. The \ character escapes the normal meaning of n. The sequence \n doesn't mean n; \n means the often invisible newline character. Python has a number of escapes. The newline character is perhaps the most commonly used escape.
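Here's a small sketch of a few common escapes. The variable names are ours, chosen for illustration. The point is that each escape sequence takes two keystrokes to type but produces a single character:

```python
# Each escape sequence below is one character in the resulting string.
newline = "\n"
tab = "\t"
backslash = "\\"     # a doubled backslash stands for one literal backslash
two_lines = "Message to HQ\nDevice Size"

print(len(newline), len(tab), len(backslash))
print(two_lines)     # the \n shows up as a real line break
```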

Sometimes we'll need to use characters which aren't present on our computer keyboards. For example, we might want to print one of the wide variety of Unicode special characters.

The following example works well when we know the Unicode number for a particular symbol:

>>> "\u2328"
'⌨'

The following example is better because we don't need to know the obscure code for a symbol:

>>> "\N{KEYBOARD}"
'⌨'
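If we have a symbol but don't know its name, the standard library's unicodedata module can do the detective work. This is a brief sketch; the variable names are our own:

```python
import unicodedata

symbol = "\u2328"
name = unicodedata.name(symbol)           # from character to official name
same = unicodedata.lookup("KEYBOARD")     # from name back to character
code = hex(ord(symbol))                   # the Unicode number, as a hex string
print(name, same, code)
```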

Converting between numbers and strings

We have two kinds of interesting string conversions: strings to numbers and numbers to strings.

We've seen functions such as Decimal() to convert a string to a number. We also have the functions: int(), float(), fractions.Fraction(), and complex(). When we have numbers that aren't in base 10, we can also use int() to convert those, as shown in the following code:

>>> int( 'dead', 16 )
57005
>>> int( '0b1101111010101101', 2 )
57005

We can create strings from numbers too. We can use functions such as hex(), oct(), and bin() to create strings in base 16, 8, and 2. We also have the str() function, which is the most general-purpose function to convert any Python object into a string of some kind.
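A short sketch of these conversions, using the same number as the int() examples. The variable names are illustrative:

```python
number = 57005
as_hex = hex(number)     # base 16, with a 0x prefix
as_oct = oct(number)     # base 8, with a 0o prefix
as_bin = bin(number)     # base 2, with a 0b prefix
as_text = str(number)    # plain base 10 text

# With a base of 0, int() uses the prefix to work out the base for us.
round_trip = [int(text, 0) for text in (as_hex, as_oct, as_bin)]
print(as_hex, as_oct, as_bin, as_text, round_trip)
```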

More valuable than these is the format() method of a string. This performs a variety of value-to-string conversions. It uses a conversion format specification or template string to define what the resulting string will look like.

Here's an example of using format() to convert several values into a single string. It uses a rather complex format specification string:

>>> "{0:12s} {1:6.2f} USD {2:8.0f} GRD".format( "lunch", lunch_usd, lunch_grd )
'lunch         52.10 USD    12900 GRD'

The format string has three conversion specifications: {0:12s}, {1:6.2f}, and {2:8.0f}. It also has some literal text, mostly spaces, but USD and GRD are part of the background literal text into which the data will be merged.

Each conversion specification has two parts: the item to convert and the format for that item. These two parts are separated by a : inside {}. We'll look at each conversion:

  • The item 0 is converted using the 12s format. This format produces a twelve-position string. The string lunch was padded out to 12 positions.

  • The item 1 is converted using the 6.2f format. This format produces a six-position string. There will be two positions to the right of the decimal point. The value of lunch_usd was formatted using this.

  • The item 2 is converted using an 8.0f format. This format produces an eight-position string with no positions to the right of the decimal point. The value of lunch_grd was formatted using this specification.
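To see each conversion on its own, we can apply the three formats one at a time. This is a sketch with stand-in values (52.1 and 12900 are assumed amounts, not results from the receipt script); repr() makes the padding visible:

```python
from decimal import Decimal

label = "{0:12s}".format("lunch")               # padded on the right to 12 positions
dollars = "{0:6.2f}".format(Decimal("52.1"))    # 6 positions, 2 after the decimal point
drachma = "{0:8.0f}".format(Decimal("12900"))   # 8 positions, none after the point

print(repr(label), repr(dollars), repr(drachma))
```

Strings are padded on the right and numbers on the left by default, which is what lines the columns up.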

We can do something like the following to improve our receipt:

receipt_1 = "{0:12s}              {1:6.2f} USD"
receipt_2 = "{0:12s} {1:8.0f} GRD {2:6.2f} USD"
print( receipt_2.format("Lunch", lunch_grd, lunch_usd) )
print( receipt_2.format("Bribe", bribe_grd, bribe_usd) )
print( receipt_1.format("Cab", cab_usd) )
print( receipt_1.format("Total", lunch_usd+bribe_usd+cab_usd) )

We've used two parallel format specifications. The receipt_1 string can be used to format a label and a single dollar value. The receipt_2 string can be used to format a label and two numeric values: one in dollars and the other in Greek Drachma.

This makes a better-looking receipt. That should keep the accountants off our back and let us focus on the real work: working on data files and folders.

Parsing strings

String objects can also be decomposed or parsed into substrings. We could easily write an entire chapter on all the various parsing methods that string objects offer. A common transformation is to strip extraneous whitespace from the beginning and end of a string. The idea is to remove spaces and tabs (and a few other nonobvious characters). It looks like this:

entry= input("GRD conversion: ").strip()

We've applied the input() function to get a string from the user. Then we've applied the strip() method of that string object to create a new string, stripped bare of whitespace characters. We can try it from the >>> prompt like this:

>>> "   123.45     ".strip()
'123.45'

This shows how a string with junk was pared down to the essentials. This can simplify a user's life; a few extra spaces won't be a problem.
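The strip() method has two one-sided relatives, lstrip() and rstrip(). Here's a brief sketch with a made-up entry that includes a tab and a newline:

```python
raw = "  \t 123.45 \n"
both = raw.strip()      # whitespace removed from both ends
left = raw.lstrip()     # only the leading whitespace removed
right = raw.rstrip()    # only the trailing whitespace removed
print(repr(both), repr(left), repr(right))
```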

Another transformation might be to split a string into pieces. Here's just one of the many techniques available:

>>> amount, space, currency = "123.45 USD".partition(" ")
>>> amount
'123.45'
>>> space
' '
>>> currency
'USD'

Let's look at this in detail. First, it's a multiple-assignment statement, where three variables are going to be set: amount, space, and currency.

The expression, "123.45 USD".partition(" "), works by applying the partition() method to a literal string value. We're going to partition the string on the space character. The partition() method returns three things: the substring in front of the partition, the partition character, and the substring after the partition.

The actual partition variable may also be assigned an empty string, ''. Try this:

amount, space, currency = "word".partition(" ") 

What are the values for amount, space, and currency?

If you use help(str), you'll see all of the various kinds of things a string can do. The names that have __ around them map to Python operators. __add__(), for example, is how the + operator is implemented.
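As a quick sketch of that mapping, we can call a few of the special methods directly and compare them with the usual syntax. We'd never write code this way in practice; it only shows what Python does behind the scenes:

```python
joined = "GRD" + "/USD"
same = "GRD".__add__("/USD")              # what the + operator calls

found = "USD" in joined
also_found = joined.__contains__("USD")   # what the in operator calls

size = len(joined)
also_size = joined.__len__()              # what the len() function calls
print(joined, same, found, also_found, size, also_size)
```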

 

Organizing our software


Python gives us a number of ways to organize software into conceptual units. Long, sprawling scripts are hard to read, repair, or extend. Python offers us packages, modules, classes, and functions. We'll see different organizing techniques throughout our agent training. We'll start with function definition.

We've used a number of Python's built-in functions in the previous sections. Defining our own function is something we do with the def statement. A function definition allows us to summarize (and in some cases generalize) some processing. Here's a simple function we can use to get a decimal value from a user:

def get_decimal(prompt):
    value= None
    while value is None:
        entry= input(prompt)
        try:
            value= Decimal(entry)
        except decimal.InvalidOperation:
            print("Invalid: ", entry)
    return value

This follows the design we showed previously, packaged as a separate function. This function will return a proper Decimal object: the value of the value variable. We can use our get_decimal() function like this:

grd_usd= get_decimal("GRD conversion: ")

Python allows a great deal of variability in how argument values are supplied to functions. One common technique is to have an optional parameter, which can be provided using a keyword argument. The print() function has this feature; we can name a file by providing a keyword argument value.

import sys
print("Error", file=sys.stderr)

If we don't provide the file parameter, the sys.stdout file is used by default.

We can do this in our own functions with the following syntax:

def report( grd_usd, target=sys.stdout ):
    lunch_grd= Decimal('12900')
    bribe_grd= 50000
    cab_usd= Decimal('23.50')

    lunch_usd= (lunch_grd/grd_usd).quantize(PENNY)
    bribe_usd= (bribe_grd/grd_usd).quantize(PENNY)

    receipt_1 = "{0:12s}              {1:6.2f} USD"
    receipt_2 = "{0:12s} {1:8.0f} GRD {2:6.2f} USD"
    print( receipt_2.format("Lunch", lunch_grd, lunch_usd), file=target )
    print( receipt_2.format("Bribe", bribe_grd, bribe_usd), file=target )
    print( receipt_1.format("Cab", cab_usd), file=target )
    print( receipt_1.format("Total", lunch_usd+bribe_usd+cab_usd), file=target )

We defined our report function to have two parameters. The grd_usd parameter is required. The target parameter has a default value, so it's optional.

We're also using a global variable, PENNY. This was something we set outside the function. The value is usable inside the function.

The four print() functions provide the file parameter using the keyword syntax: file=target. If we provided a value for the target parameter, that will be used; if we did not provide a value for target, the default value of the sys.stdout file will be used. We can use this function in several ways. Here's one version:

rate= get_decimal("GRD conversion: ")
print(rate, "GRD = 1 USD")
report(rate)

We provided the grd_usd parameter value positionally: it's first. We didn't provide a value for the target parameter; the default value will be used.

Here's another version:

rate= get_decimal("GRD conversion: ")
print(rate, "GRD = 1 USD", file=sys.stdout)
report(grd_usd=rate, target=sys.stdout)

In this example, we used the keyword parameter syntax for both the grd_usd and target parameters. Yes, the target parameter value recapitulated the default value. We'll look at how to create our own files in the next section.
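An optional target parameter also makes a function easy to check without a real file. The following is a sketch under our own assumptions: cab_line() is a hypothetical, cut-down cousin of report(), and io.StringIO (an in-memory, file-like object from the standard library) stands in for a file.

```python
import io
import sys
from decimal import Decimal

PENNY = Decimal('.00')

def cab_line(cab_usd, target=sys.stdout):
    """Write one receipt line to target (the console by default)."""
    line = "{0:12s} {1:6.2f} USD".format("Cab", cab_usd.quantize(PENNY))
    print(line, file=target)

buffer = io.StringIO()                      # collects text in memory
cab_line(Decimal('23.5'), target=buffer)    # keyword argument overrides the default
captured = buffer.getvalue()

cab_line(Decimal('23.5'))                   # no target given: goes to sys.stdout
```

The function body doesn't care where the output lands; the caller decides.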

 

Working with files and folders


Our computer is full of files. One of the most important features of our operating system is the way it handles files and devices. Python gives us an outstanding level of access to various kinds of files.

However, we've got to draw a few lines. All files consist of bytes. This is a reductionist view that's not always helpful. Sometimes those bytes represent Unicode characters, which makes reading the file relatively easy. Sometimes those bytes represent more complex objects, which can make reading the file quite difficult.

Pragmatically, files come in a wide variety of physical formats. Our various desktop applications (word processors, spreadsheets, and so on) all have unique formats for the data. Some of those physical formats are proprietary products, and this makes them exceptionally difficult to work with. The contents are obscure (not secure) and the cost of ferreting out the information can be extraordinary. We can always resort to examining the low-level bytes and recovering information that way.

Many applications work with files in widely standardized formats. This makes our life much simpler. The format may be complex, but the fact that it conforms to a standard means that we can recover all of the data. We'll look at a number of standardized formats for subsequent missions. For now, we need to get the basics under our belts.

Creating a file

We'll start by creating a text file that we can work with. There are several interesting aspects to working with files. We'll focus on the following two aspects:

  • Creating a file object. The file object is the Python view of an operating system resource. It's actually rather complex, but we can access it very easily.

  • Using the file context. A file has a particular life: open, read or write, and then close. To be sure that we close the file and properly disentangle the OS resources from the Python object, we're usually happiest using a file as a context manager. Using a with statement guarantees that the file is properly closed.

Our general template for creating a file looks like this:

with open("message1.txt", "w") as target:
    print( "Message to HQ", file=target )
    print( "Device Size 10 31/32", file=target )

We'll open the file with the open() function. In this case, the file is opened in write mode. We've used the print() function to write some data into the file.

Once the program finishes the indented context of the with statement, the file is properly closed and the OS resources are released. We don't need to explicitly close the file object.

We can also use something like this to create our file:

text="""Message to HQ\nDevice Size 10 31/32\n"""
with open("message1.txt", "w") as target:
    target.write(text)

Note the important difference here. The print() function automatically ends each line with a \n character. The write() method of a file object doesn't add anything.
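We can see the difference without touching the disk. This sketch assumes io.StringIO, an in-memory stand-in for a text file from the standard library; the variable names are ours:

```python
import io

printed = io.StringIO()
print("Message to HQ", file=printed)    # print() appends "\n" for us

written = io.StringIO()
written.write("Message to HQ")          # write() adds nothing
written.write("\n")                     # so we supply the newline ourselves

print(repr(printed.getvalue()), repr(written.getvalue()))
```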

In many cases, we may have more complex physical formatting for a file. We'll look at JSON or CSV files in a later section. We'll also look at reading and writing image files in Chapter 3, Encoding Secret Messages with Steganography.

Reading a file

Our general template for reading a file looks like this:

with open("message1.txt", "r") as source:
    text= source.read()
print( text )

This will create the file object, but it will be in read mode. If the file doesn't exist, we'll get an exception. The read() method will slurp the entire file into a single block of text. Once we're done reading the content of the file, we're also done with the with context. The file can be closed and the resources can be released. The text variable that we created will have the file's contents ready for further processing.

In many cases, we want to process the lines of the text separately. For this, Python gives us the for loop. This statement interacts with files to iterate through each line of the file, as shown in the following code:

with open("message1.txt", "r") as source:
    for line in source:
        print(line)

The output looks a bit odd, doesn't it?

It's double-spaced because each line read from the file contains a \n character at the end. The print() function automatically includes a \n character. This leads to double-spaced output.

We have two candidate fixes. We can tell the print() function not to include a \n character. For example, print(line, end="") does this.

A slightly better fix is to use the rstrip() method to remove the trailing whitespace from the right-hand end of line. This is slightly better because it's something we'll do often in a number of contexts. Attempting to suppress the output of the extra \n character in the print() function is too specialized to this one situation.
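Here's a sketch of the rstrip() fix. To keep it self-contained, it first writes a small throwaway file into a temporary folder; the tempfile and os modules and the clean_lines list are our additions for the demonstration.

```python
import os
import tempfile

# Create a small throwaway file so the example can run anywhere.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "message1.txt")
with open(path, "w") as target:
    print("Message to HQ", file=target)
    print("Device Size 10 31/32", file=target)

clean_lines = []
with open(path, "r") as source:
    for line in source:
        clean = line.rstrip()    # drop the trailing "\n" and any other whitespace
        clean_lines.append(clean)
        print(clean)             # print() now supplies the only newline
```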

In some cases, we may have files where we need to filter the lines, looking for particular patterns. We might have a loop that includes conditional processing via the if statement, as shown in the following code:

with open("message1.txt", "r") as source:
    for line in source:
        junk1, keyword, size= line.rstrip().partition("Size")
        if keyword != '':
            print( size )

This shows a typical structure for text processing programs. First, we open the file via a with statement context; this assures us that the file will be closed properly no matter what happens.

We use the for statement to iterate through all lines of the file. Each line goes through a two-step process: the rstrip() method removes trailing whitespace, and the partition() method breaks the line around the keyword Size.

The if statement defines a condition (keyword != '') and some processing that's done only if the condition is True. If the condition is False (the value of keyword is ''), the indented body of the if statement is silently skipped.

The assignment and if statements form the body of the for statement. These two statements are executed once for every line in the file. When we get to the end of the for statement, we can be assured that all lines were processed.

We have to note that we can create an exception to the usual "for all lines" assumption about processing files with the for statement. We can use the break statement to exit early from the loop, breaking the usual assumption. We'd prefer to avoid the break statement, making it easy to see that a for statement works for all lines of a file.

At the end of the for statement, we're done processing the file. We're done with the with context, too. The file will be closed.

Defining more complex logical conditions

What if we're looking for more than one pattern? What if we're processing more complex data?

Let's say we've got something like this in a file:

Message to Field Agent 006 1/2 
Proceed to Rendezvous FM16uu62
Authorization to Pay $250 USD

We're looking for two keywords: Rendezvous and Pay. Python gives us the elif clause as part of the if statement. This clause provides a tidy way to handle multiple conditions gracefully. Here's a script to parse a message to us from the headquarters:

amount= None
location= None
with open("message2.txt", "r") as source:
    for line in source:
        clean= line.lower().rstrip()
        junk, pay, pay_data= clean.partition("pay")
        junk, meet, meet_data= clean.partition("rendezvous")
        if pay != '':
            amount= pay_data
        elif meet != '':
            location= meet_data
        else:
            pass # ignore this line 
print("Budget", amount, "Meet", location)

We're searching the contents in the file for two pieces of information: the rendezvous location and the amount we can use to bribe our contact. In effect, we're going to summarize this file down to two short facts, discarding the parts we don't care about.

As with the previous examples, we're using a with statement to create a processing context. We're also using the for statement to iterate through all lines of the file.

We've used a two-step process to clean each line. First, we used the lower() method to create a string in lowercase. Then we used the rstrip() method to remove any trailing whitespace from the line.

We applied the partition() method to the cleaned line twice. One partition looked for pay and the other partition looked for rendezvous. If the line could be partitioned on pay, the pay variable (and pay_data) would have values not equal to a zero-length string. If the line could be partitioned on rendezvous, then the meet variable (and meet_data) would have values not equal to a zero-length string. Python abbreviates else if as elif.

If none of the previous conditions are true, we don't need to do anything. We don't need an else: clause. But we decided to include the else: clause in case we later needed to add some processing. For now, there's nothing more to do. In Python, the pass statement does nothing. It's a syntactic placeholder; a thing to write when we must write something.
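The same if and elif logic can be packaged as a function that accepts any iterable of lines, which lets us try it without a file. This is a sketch: the name summarize() and the extra strip() calls are our own additions.

```python
def summarize(lines):
    """Return (amount, location) found in an iterable of message lines."""
    amount = None
    location = None
    for line in lines:
        clean = line.lower().rstrip()
        junk, pay, pay_data = clean.partition("pay")
        junk, meet, meet_data = clean.partition("rendezvous")
        if pay != '':
            amount = pay_data.strip()      # drop the space after the keyword
        elif meet != '':
            location = meet_data.strip()
        # any other line is ignored
    return amount, location

message = [
    "Message to Field Agent 006 1/2 \n",
    "Proceed to Rendezvous FM16uu62\n",
    "Authorization to Pay $250 USD\n",
]
budget, meeting = summarize(message)
print("Budget", budget, "Meet", meeting)
```

Because each line is converted to lowercase before it's partitioned, the extracted values come back in lowercase too.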

 

Solving problems – recovering a lost password


We'll apply many of our techniques to writing a program to help us poke around inside a locked ZIP file. It's important to note that any competent encryption scheme doesn't encrypt a password. Passwords are, at worst, reduced to a hash value. When someone enters a password, the hash values are compared. The original password remains essentially unrecoverable except by guessing.
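The idea of comparing hash values can be sketched with the standard library's hashlib module. This is an illustration only: SHA-256 is our stand-in, not the scheme that ZIP files use, and real systems add a salt and a deliberately slow hash. The password swordfish and the function names are hypothetical.

```python
import hashlib

def digest(password):
    """Reduce a password to a hex digest; the password itself isn't kept."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

stored = digest("swordfish")    # only this hash would be saved

def check(candidate):
    """Compare the hash of a guess with the stored hash."""
    return digest(candidate) == stored

print(check("swordfish"), check("Swordfish"))
```

There's no practical way to run digest() backwards, which is why recovery comes down to guessing.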

We'll look at a kind of brute-force password recovery scheme. It will simply try all of the words in a dictionary. More elaborate guessing schemes will use dictionary words and punctuation to form longer and longer candidate passwords. Even more elaborate guessing will include leet speak replacements of characters. For example, using something like 1337 sp3@k instead of leet speak.
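Leet speak replacement can be sketched with a string translation table. The substitutions below are a small, hypothetical selection; a serious guessing tool would try many more variants.

```python
# Hypothetical substitution table: letter -> look-alike digit or symbol.
LEET = str.maketrans({"a": "@", "e": "3", "l": "1", "t": "7"})

def leet(word):
    """Return the word with each listed letter replaced by its look-alike."""
    return word.lower().translate(LEET)

def candidates(word):
    """Return the guesses we'd try for one dictionary word."""
    return [word, leet(word)]

print(candidates("leet"), candidates("speak"))
```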

Before we look into how ZIP files work, we'll have to find a usable word corpus. A common stand-in for a corpus is a spell-check dictionary. For GNU/Linux or Mac OS X computers, there are several places a dictionary can be found. Three common places are: /usr/dict/words, /usr/share/dict/words, or possibly /usr/share/myspell/dicts.

Windows agents may have to search around a bit for similar dictionary resources. Look in %AppData%\Microsoft\Spelling\EN as a possible location. The dictionaries are often a .dic file. There may also be an associated .aff (affix rules) file, with additional rules for building words from the stem words (or lemmas) in the .dic file.

If we can't track down a usable word corpus, it may be best to install a standalone spell checking program, along with its dictionaries. Programs such as aspell, ispell, Hunspell, Open Office, and LibreOffice contain extensive collections of spelling dictionaries for a variety of languages.

There are other ways to get various word corpora. One way is to search all of the text files for all of the words in all of those files. The words we used to create a password may be reflected in words we actually use in other files.
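Harvesting words from our own text can be sketched with a regular expression. This assumes that a word is simply a run of the letters a to z, which ignores accents, digits, and apostrophes; the function name words_in() is ours.

```python
import re

def words_in(text):
    """Return the set of distinct lowercase words found in text."""
    return set(re.findall(r"[a-z]+", text.lower()))

sample = "Proceed to Rendezvous. Authorization to Pay $250 USD"
corpus = sorted(words_in(sample))
print(corpus)
```

Applying words_in() to the contents of each text file and combining the resulting sets would produce a personal corpus.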

Another good approach is to use the Python Natural Language Toolkit (NLTK), which has a number of resources for handling natural language processing. As this manual was going to press, a version had been released that works with Python 3. See https://pypi.python.org/pypi/nltk. This library provides lexicons, several wordlist corpora, and word stemming tools that are far better than simplistic spell-checking dictionaries.

Your mission is to locate a dictionary on your computer. If you can't find one, then download a good spell-check program and use its dictionary. A web search for web2 (Webster's Second International) may turn up a usable corpus.

Reading a word corpus

The first thing we need to do is read our spell-check corpus. We'll call it a corpus (a body of words), not a dictionary. The examples will be based on web2 (Webster's Second International), all 234,936 words' worth. This is generally available in BSD Unix and Mac OS X.

Here's a typical script that will examine a corpus:

count= 0
corpus_file = "/usr/share/dict/words" 
with open( corpus_file ) as corpus:
    for line in corpus:
        word= line.strip()
        if len(word) == 10:
            print(word)
            count += 1
print( count )

We've opened the corpus file and read all of the lines. The word was located by stripping whitespace from the line; this removes the trailing \n character. An if statement was used to filter the 10-letter words. There are 30,878 of those, from abalienate to Zyzzogeton.
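If we want the counts for every word length at once, the standard library's collections.Counter can do the tallying. This sketch uses a tiny made-up list in place of the corpus file; length_profile() is a hypothetical helper.

```python
from collections import Counter

def length_profile(words):
    """Count how many words there are of each length."""
    return Counter(len(word.strip()) for word in words)

# A stand-in for the lines of a corpus file, trailing newlines included.
sample = ["abalienate\n", "spy\n", "Zyzzogeton\n", "agent\n"]
profile = length_profile(sample)
print(profile[10], profile[3], profile[5])
```

An open corpus file could be passed in place of the sample list, since the function only needs something it can iterate over line by line.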

This little script isn't really part of any larger application. It's a kind of technology spike—something we're using to nail down a detail. When writing little scripts like this, we'll often skip careful design of classes or functions and just slap some Python statements into a file.

In POSIX-compliant OSes, we can do two more things to make a script easy to work with. First, we can add a special comment on the very first line of the file to help the OS figure out what to do with it. The line looks like this:

#!/usr/bin/env python3

This tells the OS how to handle the script. Specifically, it tells the OS to use the env program. The env program will then locate our installation of Python 3. Responsibility will be handed off to the python3 program.

The second step is to mark the script as executable. We use the OS command, chmod +x some_file.py, to mark a Python file as an executable script.

If we've done these two steps, we can execute a script by simply typing its name at the command prompt.
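The second step can also be done from Python itself. Here's a rough sketch for POSIX systems only; the make_executable name is an assumption, and unlike chmod +x it ignores the umask and simply sets all three execute bits. The demonstration writes a throwaway script into a temporary directory.

```python
import os
import stat
import tempfile

def make_executable(path):
    """Add execute permission for user, group, and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

with tempfile.TemporaryDirectory() as folder:
    script = os.path.join(folder, "some_file.py")
    with open(script, "w") as target:
        # The special first-line comment, followed by the script body.
        target.write("#!/usr/bin/env python3\nprint('hello world')\n")
    make_executable(script)
    # Show the resulting permission bits in octal.
    print(oct(stat.S_IMODE(os.stat(script).st_mode)))
```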

In Windows, the file extension (.py) is associated with the Python program. There is an Advanced Settings panel that defines these file associations. When you installed Python, the association was built by the installer. This means that you can enter the name of a Python script and Windows will search through the directories named in your PATH value and execute that script properly.

Reading a ZIP archive

We'll use Python's zipfile module to work with a ZIP archive. This means we'll need to use import zipfile before we can do anything else. Since a ZIP archive contains multiple files, we'll often want to get a listing of the available files in the archive. Here's how we can survey an archive:

import zipfile
with zipfile.ZipFile("demo.zip", "r") as archive:
    archive.printdir()

We've opened the archive, creating a file processing context. We then used the archive's printdir() method to dump the members of the archive.
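The printdir() method only prints. When we want the listing as data we can work with, the infolist() method gives us one ZipInfo object per member. The following sketch assumes a helper name, member_summary, of our own choosing, and it builds a small unencrypted archive in memory instead of relying on demo.zip.

```python
import io
import zipfile

def member_summary(archive):
    """Return a list of (filename, uncompressed size) pairs for an open ZipFile."""
    return [(info.filename, info.file_size) for info in archive.infolist()]

# Build a small unencrypted archive in memory so the sketch runs anywhere.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("note.txt", "meet at dawn")

with zipfile.ZipFile(buffer, "r") as archive:
    summary = member_summary(archive)
print(summary)
```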

We can't, however, extract any of the files because the ZIP archive was encrypted and we lost the password. Here's a script that will try to read the first member:

import zipfile
with zipfile.ZipFile("demo.zip", "r") as archive:
    archive.printdir()
    first = archive.infolist()[0]
    with archive.open(first) as member:
        text = member.read()
        print(text)

We've created a file processing context using the open archive. We used the infolist() method to get information on each member. The archive.infolist()[0] expression will pick item zero from the list, that is, the first item.

We tried to create a file processing context for this specific member. Instead of seeing the content of the member, we get an exception. The details will vary, but your exception message will look like this:

RuntimeError: File <zipfile.ZipInfo object at 0x1007e78e8> is encrypted, password required for extraction

The hexadecimal number (0x1007e78e8) may not match your output, but you'll still get an error trying to read an encrypted ZIP file.
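We don't have to wait for the exception to learn that a member is encrypted. The ZIP format records this in bit 0 of each member's general purpose flags, which Python exposes as the flag_bits attribute of a ZipInfo object. Here's a sketch that lists the encrypted members; the encrypted_members name and the ENCRYPTED_FLAG constant are our own labels. Since the standard library can't create encrypted archives, the demonstration uses an unencrypted one and should report no members.

```python
import io
import zipfile

ENCRYPTED_FLAG = 0x1  # bit 0 of the ZIP general purpose flags

def encrypted_members(archive):
    """Return the names of members whose encryption flag is set."""
    return [
        info.filename
        for info in archive.infolist()
        if info.flag_bits & ENCRYPTED_FLAG
    ]

# An unencrypted in-memory archive, used only to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("note.txt", "meet at dawn")

with zipfile.ZipFile(buffer, "r") as archive:
    locked = encrypted_members(archive)
print(locked)
```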

Using brute-force search

To recover the files, we'll need to resort to a brute-force search for a workable password. This means inserting our corpus-reading loop into our archive processing context. It's a bit of flashy copy-and-paste that leads to a script like the following:

import zipfile
import zlib

corpus_file = "/usr/share/dict/words"

with zipfile.ZipFile("demo.zip", "r") as archive:
    first = archive.infolist()[0]
    print("Reading", first.filename)
    with open(corpus_file) as corpus:
        for line in corpus:
            # Passwords must be bytes, so encode each candidate word.
            word = line.strip().encode("ASCII")
            try:
                with archive.open(first, "r", pwd=word) as member:
                    text = member.read()
                print("Password", word)
                print(text)
                break
            except (RuntimeError, zlib.error, zipfile.BadZipFile):
                # A wrong password; move on to the next word.
                pass

We've imported two libraries: zipfile as well as zlib. We added zlib because it turns out that we'll sometimes see zlib.error exceptions when guessing passwords. We created a context for our open archive file. We used the infolist() method to get information on the members and fetched just the first member from that list. If we can read one file, we can read them all, assuming every member was encrypted with the same password.

Then we opened our corpus file, and created a file processing context for that file. For each line in the corpus, we used two methods of the line: the strip() method will remove the trailing "\n", and the encode("ASCII") method will transform the line from Unicode characters to ASCII bytes. We need this because the zipfile module expects a password as bytes, not as a proper Unicode character string.
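The strip-and-encode step can be isolated into a small generator. The sketch below is one possible arrangement, not the script's own code: the candidate_passwords name is hypothetical, and the decision to skip blank lines and words that ASCII can't represent is our assumption (the script above would instead stop with a UnicodeEncodeError on such a word).

```python
def candidate_passwords(lines):
    """Yield each line as ASCII bytes, skipping blanks and non-ASCII words."""
    for line in lines:
        word = line.strip()
        if not word:
            continue
        try:
            encoded = word.encode("ASCII")
        except UnicodeEncodeError:
            # Assumption: quietly skip words that ASCII cannot represent.
            continue
        yield encoded

sample = ["abalienate\n", "\n", "café\n", "Zyzzogeton\n"]
passwords = list(candidate_passwords(sample))
print(passwords)
```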

The try: block attempts to open and read the first member. We created a file processing context for this member within the archive. We tried to read the member. If anything goes wrong while we are trying to read the encrypted member, an exception will be raised. The usual culprit, of course, is attempting to read the member with the wrong password.

If everything works well, then we guessed the correct password. We can print the recovered password, as well as the text of the member as a confirmation.

Note that we've used a break statement to end the corpus-processing for loop. This changes the for loop's semantics from "for all words" to "there exists a word". The break statement means the loop ends as soon as a valid password is found. No further words in the corpus need to be processed.
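The "there exists" pattern is worth seeing on its own, away from the ZIP details. In this sketch, first_match is a hypothetical helper and is_valid is a stand-in for the attempt to read the archive member. Python's for statement accepts an else clause that runs only when the loop finishes without a break, which gives us a tidy place to handle the case where no word works.

```python
def first_match(candidates, is_valid):
    """Return the first candidate accepted by is_valid, or None if none is."""
    for candidate in candidates:
        if is_valid(candidate):
            break  # a match exists; stop searching
    else:
        # Runs only when the loop ended without a break.
        return None
    return candidate

words = [b"alpha", b"bravo", b"charlie"]
found = first_match(words, lambda w: w.startswith(b"b"))
missing = first_match(words, lambda w: w == b"zulu")
print(found, missing)
```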

We've listed three kinds of exceptions that might be raised from attempting to use a bad password. It's not obvious why different kinds of exceptions may be raised by wrong passwords. But it's easy to run some experiments to confirm that a variety of different exceptions really are raised by a common underlying problem.
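One way to run such an experiment is to count the exception types raised across many guesses. The following is a sketch under stated assumptions: tally_failures is a hypothetical helper, and fake_attempt is a made-up stand-in that raises different exceptions on purpose, so the numbers it produces say nothing about how a real archive behaves.

```python
from collections import Counter

def tally_failures(attempt, candidates):
    """Call attempt(candidate) for each candidate; count exception types raised."""
    counts = Counter()
    for candidate in candidates:
        try:
            attempt(candidate)
        except Exception as error:
            counts[type(error).__name__] += 1
    return counts

def fake_attempt(word):
    """Made-up stand-in for reading an encrypted member with a password."""
    if len(word) % 2:
        raise RuntimeError("pretend bad password")
    if word != b"open":
        raise ValueError("pretend corrupt data")

counts = tally_failures(fake_attempt, [b"abc", b"abcd", b"open", b"x"])
print(counts)
```

With a real archive, attempt would be a function that opens the first member with pwd set to the candidate and reads it; the resulting counts show which exception classes belong in the except clause.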

 

Summary


In this chapter, we saw the basics of our espionage toolkit: Python and our text editor of choice. We've worked with Python to manipulate numbers, strings, and files. We saw a number of Python statements: assignment, while, for, if, elif, break, and def. We saw how an expression (such as print("hello world")) can be used as a Python statement.

We also looked at the Python API for processing a ZIP file. We saw how Python works with popular file-archive formats. We even saw how to use a simple corpus of words to recover a simple password.

Now that we have the basics, we're ready for more advanced missions. The next thing we've got to do is start using the World Wide Web (WWW) to gather information and carry it back to our computer.

About the Author

By Steven F. Lott