Python for Secret Agents - Volume II

Chapter 1. New Missions – New Tools

The espionage job is to gather and analyze data. This requires us to use computers and software tools.

However, a secret agent's job is not limited to collecting data. It involves processing, filtering, and summarizing data, and also involves confirming the data and assuring that it contains meaningful and actionable information.

Any aspiring agent would do well to study the history of the World War II double agent who worked for British intelligence under the code name Garbo. This is an inspiring and informative story of how secret agents operated in wartime.

We're going to look at a variety of complex missions, all of which will involve Python 3 to collect, analyze, summarize, and present data. Due to our previous successes, we've been asked to expand our role in a number of ways.

HQ's briefings are going to help agents make some technology upgrades. We're going to locate and download new tools for new missions that we're going to be tackling. While we're always told that a good agent doesn't speculate, the most likely reason for new tools is a new kind of mission and dealing with new kinds of data or new sources. The details will be provided in the official briefings.

Field agents are going to be encouraged to branch out into new modes of data acquisition. The Internet of Things leads to a number of interesting sources of data. HQ has identified some sources that will push the field agents in new directions. We'll be asked to push the edge of the envelope.

We'll look at the following topics:

Tool upgrades, in general. Then, we'll upgrade Python to the latest stable version. We'll also upgrade the pip utility so that we can download more tools.
Reviewing the Python language. This will only be a quick summary.
Our first real mission will be an upgrade to the Beautiful Soup package. This will help us in gathering information from HTML pages.
After upgrading Beautiful Soup, we'll use this package to gather live data from a web site.
We'll do a sequence of installations in order to prepare our toolkit for later missions.
In order to build our own gadgets, we'll have to install the Arduino IDE.

This will give us the tools for a number of data gathering and analytical missions.

Background briefing on tools

The organization responsible for tools and technology is affectionately known as The Puzzle Palace. They have provided some suggestions on what we'll need for the missions that we've been assigned. We'll start with an overview of the state of the art in Python tools, handed down from one of the puzzle solvers.

Some agents have already upgraded to Python 3.4. However, not all agents have done this. It's imperative that we use the latest and greatest tools.

There are four good reasons for this, as follows:

Features: Python 3.4 adds a number of additional library features that we can use. The list of features is available at https://docs.python.org/3/whatsnew/3.4.html.
Performance: Each new version is generally a bit faster than the previous version of Python.
Security: While Python doesn't have any large security holes, there are new security changes in Python.
Housecleaning: A number of rarely used features have been removed.

Some agents may want to start looking at Python 3.5. This release is anticipated to include some optional features to provide data type hints. We'll look at this in a few specific cases as we go forward with the mission briefings. The type-analysis features can lead to improvements in the quality of the Python programming that an agent creates. The Puzzle Palace report is based on intelligence gathered at PyCon 2015 in Montreal, Canada. Agents are advised to follow the Python Enhancement Proposals (PEP) closely. Refer to https://www.python.org/dev/peps/.

We'll focus on Python 3.4. For any agent who hasn't upgraded to Python 3.4.3, we'll look at the best way to approach this.

If you're comfortable with working on your own, you can try to move further and download and install Python 3.5. Here, the warning is that it's very new and it may not be quite as robust as the Python version 3.4. Refer to PEP 478 (https://www.python.org/dev/peps/pep-0478/) for more information about this release.

Doing a Python upgrade

It's important to consider each major release of Python as an add-on and not a replacement. Any release of Python 2 should be left in place. Most field agents will have several side-by-side versions of Python on their computers. The following are the two common scenarios:

The OS uses Python 2. Mac OS X and Linux computers require Python 2; this is the default version of Python that's found when we enter python at the command prompt. We have to leave this in place.
We might also have an older Python 3, which we used for the previous missions. We don't want to remove this until we're sure that we've got everything in place in order to work with Python 3.4.

We have to distinguish between the major, minor, and micro versions of Python. Python 3.4.3 and 3.4.2 have the same minor version (3.4). We can replace the micro version 3.4.2 with 3.4.3 without a second thought; they're always compatible with each other. However, we don't treat the minor versions quite so casually. We often want to leave 3.3 in place.
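The major, minor, and micro distinction can be sketched in a few lines of Python. The helper names parse_version and same_minor are hypothetical, chosen only for this illustration, and the sketch assumes a plain dotted version string such as '3.4.3' with no release-candidate suffix.

```python
def parse_version(text):
    """Split a dotted version string such as '3.4.3' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def same_minor(left, right):
    """Return True when two versions share their major and minor numbers."""
    return parse_version(left)[:2] == parse_version(right)[:2]


# A micro upgrade can replace the existing installation.
print(same_minor("3.4.3", "3.4.2"))
# A minor upgrade should be installed side by side.
print(same_minor("3.3.6", "3.4.3"))
```

When same_minor() reports True, the two releases differ only in their micro version, which is the case the text treats as a drop-in replacement.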

Generally, we do a field upgrade as shown in the following:

Download the installer that is appropriate for the OS and Python version. Start at this URL: https://www.python.org/downloads/. The web server can usually identify your computer's OS and suggest the appropriate download with a big, friendly, yellow button. Mac OS X agents will notice that we now get a .pkg (package) file instead of a .dmg (disk image) containing .pkg. This is a nice simplification.
When installing a new minor version, make sure to install in a new directory: keep 3.3 separate from 3.4. When installing a new micro version, replace any existing installation; replace 3.4.2 with 3.4.3.
- For Mac OS X and Linux, the installers will generally use names that include python3.4 so that the minor versions are kept separate and the micro versions replace each other.
- For Windows, we have to make sure we use a distinct directory name based on the minor version number. For example, we want to install all new 3.4.x micro versions in C:\Python34. If we want to experiment with the Python 3.5 minor version, it would go in C:\Python35.
Tweak the PATH environment setting to choose the default Python.
- This information is generally in our ~/.bash_profile file. In many cases, the Python installer will update this file in order to assure that the newest Python is at the beginning of the string of directories that are listed in the PATH setting. This file is generally used when we log in for the first time. We can either log out and log back in again, or restart the terminal tool, or we can use the source ~/.bash_profile command to force the shell to refresh its environment.
- For Windows, we must update the advanced system settings to tweak the value of the PATH environment variable. In some cases, this value has a huge list of paths; we'll need to copy the string and paste it in a text editor to make the change. We can then copy it from the text editor and paste it back in the environment variable setting.
After upgrading Python, use pip3.4 (or easy_install-3.4) to add the additional packages that we need. We'll look at some specific packages in mission briefings. We'll start by adding any packages that we use frequently.

At this point, we should be able to confirm that our basic toolset works. Linux and Mac OS agents can use the following command:

MacBookPro-SLott:Code slott$ python3.4

This should confirm that we've downloaded and installed Python and made it a part of our OS settings. The greeting will show which micro version of Python 3.4 we have installed.

For Windows, the command's name is usually just python. It would look similar to the following:

C:\> python

The Mac OS X interaction should include the version; it will look similar to the following code:

MacBookPro-SLott:NavTools-1.2 slott$ python3.4
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)

We've entered the python3.4 command. This shows us that things are working very nicely. We have Python 3.4.3 successfully installed.

We don't want to make a habit of using the python or python3 commands in order to run Python from the command line. These names are too generic and we could accidentally use Python 3.3 or Python 3.5, depending on what we have installed. We need to be intentional about using python3.4.
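A script can also protect itself against being run by the wrong interpreter. The following is a minimal sketch; the function name meets_minimum and the choice of 3.4 as the floor are assumptions made for this example, not something HQ has mandated.

```python
import sys


def meets_minimum(required=(3, 4)):
    """Compare the running interpreter's (major, minor) pair with a required pair."""
    return tuple(sys.version_info[:2]) >= tuple(required)


if not meets_minimum():
    # Stop early with a clear message rather than failing later in a confusing way.
    raise SystemExit("This script needs Python 3.4 or newer")

print("Running Python {0}.{1}.{2}".format(*sys.version_info[:3]))
```

Tuples compare item by item, so (3, 5) is greater than (3, 4) and (3, 3) is less; this is why a simple >= comparison is enough here.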

Preliminary mission to upgrade pip

The first time that we try to use pip3.4, we may see an interaction as shown in the following:

MacBookPro-SLott:Code slott$ pip3.4 install anything
You are using pip version 6.0.8, however version 7.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

The version numbers may be slightly different; this is not too surprising. The packaged version of pip isn't always the latest and greatest version. Once we've installed the Python package, we can upgrade pip3.4 to the most recent release. We'll use pip to upgrade itself.

It looks similar to the following code:

MacBookPro-SLott:Code slott$ pip3.4 install --upgrade pip
You are using pip version 6.0.8, however version 7.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting pip from https://pypi.python.org/packages/py2.py3/p/pip/pip-7.0.3-py2.py3-none-any.whl#md5=6950e1d775fea7ea50af690f72589dbd
  Downloading pip-7.0.3-py2.py3-none-any.whl (1.1MB)
    100% |################################| 1.1MB 398kB/s 
Installing collected packages: pip
  Found existing installation: pip 6.0.8
    Uninstalling pip-6.0.8:
      Successfully uninstalled pip-6.0.8

Successfully installed pip-7.0.3

We've run the pip installer to upgrade pip. We're shown some details about the files that are downloaded and the new version that is installed. We were able to do this with a simple pip3.4 under Mac OS X.

Some packages will require system privileges that are available via the sudo command. While it's true that a few packages don't require system privileges, it's easy to assume that privileges are always required. For Windows, of course, we don't use sudo at all.

On Mac OS X, we'll often need to use sudo -H instead of simply using sudo. This option will make sure that the proper HOME environment variable is used to manage a cache directory.

Note that your actual results may differ from this example, depending on how out-of-date your copy of pip turns out to be. This pip install --upgrade pip is a pretty frequent operation as the features advance.

Background briefing: review of the Python language

Before moving on to our first mission, we'll review some essentials of the Python language, and the ways in which we'll use it to gather and disseminate data. We'll start by reviewing the interactive use of Python to do some data manipulation. Then we'll look at statements and script files.

When we start Python from the Terminal tool or the command line, we'll see an interaction that starts as shown in the following:

MacBookPro-SLott:Code slott$ python3.4
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

The >>> prompt is Python's read-eval-print loop (REPL) that is waiting for us to enter a statement. If we use Python's development environment, IDLE, we'll also see this >>> prompt.

One of the simplest kinds of statements is a single expression. We can, for example, enter an arithmetic expression. The Read Eval Print Loop (REPL) will print the result automatically. Here's an example of simple math:

>>> 355/113
3.1415929203539825

We entered an expression statement and Python printed the resulting object. This gives us a way to explore the language. We can enter things and see the results, allowing us to experiment with new concepts.

Python offers us a number of different types of objects to work with. The first example showed integer objects, 355 and 113, as well as a floating-point result object, 3.1415929203539825.

In addition to integers and floats, we also have complex numbers. With the standard library, we can introduce exact decimal and fraction values using the decimal or fractions modules. Python can coerce values between the various types. If we have mixed values on either side of an operator, one of the values will be pushed up the numeric tower so that both operands have the same type. This means that integers can be promoted up to float and float can be promoted up to complex if necessary.
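Here's a small sketch of the numeric tower at work. The variable names are only for illustration; the Fraction and Decimal classes come from the standard library modules named above.

```python
from decimal import Decimal
from fractions import Fraction

int_plus_float = 2 + 3.0                    # the int is promoted to float
float_plus_complex = 2.0 + 3j               # the float is promoted to complex
half = Fraction(1, 3) + Fraction(1, 6)      # exact rational arithmetic
thirty_cents = Decimal("0.10") + Decimal("0.20")  # exact decimal arithmetic

print(type(int_plus_float).__name__, int_plus_float)
print(type(float_plus_complex).__name__, float_plus_complex)
print(half)
print(thirty_cents)
```

Note that the Decimal values are built from strings; building them from float literals would carry the float's approximation into the decimal value.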

Python gives us a variety of operators. The common arithmetic operators are +, -, *, /, //, %, and **. These implement addition, subtraction, multiplication, true division, floor division, modulus, and raising to a power. The true division, /, will coerce integers to floating-point so that the fractional part of the answer is kept. The floor division, //, provides rounded-down answers, even with floating-point operands.
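The difference between the two division operators is easiest to see side by side. This is a short sketch with illustrative variable names:

```python
true_div = 7 / 2         # 3.5: true division keeps the fraction
even_div = 4 / 2         # 2.0: the result is a float even when it divides evenly
floor_div = 7 // 2       # 3: floor division rounds down
floor_float = 7.0 // 2   # 3.0: rounded down, but still a float
floor_negative = -7 // 2 # -4: "down" means toward negative infinity
remainder = 7 % 2        # 1
power = 2 ** 10          # 1024

print(true_div, even_div, floor_div, floor_float, floor_negative, remainder, power)
```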

We also have some bit-fiddling operators: ~, &, |, ^, <<, and >>. These implement unary bitwise inversion, and, or, exclusive or, shift left, and shift right. These work with individual bits in a number. They're not logical operators at all.
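A quick sketch shows the bit-fiddling operators applied to two small integers, written in binary so that the individual bits are visible. The last two lines contrast them with the logical and and or operators, which work on whole values rather than bits.

```python
a = 0b1100   # 12
b = 0b1010   # 10

bit_and = a & b      # 0b1000 == 8
bit_or = a | b       # 0b1110 == 14
bit_xor = a ^ b      # 0b0110 == 6
inverted = ~a        # -13, because ~x is defined as -x - 1
shifted_left = a << 2   # 48
shifted_right = a >> 2  # 3

logical_and = a and b   # 10: the right-hand operand, not a bitwise result
logical_or = a or b     # 12: the left-hand operand

print(bin(bit_and), bin(bit_or), bin(bit_xor), inverted, shifted_left, shifted_right)
print(logical_and, logical_or)
```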

What about more advanced math? We'll need to import libraries if we need more sophisticated features. For example, if we need to compute a square root, we'll need to import the math module, as follows:

>>> import math
>>> p= math.sqrt(7+math.sqrt(6+math.sqrt(5)))

Importing the math module creates a new object, math. This object is a kind of namespace that contains useful functions and constants. We'll use this import technique frequently to add features that we need to create useful software.

Using variables to save results

We can put a label on an object using the assignment statement. We often describe this as assigning an object to a variable; however, it's more like assigning a symbolic label to an object. The variable name (or label) must follow a specific set of syntax rules. It has to begin with a letter or the _ character, and can include any combination of letters, digits, and _ characters. We'll often use simple words such as x, n, samples, and data. We can use longer_names where this adds clarity.

Using variables allows us to build up results in steps by assigning names to intermediate results. Here's an example:

>>> n = 355
>>> d = 113
>>> r = n/d
>>> result = "ratio: {0:.6f}".format(r)
>>> result
'ratio: 3.141593'

We assigned the n name to the 355 integer; then we assigned the d name to the 113 integer. Then we assigned the ratio to another variable, r.

We used the format() method for strings to create a new string that we assigned to the variable named result. The format() method starts with a format specification and replaces each {} with a formatted version of an argument value. Inside the {}, we requested item 0 from the collection of arguments. Since Python's indexes always start from zero, this will be the first argument value. We used a format specification of .6f to show a floating-point value (f) with six digits to the right of the decimal point (.6). This formatted number was interpolated into the overall string and the resulting string was given the name result.

The last expression in the sequence of statements, result, is very simple. The result of this trivial expression is the value of the variable. It's a string that the REPL prints for us. We can use a similar technique to print the values of intermediate results such as the r variable. We'll often make heavy use of intermediate variables in order to expose the details of a calculation.
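The following sketch shows a few more variations on the format() method. The variable names are illustrative; the positional {0}, {1} references and the named {name} references are both standard features of str.format().

```python
n, d = 355, 113
r = n / d

six_places = "ratio: {0:.6f}".format(r)            # 'ratio: 3.141593'
two_places = "ratio: {0:.2f}".format(r)            # 'ratio: 3.14'
several = "{0:d}/{1:d} = {2:.4f}".format(n, d, r)  # '355/113 = 3.1416'
named = "{name}: {value:.3f}".format(name="ratio", value=r)  # 'ratio: 3.142'

print(six_places)
print(two_places)
print(several)
print(named)
```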

Using the sequence collections: strings

Python strings are a sequence of Unicode characters. We have a variety of quoting rules for strings. Here are two examples:

>>> 'String with " inside'
'String with " inside'
>>> "String's methods"
"String's methods"

We can either use quotes or apostrophes to delimit a string. In the likely event that a string contains both quotes and apostrophes, we can use a \' or \" to embed some punctuation; this is called an escape sequence. The initial \ escapes from the normal meaning of the next character. The following is an example showing the complicated quotes and escapes:

>>> "I said, \"Don't touch.\""
'I said, "Don\'t touch."'

We used one set of quotes to enter the string. We used the escaped quotes in the string. Python responded with its preferred syntax; the canonical form for a string will generally use apostrophes to delimit the string overall.

Another kind of string that we'll encounter frequently is a byte string. Unlike a normal string that uses all the available Unicode characters, a byte string is limited to single-byte values. These can be shown using hexadecimal numeric codes or, for the 95 byte values that correspond to printable ASCII characters, an ASCII character instead of a numeric value.

Here are two examples of byte strings:

>>> b'\x08\x09\x0a\x0c\x0d\x0e\x0f'
b'\x08\t\n\x0c\r\x0e\x0f'
>>> b'\x41\x53\x43\x49\x49'
b'ASCII'

In the first example, we provided hexadecimal values using the \xnn syntax for each byte. The prefix of \x means that the following values will be in base 16. We write base 16 values using the digits 0-9 along with the letters a-f. We provided seven values from \x08 to \x0f. Python replies using a canonical notation; our input follows more relaxed rules than those of Python's output. The canonical syntax is different for three important byte values: the tab character, \x09, can also be entered as \t; the newline character is most commonly entered as \n rather than \x0a; and the carriage return character, \r, is shorter than \x0d.

In the second example, we also provided some hexadecimal values that overlap with some of the ASCII characters. Python's canonical form shows the ASCII characters instead of the hexadecimal values. This demonstrates that, for some byte values, ASCII characters are a handy shorthand.

In some applications, we'll have trouble telling a Unicode string, 'hello', from a byte string, b'hello'. We can add a u prefix, as in u'hello', in order to clearly state that this is a string of Unicode characters and not a string of bytes.
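One way to see the difference between the two kinds of strings is to convert between them. This sketch uses the encode() and decode() methods with the UTF-8 encoding; the choice of UTF-8 and the sample word are assumptions made for this example.

```python
text = "caf\u00e9"              # four Unicode characters, ending in é
data = text.encode("utf-8")     # five bytes: é needs two bytes in UTF-8

print(len(text), repr(text))
print(len(data), repr(data))

round_trip = data.decode("utf-8")   # back to a Unicode string

# The escape sequences and the hexadecimal codes name the same byte values.
same_tab = b"\x09" == b"\t"
# The u prefix is optional; both literals are the same Unicode string.
same_text = u"hello" == "hello"
```

A Unicode string and a byte string never compare as equal, even when they look alike, so 'hello' == b'hello' is False.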

As a string is a collection of individual Unicode characters, we can extract the characters from a string using the character's index positions. Here are a number of examples:

>>> word = 'retackling'
>>> word[0]
'r'
>>> word[-1]
'g'
>>> word[2:6]
'tack'
>>> word[-3:]
'ing'

We've created a string, which is a sequence object. Sequence objects have items that can be addressed by their position or index. In position 0, we see the first item in the sequence, the 'r' character.

Sequences can also be indexed from the right to left using negative numbers. Position -1 is the last (rightmost) item in a sequence. Index position -2 is next-to-rightmost.

We can also extract a slice from a sequence. This is a new sequence that is copied from the original sequence. When we take items in positions 2 to 6, we get four characters with index values 2, 3, 4, and 5. Note that a slice includes the first position and never includes the last specified position; it's an up to but not including rule. Mathematicians call it a half-open interval and write it as [2, 6) or sometimes [2, 6[. We can use the following set comprehension rule to understand how the interval works:

$$[2, 6) = \{\, x \mid 2 \le x < 6 \,\}$$
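The half-open rule can be checked directly in Python. This is a small sketch; the names start, stop, and positions are illustrative only.

```python
word = "retackling"
start, stop = 2, 6

piece = word[start:stop]
# The index positions that the slice covers: start is included, stop is not.
positions = {i for i in range(len(word)) if start <= i < stop}

print(piece, sorted(positions))

# Two consequences of the half-open rule:
length_matches = len(piece) == stop - start      # the length is stop - start
rejoins = word[:4] + word[4:] == word            # adjacent slices never overlap
```

Because the stop position is excluded, word[:4] and word[4:] split the string cleanly with no character repeated or lost.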

All of the sequence collections allow us to count occurrences of an item and locate the index of an item. The following are some examples that show the method syntax and the two universal methods that apply to sequences:

>>> word.count('a')
1
>>> word.index('t')
2
>>> word.index('z') 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found

We've counted the number of items that match a particular value. We've also asked for the position of a given letter. This returns a numeric value for the index of the item equal to 't'.

String sequences have dozens of other methods to create new strings in various ways. We can do a large number of sophisticated manipulations.

Note that a string is an immutable object. We can't replace a character in a string. We can only build new strings from the old strings.
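Here's a brief sketch of what immutability means in practice. The attempt to assign to one character fails, so we build new strings from slices or from string methods instead. The variable names are illustrative.

```python
word = "retackling"

try:
    word[0] = "R"              # strings don't support item assignment
    replaced_in_place = True
except TypeError:
    replaced_in_place = False

capitalized = "R" + word[1:]           # a new string built from a slice
shorter = word.replace("re", "", 1)    # a new string; word itself is unchanged

print(replaced_in_place, capitalized, shorter, word)
```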

Using other common sequences: tuples and lists

We can create two other common kinds of sequences: the list and the tuple. A tuple is a fixed-length sequence of items. We often use tuples for simple structures such as pairs (latitude, longitude) or triples (r, g, b). We write a literal tuple by enclosing the items in ()s. It looks as shown in the following:

>>> ultramarine_blue = (63, 38, 191)

We've created a three-tuple or triple with some RGB values that comprise a color.

Python's assignment statement can tease a tuple into its individual items. Here's an example:

>>> red, green, blue = ultramarine_blue
>>> red
63
>>> blue
191

This multiple-variable assignment works well with tuples as a tuple has a fixed size. We can also address individual items of a tuple with expressions such as ultramarine_blue[0]. Slicing a tuple is perfectly legal; however, it's semantically a little murky. Why would we use ultramarine_blue[:2] to create a pair from the red and green channels?

A list is a variable-length sequence of items. This is a mutable object and we can insert, append, remove, and replace items in the list. This is one of the profound differences between the tuple and list sequences. A tuple is immutable; once we've built it, we can't change it. A list is mutable.
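The contrast between the two sequences can be sketched as follows. The coordinate pair and the readings are made-up sample values for this illustration.

```python
point = (51.4778, -0.0015)     # a (latitude, longitude) pair

try:
    point[0] = 0.0             # tuples don't support item assignment
    tuple_changed = True
except TypeError:
    tuple_changed = False

readings = [8.04, 6.95]
readings.append(8.81)          # [8.04, 6.95, 8.81]
readings.insert(2, 7.58)       # [8.04, 6.95, 7.58, 8.81]
readings[0] = 8.05             # [8.05, 6.95, 7.58, 8.81]
readings.remove(6.95)          # [8.05, 7.58, 8.81]

print(tuple_changed, readings)
```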

The following is an example of a list that we can tweak in order to correct the errors in the data:

>>> samples = [8.04, 6.95, 0, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82]
>>> samples[2]= 7.58
>>> samples.append(5.68)
>>> samples
[8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68] 
>>> sum(samples)
82.51000000000002
>>> round(sum(samples)/len(samples),2)
7.5

We've created a list object, samples, and initialized it with 10 values. We've set the value with an index of two, replacing the zero item with 7.58. We've appended an item at the end of the list.

We've also shown two handy functions that apply to all sequences. However, they're particularly useful for lists. The sum() function adds up the values, reducing the list to a single value. The len() function counts the items, also reducing the list to a single value.

Note the awkward value shown for the sum; this is an important feature of floating-point numbers. In order to be really fast, they're finite. As they have a limited number of bits, they're only an approximation. Therefore, sometimes, we'll see some consequences of working with approximations.

Tip

Floating-point numbers aren't mathematical abstractions.

They're finite approximations. Sometimes, you'll see tiny error values.
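A classic demonstration of these tiny errors is shown below, along with two common ways to cope: comparing within a tolerance, or switching to the decimal module when exact decimal arithmetic matters. The tolerance of 1e-9 is an arbitrary choice for this sketch.

```python
from decimal import Decimal

naive = 0.1 + 0.2
exactly_equal = naive == 0.3                 # False: both sides are approximations
close_enough = abs(naive - 0.3) < 1e-9       # True: compare within a tolerance

decimal_total = Decimal("0.1") + Decimal("0.2")
decimal_equal = decimal_total == Decimal("0.3")   # True: exact decimal arithmetic

print(repr(naive), exactly_equal, close_enough, decimal_total, decimal_equal)
```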

One other interesting operator for sequences is the in comparison:

>>> 7.24 in samples
True

This checks whether a given item is found somewhere in the sequence. If we want the index of a given item, we can use the index method:

>>> samples.index(7.24)
6

Using the dictionary mapping

The general idea of mapping is the association between keys and values. We might have a key of 'ultramarine blue' associated with a value of the tuple, (63, 38, 191). We might have a key of 'sunset orange' associated with a tuple of (254, 76, 64). We can represent this mapping of string-to-tuple with a Python dictionary object, as follows:

>>> colors = {'ultramarine blue': (63, 38, 191), 'sunset orange': (254, 76, 64) }

We've replaced the words "associated with" with : and wrapped the whole in {}s in order to create a proper dictionary. This is a mapping from color strings to RGB tuples.

A dictionary is mutable; we can add new key-value pairs and remove key-value mappings from it. Of course, we can interrogate a dictionary to see what keys are present and what value is associated with a key.

>>> colors['olive green'] = (181, 179, 92)
>>> colors.pop('sunset orange')
(254, 76, 64)
>>> colors['ultramarine blue']
(63, 38, 191)
>>> 'asparagus' in colors
False

The same syntax will replace an existing key in a dictionary with a new value. We can pop a key from the dictionary; this will both update the dictionary to remove the key value pair and return the value associated with the key. When we use syntax such as colors['ultramarine blue'], we'll retrieve the value associated with a given key.

The in operator checks to see whether the given item is one of the keys of the mapping. In our example, we didn't provide a mapping for the name 'asparagus'.

We can retrieve the keys, the values, and the key value pairs from a mapping with methods of the class:

>>> sorted(colors.items())
[('olive green', (181, 179, 92)), ('ultramarine blue', (63, 38, 191))]

The keys() method returns a view of the keys in the mapping. The values() method returns a view of only the values. The items() method returns a view of two-tuples; each tuple is a key, value pair. We can iterate over these views or hand them to functions such as list() and sorted(). We've applied the sorted() function in this example, as a dictionary doesn't guarantee any particular order for the keys. In many cases, we don't particularly care about the order. In the cases where we need to enforce the order, this is a common technique.
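Here's a short sketch of the three methods applied to a two-color mapping. The for statement unpacks each key, value pair as it goes; the variable names are illustrative.

```python
colors = {
    "ultramarine blue": (63, 38, 191),
    "olive green": (181, 179, 92),
}

names = sorted(colors.keys())        # just the keys, in alphabetical order
triples = sorted(colors.values())    # just the RGB tuples, in tuple order

# items() yields (key, value) pairs; the value is unpacked into three channels.
for name, (red, green, blue) in sorted(colors.items()):
    print("{0}: r={1} g={2} b={3}".format(name, red, green, blue))
```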

Comparing data and using the logic operators

Python implements a number of comparisons. We have the usual ==, !=, <=, >=, <, and > operators. These provide the essential comparison capabilities. The result of a comparison is a boolean object, either True or False.

The boolean objects have their own special logic operators: and, or, and not. These operators can short-circuit the expression evaluation. In the case of and, if the left-hand side expression is False, the final result must be False; therefore, the right-hand side expression is not evaluated. In the case of or, the rules are reversed. If the left-hand side expression is True, the final result is already known to be True, so the right-hand side expression is skipped.

For example, take two variables, sum and count, as follows:

>>> sum
82.51
>>> count
11
>>> mean = count != 0 and sum/count

Let's look closely at the final expression. The left-hand side expression of the and operator is count != 0, which is True. Therefore, the right-hand side expression must be evaluated. Interestingly, the right-hand side object is the final result. A numeric value of approximately 7.5 is the value of the mean variable.

The following is another example to show how the and operator behaves:

>>> sum
0.0
>>> count
0
>>> mean = count != 0 and sum/count

What happens here? The left-hand side expression of the and operator is count != 0, which is False. The right-hand side is not evaluated. No division-by-zero exception is raised by this. The final result is False.
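The two cases can be packaged for comparison. In this sketch, guarded_mean() is the short-circuit idiom from the text, and explicit_mean() is an alternative that uses a conditional expression to return None instead of False when there's no data. Both function names are hypothetical, and the parameter is called total to avoid hiding the built-in sum() function.

```python
def guarded_mean(total, count):
    """The short-circuit idiom: the division only happens when count is nonzero."""
    return count != 0 and total / count


def explicit_mean(total, count):
    """An alternative that returns None, rather than False, for an empty data set."""
    return total / count if count != 0 else None


print(guarded_mean(82.51, 11))
print(guarded_mean(0.0, 0))
print(explicit_mean(0.0, 0))
```

The explicit version is a design choice: a result of None is harder to mistake for a number than False, which Python treats as 0 in arithmetic.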

Using some simple statements

All of the preceding examples focused on one-line expression statements. We entered an expression in REPL, Python evaluated the expression, and REPL helpfully printed the resulting value. While the expression statement is handy for experiments at the REPL prompt, there's one expression statement that agents use a lot, as shown in the following:

>>> print("Hello \N{EARTH}")
Hello ♁

The print() function prints the results on the console. We provided a string with a Unicode character that's not directly available on most keyboards: the EARTH character, ♁, U+2641, which looks different in different fonts.

We'll need the print() function as soon as we stop using interactive Python. Our scripts won't show any results unless we print them.

The other side of printing is the input() function. This will present a prompt and then read a string of input that is typed by a user at the console. We'll leave it to the interested agent to explore the details of how this works.

We'll need more kinds of imperative statements to get any real work done. We've shown two forms of the assignment statement; both will put a label on an object. The following are two examples that put a label on an object:

>>> n, d = 355, 113
>>> pi = n/d

The first assignment statement evaluated the 355, 113 expression and created a tuple object from two integer objects. In some contexts, the surrounding ()s for a tuple are optional; this is one of those contexts. Then, we used multiple assignment to decompose the tuple into its two items and put a label on each object.

The second assignment statement follows the same pattern. The n/d expression is evaluated. It uses true division to create a floating-point result from integer operands. The resulting object has the name pi applied to it by the assignment statement.

Using compound statements for conditions: if

For conditional processing, we use the if statement. Python allows an unlimited number of else-if (elif) clauses, allowing us to build rather complex logic very easily.

For example, here's a statement that determines whether a value, n, is divisible by three, or five, or both:

>>> if n % 3 == 0 and n % 5 == 0:
...     print("fizz-buzz")
... elif n % 3 == 0:
...     print("fizz")
... elif n % 5 == 0:
...     print("buzz")
... else:
...     print(n)

We've written three Boolean expressions. The if statement will evaluate these in top-to-bottom order. If the value of the n variable is divisible by both three and five, the first condition is True and the indented suite of statements is executed. In this example, the indented suite of statements is a single expression statement that uses the print() function.

If the first expression is False, then the elif clauses are examined in order. If none of the elif clauses are true, the indented suite of statements in the else clause is executed.

Remember that the and operator has a short-circuit capability. The first expression may involve as little as evaluating n % 3 == 0. If this subexpression is False, the entire and expression must be False; this means that the suite of the if clause is skipped and the elif clauses are examined. Otherwise, the entire expression must be evaluated.

Notice that Python changes the prompt from >>> at the start of a compound statement to ... to show that more of the statement can be entered. This is a helpful hint. We indent each suite of statements in a clause. We enter a blank line in order to show we're at the very end of the compound statement.

Tip

This longer statement shows us an important syntax rule:

Compound statements rely on indentation. Indent consistently. Use four spaces.

The individual if and elif clauses are separated based on their indentation level. The keywords such as if, elif, and else are not indented. The suite of statements in each clause is indented consistently.
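The same if statement can be wrapped in a function so that it's easy to try with several values. This is a sketch; the function name fizz_buzz is an assumption for this example, and the function returns the text rather than printing it. Function definitions are covered later in this chapter.

```python
def fizz_buzz(n):
    """Classify n by its divisibility by three and five."""
    if n % 3 == 0 and n % 5 == 0:
        return "fizz-buzz"
    elif n % 3 == 0:
        return "fizz"
    elif n % 5 == 0:
        return "buzz"
    else:
        return str(n)


# Each suite is indented four spaces under its clause.
print(fizz_buzz(15), fizz_buzz(9), fizz_buzz(10), fizz_buzz(7))
```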

Using compound statements for repetition: for and while

When we want to process all the items in a list or the lines in a file, we're going to use the for statement. The for statement allows us to specify a target variable, a source collection of values, and a suite of statements. The idea is that each item from the source collection is assigned to the target variable and the suite of statements is executed.

The following is a complete example that computes the variance of some measurements:

>>> samples = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
>>> sum, sum2 = 0, 0
>>> for x in samples:
...     sum += x
...     sum2 += x**2
...
>>> n = len(samples)
>>> var = (sum2-(sum**2/n))/(n-1)

We've started with a list of values, assigned to the samples variable, plus two other variables, sum and sum2, to which we've assigned initial values of 0.

The for statement will iterate through the items in the samples list. An item will be assigned to the target variable, x, and then the indented body of the for statement is executed. We've written two assignment statements that will compute the new values for sum and sum2. These use the augmented assignment statement; using += saves us from writing sum = sum + x.

After the for statement, we are assured that the body has been executed for all values in the source object, samples. We can save the count of the samples in a handy local variable, n. This makes the calculation of the variance slightly more clear. In this example, the variance is about 4.13.

The result is a number that shows how spread out the raw data is. The square root of the variance is the standard deviation. For roughly normal data, we expect about two-thirds of our data points to lie within one standard deviation of the average. We often use variance when comparing two data sets. When we get additional data, perhaps from a different agent, we can compare the averages and variances to see whether the data is similar. If the variances aren't the same, this may reflect that there are different sources and possibly indicate that we shouldn't trust either of the agents that are supplying us this raw data. If the variances are identical, we have another question: are we being fed false information?
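As a cross-check on the hand calculation, here is a small sketch that uses the standard library's statistics module (available since Python 3.4) to compute the same variance, and then counts how many samples lie within one standard deviation of the average. The names avg, std, and within are our own choices for illustration:

```python
import statistics

samples = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

avg = statistics.mean(samples)
var = statistics.variance(samples)   # sample variance, (n-1) denominator
std = statistics.stdev(samples)      # square root of the sample variance

# Samples no more than one standard deviation away from the average.
within = [x for x in samples if abs(x - avg) <= std]

print(round(var, 2), round(std, 2), len(within), len(samples))
```

For this data, 7 of the 11 samples fall within one standard deviation, which is close to the two-thirds rule of thumb.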

The most common use of the for statement is to visit each item in a collection. A slightly less common use is to iterate a finite number of times. We use a range() object to emit a simple sequence of integer values, as follows:

>>> list(range(5))
[0, 1, 2, 3, 4]

This means that we can use a statement such as for i in range(n): in order to iterate n times.

Defining functions

It's often important to decompose large, complex data acquisition and analysis problems into smaller, more solvable problems. Python gives us a variety of ways to organize our software. We have a tall hierarchy that includes packages, modules, classes, and functions. We'll start with function definitions as a way to decompose and reuse functionality. The later missions will require class definitions.

A function—mathematically—is a mapping from objects in a domain to objects in a range. Many mathematical examples map numbers to different numbers. For example, the arctangent function, available as math.atan(), maps a tangent value to the angle that has this tangent value. In many cases, we'll need to use math.atan2(), as our tangent value is a ratio of the lengths of two sides of a triangle; this function maps a pair of values to a single result.
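The difference between the two arctangent functions can be sketched as follows. With math.atan(), we must compute the ratio ourselves, which loses the signs of the two sides and fails when the adjacent side is zero; math.atan2() takes the two lengths separately. The variable names here are only for illustration:

```python
import math

# One-argument form: the ratio 1/1 maps to 45 degrees.
angle_ratio = math.atan(1 / 1)

# The ratio -1/-1 is also 1, so atan() cannot tell the quadrants apart.
angle_lost = math.atan(-1 / -1)

# Two-argument form: atan2(y, x) keeps the signs and finds the right quadrant.
angle_quadrant = math.atan2(-1, -1)

# atan2() also handles a zero-length adjacent side; 1/0 would raise an error.
angle_vertical = math.atan2(1, 0)

print(math.degrees(angle_ratio), math.degrees(angle_lost))
print(math.degrees(angle_quadrant), math.degrees(angle_vertical))
```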

In Python terms, a function has a name and a collection of parameters and it may return a distinct value. If we don't explicitly return a resulting value, the function returns the special None object.

Here's a handy function to average the values in a sequence:

>>> def mean(data):
...     if len(data) == 0:
...         return None
...     return sum(data)/len(data)

This function expects a single parameter, a sequence of values to average. When we evaluate the function, the argument value will be assigned to the data parameter. If the sequence is empty, we'll return the special None object in order to indicate that there's no average when there's no data.

If the sequence isn't empty, we'll divide the sum by the count to compute the average. Since we're using true division (the / operator), this will return a floating-point value even if the sequence is all integers.

The following is how it looks when we use our newly minted function combined with built-in functions:

>>> samples = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
>>> round(mean(samples), 2)
7.5

We've computed the mean of the values in the samples variable using our mean() function. We've applied the round() function to the resulting value to show the mean rounded to two decimal places.
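The two edge cases described above, an empty sequence and a sequence of integers, can be checked with a short script. This sketch repeats the mean() definition so that it stands alone:

```python
def mean(data):
    """Average of a sequence, or None when there is no data."""
    if len(data) == 0:
        return None
    return sum(data) / len(data)

empty_result = mean([])            # no data, so no average
integer_result = mean([1, 2, 3, 4])  # true division gives a float

print(empty_result, integer_result)
```

Code that calls mean() needs to check for None before it does arithmetic with the result.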

Creating script files

We shouldn't try to do all of our data gathering and analysis by entering the Python code interactively at the >>> prompt. It's possible to work this way; however, the copying and pasting is tedious and error-prone. It's much better to create a Python script that will gather, analyze, and display useful intelligence assets that we've gathered (or purchased).

A Python script is a file of Python statements. While it's not required, it's helpful to be sure that the file's name is a valid Python symbol that is created with letters, numbers, and _'s. It's also helpful if the file's name ends with .py.

Here's a simple script file that shows some of the features that we've been looking at:

import random, math
samples = int(input("How many samples: "))
inside = 0
for i in range(samples):
    if math.hypot(random.random(), random.random()) <= 1.0:
        inside += 1
print(inside, samples, inside/samples, math.pi/4)

This script file can be given a name such as example1.py. The script will use the input() function to prompt the user for a number of random samples. Since the result of input() is a string, we'll need to convert the string to an integer in order to be able to use it. We've initialized a variable, inside, to zero.

The for statement will execute the indented body the number of times given by the value of samples. The range() object will generate samples distinct integer values. In the for statement, we've used an if statement to filter some randomly generated values. The values we're examining are the result of math.hypot(random.random(), random.random()). What is this value? It's the hypotenuse of a right-angled triangle with sides that are selected randomly. We'll leave it to each field agent to rewrite this script in order to assign and print some intermediate variables to show precisely how this calculation works.

We're looking at a triangle with one vertex at (0,0) and another at (x,y). The third vertex could be at either (0,y) or (x,0); the results don't depend on how we visualize the triangle. Since the triangle sides are selected randomly, the end point of the hypotenuse can be any point from (0,0) to (1,1); the length of the hypotenuse varies between 0 and √2.

Statistically, we expect that most of the points should lie in a circle with a radius of one. How many should lie in this quarter circle? Interestingly, the random distribution will have π/4 (about 78.5 percent) of the samples in the circle; the remaining 1 − π/4 will be outside the circle.
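One possible way to expose the intermediate values is sketched below. This is only one way to restructure the script: the function name estimate_quarter_circle() and the seed parameter are our own additions, and the seed is there so that a run can be repeated exactly.

```python
import math
import random

def estimate_quarter_circle(samples, seed=None):
    """Fraction of random points in the unit square that lie inside
    the quarter circle of radius 1 centered at (0, 0)."""
    rng = random.Random(seed)   # private generator, repeatable with a seed
    inside = 0
    for i in range(samples):
        x = rng.random()               # horizontal side of the triangle
        y = rng.random()               # vertical side of the triangle
        distance = math.hypot(x, y)    # length of the hypotenuse
        if distance <= 1.0:
            inside += 1
    return inside / samples

estimate = estimate_quarter_circle(100000, seed=42)
print(estimate, math.pi / 4)
```

With 100,000 samples, the estimate should agree with math.pi/4 to about two decimal places.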

When working in counterintelligence, the data that we're providing needs to be plausible. If we're going to mislead, our fake data needs to fit the basic statistical rules. A careful study of history will show how Operation Mincemeat was used to deceive Axis powers during World War II. What's central to this story is the plausibility of every nuance of the data that is supplied.

Mission One – upgrade Beautiful Soup

It seems like the first practical piece of software that every agent needs is Beautiful Soup. We often make extensive use of this to extract meaningful information from HTML web pages. A great deal of the world's information is published in the HTML format. Sadly, browsers must tolerate broken HTML. Even worse, website designers have no incentive to make their HTML simple. This means that HTML extraction is something every agent needs to master.

Upgrading the Beautiful Soup package is a core mission that sets us up to do more useful espionage work. First, check the PyPI description of the package. Here's the URL: https://pypi.python.org/pypi/beautifulsoup4. The language is described as Python 3, which is usually a good indication that the package will work with any release of Python 3.

To confirm the Python 3 compatibility, track down the source of this at the following URL:

http://www.crummy.com/software/BeautifulSoup/.

This page simply lists Python 3 without any specific minor version number. That's encouraging. We can even look at the following link to see more details of the development of this package:

https://groups.google.com/forum/#!forum/beautifulsoup

The installation is generally just as follows:

MacBookPro-SLott:Code slott$ sudo pip3.4 install beautifulsoup4

Windows agents can omit the sudo prefix.

This will use the pip application to download and install BeautifulSoup. The output will look as shown in the following:

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.3.2.tar.gz (143kB)
    100% |████████████████████████████████| 143kB 1.1MB/s 
Installing collected packages: beautifulsoup4
  Running setup.py install for beautifulsoup4
Successfully installed beautifulsoup4-4.3.2

Note that Pip 7 on Macintosh uses the █ character instead of # to show status. The installation was reported as successful. That means we can start using the package to analyze the data.

We'll finish this mission by gathering and parsing a very simple page of data.

We need to help agents make the sometimes dangerous crossing of the Gulf Stream between Florida and the Bahamas. Often, Bimini is used as a stopover; however, some faster boats can go all the way from Florida to Nassau in a single day. On a slower boat, the weather can change and an accurate multi-day forecast is essential.

The Georef code for this area is GHLL140032. For more information, look at the 25°32′N 79°46′W position on a world map. This will show the particular stretch of ocean for which we need to supply forecast data.

Here's a handy URL that provides weather forecasts for agents who are trying to make the passage between Florida and the Bahamas:

http://forecast.weather.gov/shmrn.php?mz=amz117&syn=amz101.

This page includes a weather synopsis for the overall South Atlantic (the amz101 zone) and a day-by-day forecast specific to the Bahamas (the amz117 zone). We want to trim this down to the relevant text.

Getting an HTML page

The first step in using BeautifulSoup is to get the HTML page from the US National Weather Service and parse it into a proper document structure. We'll use urllib to get the document and create a Soup structure from that. Here's the essential processing:

from bs4 import BeautifulSoup
import urllib.request
query = "http://forecast.weather.gov/shmrn.php?mz=amz117&syn=amz101"
# Open the URL, read all of the bytes, and parse them.
with urllib.request.urlopen(query) as amz117:
    document = BeautifulSoup(amz117.read())

We've opened a URL and assigned the file-like object to the amz117 variable. We've done this in a with statement. Using with will guarantee that all network resources are properly disconnected when the execution leaves the indented body of the statement.

In the with statement, we've read the entire document available at the given URL. We've provided the sequence of bytes to the BeautifulSoup parser, which creates a parsed Soup data structure that we can assign to the document variable.

The with statement makes an important guarantee: when the indented body is complete, the resource manager will close. In this example, the indented body is a single statement that reads the data from the URL and parses it to create a BeautifulSoup object. The resource manager is the connection to the Internet based on the given URL. We want to be absolutely sure that all operating system (and Python) resources that make this open connection work are properly released. This release-when-finished guarantee is what the with statement offers.
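The guarantee can be demonstrated without touching the network. In this sketch, an io.BytesIO object stands in for the connection (this substitution is our assumption, chosen because BytesIO is also a file-like resource manager), and the try and finally version shows roughly what the with statement does on our behalf:

```python
import io

# A with statement closes the resource when the body finishes normally.
with io.BytesIO(b"<html><body>forecast</body></html>") as source:
    data = source.read()
closed_after_with = source.closed

# Roughly the same processing, written out by hand.
source2 = io.BytesIO(b"<html><body>forecast</body></html>")
try:
    data2 = source2.read()
finally:
    source2.close()

# The resource is also closed when the body raises an exception.
source3 = io.BytesIO(b"<html>")
try:
    with source3:
        raise ValueError("parsing failed")
except ValueError:
    pass
closed_after_error = source3.closed

print(closed_after_with, source2.closed, closed_after_error)
```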

Navigating the HTML structure

HTML documents are a mixture of tags and text. The parsed structure is iterable, allowing us to work through text and tags using the for statement. Additionally, the parsed structure contains numerous methods to search for arbitrary features in the document.

Here's the first example of using attribute names and methods to pick apart a document:

content = document.body.find('div', id='content').div

When we use a tag name, such as body, as an attribute name, this is a search request for the first occurrence of that tag in the given container. We've used document.body to find the <body> tag in the overall HTML document.

The find() method finds the first matching instance using more complex criteria than the tag's name. In this case, we've asked to find <div id="content"> in the body tag of the document. In this identified <div>, we need to find the first nested <div> tag. This division has the synopsis and forecast.

The content in this division consists of a mixed sequence of text and tags. A little searching shows us that the synopsis text is the fifth item. Since Python sequences are based at zero, this has an index of four in the <div>. We'll use the contents attribute of a given object to identify tags or text blocks by position in a document object.

The following is how we can get the synopsis and forecast. Once we have the forecast, we'll need to create an iterator for each day in the forecast:

synopsis = content.contents[4]
forecast = content.contents[5]
strong_list = list(forecast.findAll('strong'))
timestamp_tag, *forecast_list = strong_list

We've extracted the synopsis as a block of text. This page's HTML has a quirky feature: an <hr> tag that contains the forecast. This is, in principle, invalid HTML. Even though it seems invalid, browsers tolerate it. It has the data that we want, so we're forced to work with it as we find it.

In the forecast <hr> tag, we've used the findAll() method to create a list of the sequence of <strong> tags. These tags are interleaved between blocks of text. Generally, the text in the tag tells us the day and the text between the <strong> tags is the forecast for that day. We emphasize generally as there's a tiny, but important special case.

Due to the special case, we've split the strong_list sequence into a head and a tail. The first item in the list is assigned to the timestamp_tag variable. All the remaining items are assigned to the forecast_list variable. We can use the value of timestamp_tag.string to recover the string value in the tag, which will be the timestamp for the forecast.
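The head and tail split relies on the starred target in an assignment statement. Here's a small sketch of how it behaves; plain strings stand in for the <strong> tag objects, and the labels are made-up placeholders rather than values taken from the page:

```python
# Placeholder strings standing in for the <strong> tags found in the page.
strong_list = ["timestamp", "TONIGHT", "WED", "THU"]

# The first item goes to the plain target; the starred target collects
# all of the remaining items into a new list.
timestamp_tag, *forecast_list = strong_list

# The same split written with indexing and slicing.
head, tail = strong_list[0], strong_list[1:]

print(timestamp_tag, forecast_list)
```

If the source list has only one item, the starred target is assigned an empty list.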

Your extension to this mission is to parse this with datetime.datetime.strptime(). Replacing strings with proper datetime objects will improve the overall utility of the data.
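As a starting point for this extension, here's a hedged sketch. The timestamp text shown is a hypothetical example of the general shape of a marine forecast timestamp; the actual text on the page may differ, so the format string will need to be adjusted to match what timestamp_tag.string really contains. We've dropped the time zone abbreviation before parsing because strptime() doesn't reliably recognize abbreviations such as EDT. The sketch also assumes English day and month names.

```python
from datetime import datetime

# Hypothetical timestamp text; check the real page before relying on this.
raw = "1030 AM EDT TUE JUN 30 2015"

fields = raw.split()
zone = fields.pop(2)          # set the time zone abbreviation aside
text = " ".join(fields)       # "1030 AM TUE JUN 30 2015"

# %I%M is the 12-hour time, %p is AM or PM, %a and %b are the
# abbreviated day and month names, which strptime() matches without
# regard to case.
stamp = datetime.strptime(text, "%I%M %p %a %b %d %Y")

print(stamp.isoformat(), zone)
```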

The value of the forecast_list variable is the list of remaining <strong> tags. In the document, each of these tags is followed by a block of forecast text. Here's how we can extract these pairs from the overall document:

for strong in forecast_list:
    desc = strong.string.strip()
    print(desc, strong.nextSibling.string.strip())

We've written a loop to step through the rest of the <strong> tags in the forecast_list object. Each item is a highlighted label for a given day. The value of strong.nextSibling will be the document object after the <strong> tag. We can use strong.nextSibling.string to extract the string from this block of text; this will be the details of the forecast.

We've used the strip() method of the string to remove extraneous whitespace around the forecast elements. This makes the resulting text block more compact.

With a little more cleanup, we can have a tidy forecast that looks similar to the following:

TONIGHT 2015-06-30
--------------------
E TO SE WINDS 10 TO 15 KT...INCREASING TO 15 TO 20 KT
 LATE. SEAS 3 TO 5 FT ATLC EXPOSURES...AND 2 FT OR LESS
 ELSEWHERE.
WED 2015-07-01
--------------------
E TO SE WINDS 15 TO 20 KT...DIMINISHING TO 10 TO 15 KT
 LATE. SEAS 4 TO 6 FT ATLC EXPOSURES...AND 2 FT OR
 LESS ELSEWHERE.

We've stripped away a great deal of HTML overhead. We've reduced the forecast to the barest facts. With a little more fiddling, we can get it down to a pretty tiny block of text. We might want to represent this in JavaScript Object Notation (JSON). We can then encrypt the JSON string before the transmission. Then, we could use steganography to embed the encrypted text in another kind of document in order to transmit to a friendly ship captain who is working the route between Key Biscayne and Bimini. It may look as if we're just sending each other pictures of rainbow butterfly unicorn kittens.
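The JSON step can be sketched with the standard library's json module. The structure shown here, a list of dictionaries with day, date, and forecast keys, is only one possible layout that we've assumed for illustration; the text is abbreviated from the sample output shown earlier:

```python
import json

# One possible layout for the forecast; the key names are our own choice.
forecast = [
    {"day": "TONIGHT", "date": "2015-06-30",
     "forecast": "E TO SE WINDS 10 TO 15 KT...INCREASING TO 15 TO 20 KT LATE."},
    {"day": "WED", "date": "2015-07-01",
     "forecast": "E TO SE WINDS 15 TO 20 KT...DIMINISHING TO 10 TO 15 KT LATE."},
]

# Compact separators trim the whitespace that json.dumps() adds by default.
message = json.dumps(forecast, separators=(",", ":"))

# The receiving agent can rebuild the original structure.
restored = json.loads(message)

print(len(message), restored[0]["day"])
```

The resulting string is what we would encrypt before embedding it in another document.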

This mission demonstrates that we can use Python 3, urllib, and BeautifulSoup. Now, we've got a working environment.

Mission to expand our toolkit

Now that we know our Python 3 is up-to-date, we can add some additional tools. We'll be using several advanced packages to help in acquiring and analyzing raw data.

We're going to need to tap into the social network. There are a large number of candidate social networks that we could mine for information. We'll start with Twitter. We can access the Twitter feed using direct API requests. Rather than working through the protocols at a low level, we'll make use of a Python package that provides some simplifications.

Our first choice is the Twitter API project on PyPI, as follows: https://pypi.python.org/pypi/TwitterAPI/2.3.3.

This can be installed using sudo pip3.4 install twitterapi.

We have some alternatives, one of which is the Twitter project from sixohsix. Here's the URL: https://pypi.python.org/pypi/twitter/1.17.0.

We can install this using sudo pip3.4 install twitter.

We'll focus on the twitterapi package. Here's what happens when we do the installation:

MacBookPro-SLott:Code slott$ sudo -H pip3.4 install twitterapi
Password:
Collecting twitterapi
  Downloading TwitterAPI-2.3.3.1.tar.gz
Collecting requests (from twitterapi)
  Downloading requests-2.7.0-py2.py3-none-any.whl (470kB)
    100% |████████████████████████████████| 471kB 751kB/s 
Collecting requests-oauthlib (from twitterapi)
  Downloading requests_oauthlib-0.5.0-py2.py3-none-any.whl
Collecting oauthlib>=0.6.2 (from requests-oauthlib->twitterapi)
  Downloading oauthlib-0.7.2.tar.gz (106kB)
    100% |████████████████████████████████| 106kB 1.6MB/s 
Installing collected packages: requests, oauthlib, requests-oauthlib, twitterapi
  Running setup.py install for oauthlib
  Running setup.py install for twitterapi
Successfully installed oauthlib-0.7.2 requests-2.7.0 requests-oauthlib-0.5.0 twitterapi-2.3.3.1

We used the sudo -H option, as required by Mac OS X. Windows agents would omit this. Some Linux agents can omit the -H option as it may be the default behavior.

Note that four packages were installed. The twitterapi package required the requests and requests-oauthlib packages. The requests-oauthlib package, in turn, required the oauthlib package, which was downloaded automatically for us.

The missions for using this package start in Chapter 3, Following the Social Network. For now, we'll count the installation as a successful preliminary mission.

Scraping data from PDF files

In addition to HTML, a great deal of data is packaged as PDF files. PDF files are designed to meet the requirement of producing printed output consistently across a variety of devices. When we look at the structure of these documents, we find that we have a complex and compressed storage format. In this structure, there are fonts, rasterized images, and descriptions of text elements in a simplified version of the PostScript language.

There are several issues that come into play here, as follows:

The files are quite complex. We don't want to tackle the algorithms that are required to read the streams encoded in the PDF since we're focused on the content.
The content is organized for tidy printing. What we perceive as a single page of text is really just a collection of text blobs. We've been taught how to identify the text blobs as headers, footers, sidebars, titles, code examples, and other semantic features of a page. This is actually a pretty sophisticated bit of pattern matching. There's an implicit agreement between readers and book designers to stick to some rules to place the content on the pages.
It's possible that a PDF can be created from a scanned image. This will require Optical Character Recognition (OCR) in order to recover useful text from the image.

In order to extract text from a PDF, we'll need to use a tool such as the PDF Miner 3k. Look for this package at https://pypi.python.org/pypi/pdfminer3k/1.3.0.

An alternative is the pdf package. You can look at https://pypi.python.org/pypi/PDF/1.0 for the package.

In Chapter 4, Dredging up History, we'll look at the kinds of algorithms that we'll need to write in order to extract useful content from PDF files.

However, for now, we need to install this package in order to be sure that we can process PDF files. We'll use sudo -H pip3.4 install pdfminer3k to do the installation. The output looks as shown in the following:

MacBookPro-SLott:Code slott$ sudo -H pip3.4 install pdfminer3k
Collecting pdfminer3k
  Downloading pdfminer3k-1.3.0.tar.gz (9.7MB)
    100% |████████████████████████████████| 9.7MB 55kB/s 
Collecting pytest>=2.0 (from pdfminer3k)
  Downloading pytest-2.7.2-py2.py3-none-any.whl (127kB)
    100% |████████████████████████████████| 131kB 385kB/s 
Collecting ply>=3.4 (from pdfminer3k)
  Downloading ply-3.6.tar.gz (281kB)
    100% |████████████████████████████████| 282kB 326kB/s 
Collecting py>=1.4.29 (from pytest>=2.0->pdfminer3k)
  Downloading py-1.4.30-py2.py3-none-any.whl (81kB)
    100% |████████████████████████████████| 86kB 143kB/s 
Installing collected packages: py, pytest, ply, pdfminer3k
  Running setup.py install for ply
  Running setup.py install for pdfminer3k
Successfully installed pdfminer3k-1.3.0 ply-3.6 py-1.4.30 pytest-2.7.2

Windows agents will omit the sudo -H prefix. This is a large and complex installation. The package itself is pretty big (almost 10 MB). It requires additional packages such as pytest and py. It also incorporates ply, which is an interesting tool in its own right.

Interestingly, the documentation for how to use this package can be hard to locate. Here's the link to locate it:

http://www.unixuser.org/~euske/python/pdfminer/index.html.

Note that the documentation is older than the actual package as it says (in red) Python 3 is not supported. However, the pdfminer3k project clearly states that pdfminer3k is a Python 3 port of pdfminer. While the software may have been upgraded, some of the documentation still needs work.

We can learn more about ply at https://pypi.python.org/pypi/ply/3.6. The lex and yacc summary may not be too helpful for most agents. These terms refer to the two classic programs that are widely used to create the tools that support software development.

Sidebar on the ply package

When we work with the Python language, we rarely give much thought to how the Python program actually works. We're mostly interested in results, not the details of how Python language statements lead to useful processing by the Python program. The ply package solves the problem of translating characters to meaningful syntax.

Agents that are interested in the details of how Python works will need to consider the source text that we write. When we write the Python code, we're writing a sequence of intermingled keywords, symbols, operators, and punctuation. These various language elements are just sequences of Unicode characters that follow a strict set of rules. One wrong character and we get errors from Python.

There's a two-tier process to translate a .py file of the source text to something that is actionable.

At the lowest tier, an algorithm must do the lexical scanning of our text. A lexical scanner identifies the keywords, symbols, literals, operators, and punctuation marks; the generic term for these various language elements is tokens. A classic program to create lexical scanners is called lex. The lex program uses a set of rules to transform a sequence of characters into a sequence of higher-level tokens.

The process of compiling Python tokens to useful statements is the second tier. The classic program for this is called yacc (Yet Another Compiler Compiler). The yacc language contained the rules to interpret a sequence of tokens as a valid statement in the language. Associated with the rules to parse a target language, the yacc language also contained statements for an action to be taken when the statement was recognized. The yacc program compiles the rules and statements into a new program that we call a compiler.

The ply Python package implements both tiers. We can use it to define a lexical scanner and a parser that are based on the classic lex and yacc concepts. Software developers will use a tool such as ply to process statements in a well-defined formal language.
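We can watch the two tiers at work on Python source itself. This sketch doesn't use ply; instead, it uses the standard library's tokenize module to show the lexical scanning tier and the ast module to show the parsing tier. The one-line source string is just an example:

```python
import ast
import io
import tokenize

source = "n = 3 + 4\n"

# Tier one: lexical scanning turns characters into tokens.
# We keep only the names, operators, and numbers for display.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER)
]
print(tokens)

# Tier two: parsing recognizes the tokens as a statement.
tree = ast.parse(source)
statement_kind = type(tree.body[0]).__name__
print(statement_kind)
```

The scanner reports five tokens, and the parser recognizes them as a single assignment statement.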

Building our own gadgets

Sometimes, we need to move beyond the data that is readily available on computers. We might need to build our own devices for espionage. There are a number of handy platforms that we can use to build our own sensors and gadgets. These are all single-board computers. These computers have a few high-level interfaces, often USB-based, along with a lot of low-level interfaces that allow us to create simple and interactive devices.

To work with these, we'll create software on a larger computer, such as a laptop or desktop system. We'll upload our software to our single-board computer and experiment with the gadget that we're building.

There are a variety of these single-board computers. Two popular choices are the Raspberry Pi and the Arduino. One of the notable differences between these devices is that a Raspberry Pi runs a small GNU/Linux operating system, whereas an Arduino doesn't offer much in the way of OS features.

Both devices allow us to create simple, interactive devices. There are ways to run Python on Raspberry Pi using the RPi GPIO module. Our gadget development needs to focus on Arduino as there is a rich variety of hardware that we can use. We can find small, robust Arduinos that are suitable for harsh environments.

A simple Arduino Uno isn't the only thing that we'll need. We'll also need some sensors and wires. We'll save the detailed shopping list for Chapter 5, Data Collection Gadgets. At this point, we're only interested in software tools.

Getting the Arduino IDE

To work with Arduino, we'll need to download the Arduino Integrated Development Environment (IDE). This will allow us to write programs in the Arduino language, upload them to our Arduino, and do some basic debugging. An Arduino program is called a sketch.

We'll need to get the Arduino IDE from https://www.arduino.cc/en/Main/Software. On the right-hand side of this web page, we can pick the OS for our working computer and download the proper Arduino tool set. Some agents prefer the idea of making a contribution to the Arduino foundation. However, it's possible to download the IDE without making a contribution.

For Mac OS X, the download will be a .ZIP file. This will unpack into the IDE application; we can copy this to our Applications folder and we're ready to go.

For Windows agents, we can download a .MSI file that will do the complete installation. This is preferred for computers where we have full administrative access. In some cases, where we may not have administrative rights, we'll need to download the .ZIP file, which we can unpack in a C:\Arduino directory.

We can open the Arduino application to see an initial sketch. The screen looks similar to the following screenshot:

The sketch name will be based on the date on which you run the application. Also, the communications port shown in the lower right-hand corner may change, depending on whether your Arduino is plugged in.

We don't want to do anything more than be sure that the Arduino IDE program runs. Once we see that things are working, we can quit the IDE application.

An alternative is the Fritzing application. Refer to http://www.fritzing.org for more information. We can use this software to create engineering diagrams and lists of parts for a particular gadget. In some cases, we can also use this to save software sketches that are associated with a gadget. The Arduino IDE is used by the Fritzing tool. Go to http://fritzing.org/download/ to download Fritzing.

Getting a Python serial interface

In many cases, we'll want to have a more complex interaction between a desktop computer and an Arduino-based sensor. This will often lead to using the USB devices on our computer from our Python applications. If we want to interact directly with an Arduino (or other single-board computer) from Python, we'll need PySerial. An alternative is the USPP (Universal Serial Port Python) library. This allows us to communicate without having the Arduino IDE running on our computer. It allows us to separate our data gathering from our software development.

For PySerial, refer to https://pypi.python.org/pypi/pyserial/2.7. We can install this with sudo -H pip3.4 install pyserial.

Here's how the installation looks:

MacBookPro-SLott:Code slott$ sudo -H pip3.4 install pyserial
Password:
Collecting pyserial
  Downloading pyserial-2.7.tar.gz (122kB)
    100% |████████████████████████████████| 122kB 1.5MB/s 
Installing collected packages: pyserial
  Running setup.py install for pyserial
Successfully installed pyserial-2.7

Windows agents will omit the sudo -H prefix. This command has downloaded and installed the small PySerial module.

We can leverage this to communicate with an Arduino (or any other device) through a USB port. We'll look at the interaction in Chapter 5, Data Collection Gadgets.