Python for Secret Agents - Volume II - Second Edition


Chapter 2. Tracks, Trails, and Logs

In many cases, espionage is about data: primary facts and figures that help make an informed decision. It can be military, but it's more commonly economic or engineering in nature. Where's the best place to locate a new building? How well is the other team really doing? Among all of these prospects, which is the best choice?

In some cases, we're looking for data that's one step removed from the primary facts. We might need to know who's downloading the current team statistics. Who's reading the press-release information? Who's writing the bulk of the comments in our comments section? Which documents are really being downloaded? What is the pattern of access?

We're going to get data about the users and sources of some primary data. It's commonly called metadata: data about the primary data. It's the lifeblood of counter-intelligence.

We'll get essential web server metadata first. We'll scrape our web server's logs for details of website traffic. One of...

Background briefing – web servers and logs


At its heart, the World Wide Web is a vast collection of computers that handle the HTTP protocol. The HTTP protocol defines a request message and a response. A web server handles these requests, creating appropriate responses. This activity is written to a log, and we're interested in that log.

When we interact with a complex web site for a company that conducts e-business—buying or selling on the web—it can seem a lot more sophisticated than this simplistic request and reply protocol. This apparent complexity arises from an HTML web page, which includes JavaScript programming. This extra layer of code can make requests and process replies in ways that aren't obvious to the user of the site.

All web site processing begins with some initial request for an HTML web page. Other requests from JavaScript programs will be data requests that don't lead to a complete HTML page being sent from the server. It's common for JavaScript programs to request JSON...

Writing a regular expression for parsing


The logs look complex. Here's a sample line from a log:

109.128.44.217 - - [31/May/2015:22:55:59 -0400] "GET / HTTP/1.1" 200 14376 "-" "Mozilla/5.0 (iPad; CPU OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B440 Safari/600.1.4"

How can we pick this apart? Python offers us regular expressions as a way to describe (and parse) this string of characters.

We write a regular expression as a way of defining a set of strings. The set can be very small and have only a single string in it, or the set can be large and describe an infinite number of related strings. We have two issues that we have to overcome: how do we specify infinite sets? How can we separate those characters that help specify a rule from characters that just mean themselves?

For example, we might write a regular expression like aabr. This specifies a set that contains a single string. This regular expression looks like the mathematical expression a×a×b×r that...
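Returning to the sample log line shown earlier, here is a sketch (not necessarily the book's exact pattern) of a regular expression with named groups that picks the line apart; the group names are our own choices:

import re

sample_line = ('109.128.44.217 - - [31/May/2015:22:55:59 -0400] '
    '"GET / HTTP/1.1" 200 14376 "-" "Mozilla/5.0 (iPad; CPU OS 8_1_2 '
    'like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) '
    'Version/8.0 Mobile/12B440 Safari/600.1.4"')

# Each (?P<name>...) group captures one field of the log line.
log_pattern = re.compile(
    r'(?P<host>\S+)\s+'          # 109.128.44.217
    r'(?P<identity>\S+)\s+'      # - (identd information, rarely used)
    r'(?P<user>\S+)\s+'          # - (authenticated user, usually absent)
    r'\[(?P<time>[^\]]+)\]\s+'   # [31/May/2015:22:55:59 -0400]
    r'"(?P<request>[^"]*)"\s+'   # "GET / HTTP/1.1"
    r'(?P<status>\d+)\s+'        # 200
    r'(?P<bytes>\S+)\s+'         # 14376, or "-" when nothing was sent
    r'"(?P<referer>[^"]*)"\s+'   # "-"
    r'"(?P<user_agent>[^"]*)"'   # "Mozilla/5.0 (iPad; ..."
)

match = log_pattern.match(sample_line)
print(match.group('host'), match.group('request'), match.group('status'))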

Reading and understanding the raw data


Files come in a variety of formats. Even a file that appears to be simple text is often a UTF-8 encoding of Unicode characters. When we're processing data to extract intelligence, we need to look at three tiers of representation:

  • Physical Format: We might have a text file encoded in UTF-8, or we might have a GZIP file, which is a compressed version of the text file. Across these different physical formats, we can find a common structure. In the case of log files, the common structure is a line of text which represents a single event. (A sketch of two reader functions follows this list.)

  • Logical Layout: After we've extracted data from the physical form, we often find that the order of the fields is slightly different or some optional fields are missing. The trick of using named groups in a regular expression gives us a way to handle variations in the logical layouts by using different regular expressions depending on the details of the layout.

  • Conceptual Content: This is the data we were looking for, represented...
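The physical format tier is often the easiest to isolate. As a hedged sketch only (the actual implementations aren't reproduced in this excerpt), the local_text() and local_gzip() reader functions mentioned later in this chapter might look like this:

import gzip

def local_text(filename):
    # Yield lines from a plain UTF-8 text log file.
    with open(filename, "r", encoding="utf-8") as log:
        for line in log:
            yield line.rstrip("\n")

def local_gzip(filename):
    # Yield lines from a GZIP-compressed log file, decompressing as we read.
    with gzip.open(filename, "rt", encoding="utf-8") as log:
        for line in log:
            yield line.rstrip("\n")

Either function produces the same logical layout: one line of text per event.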

Reading remote files


We've given these functions names such as local_text and local_gzip because the files are located on our local machine. We might want to write other variations that use urllib.request.urlopen() to open remote files. For example, we might have a log file on a remote server that we'd like to process. This lets us write a generator function that yields lines from a remote file, allowing us to interleave downloading and processing in a single operation.

We can use the urllib.request module to handle remote files using URLs of this form: ftp://username:password@server/path/to/file. We can also use URLs of the form file:///path/to/file to read local files. Because of this transparency, we might want to look at using urllib.request for all file access.
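A hedged sketch of such a generator (the function name remote_text() and the decoding choice are our own assumptions) might look like this:

import urllib.request

def remote_text(url):
    # Yield lines of text from a remote file. urlopen() gives us bytes,
    # so each line is decoded before it is yielded.
    with urllib.request.urlopen(url) as source:
        for line in source:
            yield line.decode("utf-8").rstrip("\n")

The lines from remote_text() can be fed to the same parsing functions we use for local files.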

As a practical matter, it's somewhat more common to use FTP to acquire files in bulk.

Studying a log in more detail


A file is a serialized representation of Python objects. In some rare cases, the objects are strings, and we can deserialize the strings from the text file directly. In the case of our web server logs, some of the strings represent a date-time stamp. Also, the size of the transmitted content shouldn't be treated as a string: it's properly either an integer size or the None object if nothing was transmitted to the browser.

When requests for analysis come in, we'll often have to convert objects from strings to more useful Python objects. Generally, we're happiest if we simply convert everything into a useful, native Python data structure.

What kind of data structure should we use? We can't continue to use a Match object: it only knows about strings. We want to work with integers and datetimes.

The first answer is often to create a customized class that will hold the various attributes from a single entry in a log. This gives the most flexibility. It may...
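As an illustration only (the book's own class isn't reproduced in this excerpt), such a class might convert the interesting strings as it is built; the group names here follow the regular expression sketch shown earlier:

import datetime

class Event:
    # One log entry, with strings converted to more useful Python objects.
    def __init__(self, match):
        # match is the re.Match object produced by the log-parsing pattern.
        self.host = match.group("host")
        self.time = datetime.datetime.strptime(
            match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        self.request = match.group("request")
        self.status = int(match.group("status"))
        size = match.group("bytes")
        # A "-" means nothing was transmitted to the browser.
        self.size = None if size == "-" else int(size)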

What are they downloading?


In order to see what people are downloading, we'll need to parse the request field. This field has three elements: a method, a path, and a protocol. The method is almost always GET and the protocol is almost always HTTP/1.1. The path, however, shows the resource which was requested. This tells us what people are reading from a given website.

In our case, we can expand on the processing done in log_event_1() to gather the path information. It's a small change, and we'll add this line:

        event.method, event.path, event.protocol = event.request.split(" ")

This will update the event object by splitting the event.request attribute to create three separate attributes: event.method, event.path, and event.protocol.

We'll leave it to each individual agent to create the log_event_2() function from their log_event_1() function. It's helpful to have sample data and some kind of simple unit test to be sure that this works. We can use this log_event_2() function as follows...
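Since log_event_1() itself isn't reproduced in this excerpt, here is only a hedged sketch of the new step together with a simple check; the event object is assumed to already carry a request attribute:

from types import SimpleNamespace

def split_request(event):
    # Add method, path, and protocol attributes derived from the request field.
    event.method, event.path, event.protocol = event.request.split(" ")
    return event

# Sample data and a simple unit test.
event = SimpleNamespace(request="GET /index.html HTTP/1.1")
split_request(event)
assert event.method == "GET"
assert event.path == "/index.html"
assert event.protocol == "HTTP/1.1"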

Trails of activity


We can leverage the referrer (famously misspelled referer) information to track access around a web site. As with other interesting fields, we need to decompose this into host name and path information. The most reliable way to do this is to use the urllib.parse module.

This means that we'll need to make a change to our log_event_2() function to add yet another parsing step. When we parse the referrer URL, we'll get at least six pieces of information:

  • scheme: This is usually http.

  • netloc: This is the server which made the referral. This will be the name of the server, not the IP address.

  • path: This is the path to the page which had the link.

  • params: This holds any parameters attached to the last path segment after a ; symbol. Usually, this is empty for simple static content sites.

  • query: This can be anything after the ? symbol in a URL. Usually, this is empty for simple static content sites.

  • fragment: This can be anything after the # in a URL.
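For example (the referring URL here is made up for illustration), urllib.parse.urlparse() hands back all of these pieces at once:

from urllib.parse import urlparse

referer = "http://www.example.com/blog/page.html?keyword=agent#comments"
parts = urlparse(referer)

print(parts.scheme)    # 'http'
print(parts.netloc)    # 'www.example.com'
print(parts.path)      # '/blog/page.html'
print(parts.query)     # 'keyword=agent'
print(parts.fragment)  # 'comments'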

These details are items within a named tuple: we can refer to them by name or by position within the tuple. We have three ways to handle the parsing of URLs:

  • We can...

Who is this person?


We can learn more about an IP address using the Whois program. For agents with Linux or Mac OS X, the Whois program is built-in. Agents using Windows may want to download and install a whois program. See https://technet.microsoft.com/en-us/sysinternals/bb897435.aspx for more information.

The Whois program will examine the various registries used to track the names of servers on the internet. It will provide whatever information is available for a given server. This often includes the name of a person or organization that owns the server.

We'll start by using the built-in whois program. An alternative is to make a REST API request to a whois service using urllib. We're going to defer making REST API requests to Chapter 3, Following the Social Network.

The Whois program makes a request of a server and displays the results. The request is a single line of text, usually containing a domain name or IP address. The response from the server is a flood of text providing information...
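Here is a hedged sketch of running the local whois program from Python with the subprocess module (the exact output format varies between whois implementations):

import subprocess

def whois(address):
    # Run the local whois program and return its output as one block of text.
    return subprocess.check_output(
        ["whois", address], universal_newlines=True)

print(whois("109.128.44.217"))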

Getting logs from a server with ftplib


When we've created an analysis that HQ finds useful, we'll often have to scale this up to work on a larger supply of log files. This will involve acquiring and downloading files from servers without manually clicking a link to download and save each file.

We'll provide a sample of how we might use Python's ftplib to acquire files in bulk for analysis. Once we have the files locally, we can process them using our local_gzip() or local_text() functions.

Here's a function that performs a complete FTP interaction:

import ftplib
import getpass

def download(host, path, username=None):
    # Log in to the FTP server, change to the given directory,
    # and walk the directory listing looking for files to retrieve.
    with ftplib.FTP(host, timeout=10) as ftp:

        if username:
            # Prompt for the password rather than embedding it in the code.
            password = getpass.getpass("Password: ")
            ftp.login(user=username, passwd=password)
        else:
            # Anonymous login.
            ftp.login()

        ftp.cwd(path)
        # Ask the server only for the facts we need about each entry.
        for name, facts in ftp.mlsd(".", ["type", "size"]):
            if name.startswith("."): continue
            if facts['type'] == 'dir': continue
  ...
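Assuming the rest of the loop (not shown in this excerpt) retrieves and saves each remaining file, we might invoke the function like this; the host and path here are made up:

download("ftp.example.com", "/var/log/apache2", username="agent")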

Summary


In this chapter, we discussed a large number of elements of data analysis. We've looked at how we have to disentangle physical format from logical layout and conceptual content. We covered the gzip module as an example of how we can handle one particularly complex physical format issue.

We focused a lot of attention on using the re module to write regular expressions that help us parse complex text files. This addresses a number of logical layout considerations. Once we've parsed the text, we can then do data conversions to create proper Python objects so that we have useful conceptual content.

We also saw how we can use a collections.Counter object to summarize data. This helps us find the most common items, or create complete histograms and frequency tables.

The subprocess module helped us run the whois program to gather data from around the internet. The general approach to using subprocess allows us to leverage a number of common utilities for getting information about the internet...
