Analyzing a Complex Dataset

April 2014


We may need to analyze volumes of data that are too large for a simple spreadsheet. A document with more than a few hundred rows can rapidly become bewildering. Handling thousands of rows can be very challenging, indeed.

How can we do this more sophisticated analysis? The answer for many such analytic problems is Python. It handles larger sets of data with ease. We can very easily write sophisticated sorting, filtering and calculating rules. We're not limited by the row-and-column structure, either.

We'll look at some data that—on the surface—is boring, prosaic budget information. We've chosen budgets because they're relatively simple to understand without a lot of problem domain background. Specifically, we'll download the budget for the city of Chicago, Illinois. Why Chicago? Chicago has made their data available to the general public, so we can use this real-world data for an example that's not contrived or simplified. A number of other municipalities have also made similar data available.

How do we proceed? Clearly, step one is to get the data so we can see what we're working with.

For more information, we can start with this URL:

The city's data portal offers a number of formats: CSV, JSON, PDF, RDF, RSS, XLS, XLSX, and XML. Of these, the JSON is perhaps the easiest to work with.

We can acquire the appropriation information with the following URL:

This will yield a JSON-format document. We can gather this data and cache it locally with a small Python script.

import urllib.request

budget_url = ".../rows.json?accessType=DOWNLOAD"  # full portal URL elided above

with open("budget_appropriations.json", "w") as target:
    with urllib.request.urlopen(budget_url) as document:
        # urlopen yields bytes; decode them for the text-mode target file
        target.write(document.read().decode("utf-8"))

We can use a similar script to gather the salary data that goes with the budget appropriation. The salary information will use this URL:

Clearly, we can create a more general script to download from these two slightly different URLs to gather two very different JSON files. We'll focus on the appropriations for this article because the data organization turns out to be simpler.
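One way to generalize the download is a small fetch function driven by a mapping of filenames to URLs. The sketch below uses hypothetical placeholder URLs; substitute the portal's real view URLs for the two datasets.

```python
import urllib.request

def fetch(url, filename):
    """Download one JSON document from the data portal and cache it locally."""
    with urllib.request.urlopen(url) as document:
        data = document.read().decode("utf-8")
    with open(filename, "w") as target:
        target.write(data)
    return filename

# Hypothetical placeholder URLs -- substitute the portal's real view URLs:
# fetch("https://.../appropriations/rows.json?accessType=DOWNLOAD",
#       "budget_appropriations.json")
# fetch("https://.../salaries/rows.json?accessType=DOWNLOAD",
#       "budget_salaries.json")
```

Keeping the URLs in one place makes it easy to add further datasets later without touching the download logic.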

We could download this data manually using a browser. It rapidly becomes difficult to automate data collection and analysis when we introduce a manual step involving a person pointing and clicking using a browser. A similar comment applies to trying to use a spreadsheet to analyze the data: merely putting it on a computer doesn't really automate or formalize a process.

The results of step one, then, are two files: budget_appropriations.json and budget_salaries.json.

Parsing the JSON document

Since the data is encoded in JSON, we can simply open the files in our development environment to see what they look like. Informally, it appears that two data sets have some common columns and some distinct columns. We'll need to create a more useful side-by-side comparison of the two files.

We'll import the JSON module. We almost always want to pretty-print during exploration, so we'll import the pprint() function just in case we need it. Here are the first two imports:

import json
from pprint import pprint

One thing we may have noted when looking at the JSON is that there are two important-looking keys: a 'meta' key and a 'data' key. The 'meta' key is associated with a sequence of column definitions. The 'data' key is associated with a sequence of rows of actual data.

We can use a quick script like the following to discover the details of the metadata:

def column_extract():
    for filename in "budget_appropriations.json", "budget_salaries.json":
        with open(filename) as source:
            print(filename)
            dataset = json.load(source)
            for col in dataset['meta']['view']['columns']:
                if col['dataTypeName'] != "meta_data":
                    print(col['fieldName'], col['dataTypeName'])
            print()

We've opened each of our source files, and loaded the JSON document into an internal mapping named 'dataset'. The metadata can be found by navigating through the dictionaries that are part of this document. The path is dataset['meta']['view']['columns']. This leads to a sequence of column definitions.

For each column definition, we can print out two relevant attributes using the keys 'fieldName' and 'dataTypeName'. This will reveal items that are dimensions and items that are facts within this big pile of data.

This small function can be used in a short script file to see the various columns involved. We can write a short script like this:

if __name__ == "__main__":
    column_extract()

We can see that we have columns whose types are number, text, and money. The number and text columns can be termed "dimensions": they describe the facts. The money columns are the essential facts that we'd like to analyze.

Designing around a user story

Now that we have the data, it makes sense to see where we're going with it. Our goal is to support queries against salaries and appropriations. Our users want to see various kinds of subtotals and some correlations between the two datasets.

The phrase "various kinds" reveals that the final analysis details are open-ended. It's most important for us to build a useful model of the data rather than solve some very specific problem. Once we have the data in a usable model, we can solve a number of specific problems.

A good approach is to follow the Star Schema (or Facts and Dimensions) design pattern that supports data warehouse design. We'll decompose each table into facts—measurements that have defined units—and dimensions that describe those facts. We might also call the dimensions attributes of Business Entities that are measured by the facts. The facts in a budget analysis context are almost always money. Almost everything else will be some kind of dimension: time, geography, legal organization, government service, or financial structure.

In the long run, we might like to load a relational database using Python objects. This allows a variety of tools to access the data. It would lead to a more complex technology stack, however. For example, we'd need an Object-Relational Mapping (ORM) layer in addition to a star schema layer.

For now, we'll populate a pure Python model. We'll show how this model can be extended to support SQLAlchemy as an ORM layer.

Reading the raw data and metadata

The JSON documents are fetched as dictionaries. This can be very handy. It fits well with the CSV module's approach to working with tabular data. We can, however, make a tiny change to make the data slightly easier to work with.

The dictionary items can be accessed via "row['column']" syntax. If we change the collection to a SimpleNamespace class, this allows us to use row.column syntax.
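The difference is easy to see side by side. The field names and values below are hypothetical stand-ins for one row of the dataset:

```python
from types import SimpleNamespace

# A hypothetical row, showing both access styles.
row_dict = {"fund_code": "0100", "department_number": "15"}
row = SimpleNamespace(**row_dict)

# Attribute access replaces the noisier subscript syntax.
assert row.fund_code == row_dict["fund_code"]
print(row.department_number)  # 15
```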

Here's a dictionary reader as an iterator over the rows in the original JSON document:

def dict_reader(source):
    dataset = json.load(source)
    cols = meta_view_columns(dataset)
    for row in dataset['data']:
        yield dict((m.fieldName, v) for m, v in zip(cols, row))

This does several things:

  • First, it loads the JSON data into an in-memory collection, dataset.
  • Second, it extracts the metadata from the JSON document, so that we can match column names to positions in the source data. We've used a function, meta_view_columns(), here that transforms the metadata to a simple list of relevant columns. We'll show that function below.
  • Third, it yields each row of data as a dictionary that maps a field name to value.

Here's how we can define the metadata. We'll start with a named tuple for the various metadata attributes present in the JSON document:

from collections import namedtuple

Metadata = namedtuple("Metadata",
    ("order", "name", "dataTypeName", "fieldName", "position", "renderTypeName"))

We've added one additional attribute, named order. The position attribute doesn't seem to reflect the actual position of the column in the JSON document, so we track our own ordering.

We'll use the following function as a factory to build these Metadata instances from the JSON document information:

def meta_view_columns(dataset):
    return [
        Metadata(n, *(row[x] for x in Metadata._fields[1:]))
        for n, row in enumerate(dataset['meta']['view']['columns'])
    ]

This function examines all of the column definitions in the JSON document. The various field names are used to build attributes of each Metadata object. We've built the "order" attribute based on the enumerated order of the metadata column definitions.

As shown above, we use this function on a dataset to extract the metadata and return a sequence of column definitions. We can then use that sequence of column definitions to interpret each row of raw data. This will allow us to associate a field name and data type with the data.

To simplify the syntax of the application slightly, we'll wrap each row's dictionary into a types.SimpleNamespace using the value of the metadata fieldName attribute as the attribute name. Here's the function that transforms a sequence of dictionaries to a sequence of namespace objects:

from types import SimpleNamespace

def ns_reader(source):
    return (SimpleNamespace(**row_dict) for row_dict in dict_reader(source))

We could build this into the dict_reader() function, but it seems better to keep it separate. For example, some of the metadata columns don't have valid Python names, making the ns_reader() less useful for these columns. Also, we may want to alias a column to change its name, which would insert some additional processing.
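The aliasing idea might look like the following sketch. The ALIASES mapping and the friendlier name "amount" are hypothetical choices, not part of the dataset:

```python
from types import SimpleNamespace

# Hypothetical sketch: map awkward source field names onto friendlier
# attribute names before building the namespace object.
ALIASES = {"_ordinance_amount_": "amount"}

def aliased_ns(row_dict):
    return SimpleNamespace(
        **{ALIASES.get(name, name): value for name, value in row_dict.items()})

row = aliased_ns({"fund_code": "0100", "_ordinance_amount_": "1000"})
print(row.amount)  # 1000
```

Columns not listed in ALIASES pass through unchanged, so the reader works for both datasets.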

We can see how this works with a simple script like the following:

if __name__ == "__main__":
    with open("budget_appropriations.json") as source:
        for row in ns_reader(source):
            print(row)

This will open a file, and use the namespace reader to parse and print the rows. This is quite a voluminous barrage of output. However, it shows us that we're on the right track toward analyzing this data set.

Understanding the dimension problem

When working with the star schema design, we are often going to accumulate the business entities or dimensional objects incrementally. We might get a few attributes from one source, and more attributes from another source. When loading the fact table, we'll also be creating a more complete view of each entity that forms the dimensions for that fact.

The processing pipeline for loading a warehouse often involves a series of dimensional update steps. Each of these steps will determine if the data is part of an existing dimension, an extension to that dimension or a change to an object along that dimension.

Let's assume we have additional data that reflects the organizational structure of the Chicago City government. When we load the budget data, we might only get an organizational code and name, and perhaps a few levels of the hierarchy around that organization.

If we loaded the city directory into our warehouse, we might be able to include additional information about each department, including mail stop and phone number.

We might load several versions of a budget into a warehouse, each with a common view of the organizational dimension. This would be typical. Each fact row would have a department number and name that matches data we've already seen.

A subsequent year, however, may introduce additional departments into the organization. This would extend the organizational dimension. We'd have to reload the city directory to capture mail stop and phone number information.

After several years, we might detect that the name of a department has changed to reflect a new organization. In this case, we have a dimensional change. There are many algorithms for tracking dimensional change. The advantages and disadvantages go beyond the scope of this article, which is narrowly focused on a Python implementation.

Designing the dimension metaclass

In order to properly accumulate dimensional information, we'll need to make each Dimension a container. We might be tempted to use a set. What we often find is that some attributes of a dimension are more useful for identifying a unique instance than others. An organizational code, for example, tends to be used very consistently. A phone number or mail drop may change haphazardly over time.

Consequently, a dimension will often have one or more indexes to map identifying attributes to the dimension object that contains the complete set of attributes.

We'll define each dimension class to include a container for all of the instances of the class. Our goal is to implement a "get-or-create" API for each instance of the dimension. If a dimensional entity already exists, we want to get the existing copy; this permits the accretion of additional information. If the entity doesn't already exist, we want to create just one master copy of it.

We'd like to load objects using a snippet of code like the following:

dept = (Department.get(department_number=row.department_number)
        or Department(row))

This will check the dimension to see if the given object already exists. In this case, we'll use the input value of department_number to search our collection. If the object doesn't exist, we'll create it. The __init__() constructor will use just the relevant attributes and also automatically add it to the collection.

We can't readily combine this into a single class method which performs the "get-or-create". The issue is that different sources may have different ways to identify the department. We might have a department number in one source, a department code in another source, and the department name in a third source.

Another important consideration is that each Dimension subclass needs to have a private collection of instances. If we trivially put an _instances=[] in the class definition, this single list object will be shared by all subclasses. To prevent sharing this list, we must tweak the class definition slightly to assure that each subclass gets its own list object to initialize the _instances class-level variable.

Here's a metaclass which initializes the _instances variable in each class that's built:

class DimensionMeta(type):
    def __new__(cls, name, bases, dict):
        new_class = super().__new__(cls, name, bases, dict)
        new_class._instances = list()
        return new_class

We've overridden the __new__() constructor for a class to insert an extra initialization. This allows us to define a subclass without having to explicitly create the Class._instances list.
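We can confirm the effect with throwaway subclasses (the class names Root, A, and B below are hypothetical, used only for this check): each class built by the metaclass gets a distinct _instances list rather than sharing one.

```python
class DimensionMeta(type):
    def __new__(cls, name, bases, namespace):
        new_class = super().__new__(cls, name, bases, namespace)
        new_class._instances = list()   # fresh list per class
        return new_class

class Root(metaclass=DimensionMeta): pass
class A(Root): pass
class B(Root): pass

A._instances.append("marker")
print(B._instances)  # [] -- B's list is its own, not shared with A
```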

We'll use this metaclass to define the Dimension superclass. This superclass will be used for each individual dimension that we find in our appropriations and salary data. Here's the first part of the class definition:

import weakref

class Dimension(metaclass=DimensionMeta):
    def __init__(self, aRow_ns):
        for n in self.__slots__:
            value = getattr(aRow_ns, n, None)
            setattr(self, n, value)
        self._instances.append(weakref.ref(self))
    def __repr__(self):
        slots = ", ".join(
            "{!s}={!r}".format(n, getattr(self, n)) for n in self.__slots__)
        return "{!s}({!s})".format(self.__class__.__name__, slots)

We've defined the __init__() method that will use an input SimpleNamespace object to create the dimension entity. We'll rely on the __slots__ variable to define the attributes of this class. In some cases, each row has all of the right attributes. In other cases, we'll set as many of the attributes as we can from one source, and update attributes from subsequent sources as we load our warehouse.

This design requires the dimension attribute names to precisely match the source row attribute names. This is not always desirable, but in this case, it seems to be a good design choice. It simplifies the constructor.

As noted above, this allows us to use Department(row) to create an object. Each Dimension subclass will have the detailed mapping information encapsulated entirely in the class.

Here's the suite of class methods that we can use to locate items within the _instances list:

@classmethod
def get(class_, **kw):
    """Returns first match."""
    for match in class_.get_iter(**kw):
        return match

@classmethod
def get_iter(class_, **kw):
    """Returns all matches."""
    for obj_ref in class_._instances:
        obj = obj_ref()
        if all(kw[n] == getattr(obj, n, None) for n in kw):
            yield obj

@classmethod
def all(class_):
    """Returns all."""
    return (obj_ref() for obj_ref in class_._instances)

We've defined three views into the collection of instances. The get() method does a simplistic search, returning the first match. The get_iter() method yields all of the matches. The all() method simply iterates through the contained collection.

Note that each object has a strong reference to the class that defines it. To avoid keeping instances alive forever, each class should hold only weak references to the various objects; otherwise the class-level collection would prevent its instances from ever being garbage collected.

The search algorithm shown in get_iter() is very slow. A subclass should implement a proper mapping of some kind to speed this up. Each such mapping is dependent on the kind of data and the way that individual objects are identified in each data source.
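One such mapping might look like the following sketch. This is a hypothetical, stripped-down class (not a Dimension subclass from the text) that maintains a dict keyed on a single identifying attribute, turning the linear scan into a constant-time lookup:

```python
# Hypothetical sketch: a dict index on one identifying attribute.
class IndexedDimension:
    _index = {}   # maps identifying code -> instance

    def __init__(self, code, description):
        self.code = code
        self.description = description
        self.__class__._index[code] = self

    @classmethod
    def get(cls, code):
        return cls._index.get(code)

d = IndexedDimension("15", "Police")
assert IndexedDimension.get("15") is d
print(IndexedDimension.get("99"))  # None -- no such code loaded
```

A real subclass would also need to handle the multiple identifying attributes discussed above, perhaps with one dict per identifying attribute.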

Designing a star schema for appropriations

When we look at the appropriation columns, we can see the following structure. This data appears to have four dimensions:

  • A Fund, which is described by three columns with names of fund_type, fund_code, and fund_description.
  • A City Department, which is described by two columns named department_number and department_description.
  • An Appropriation Authority, described by the columns named appropriation_authority and appropriation_authority_description.
  • An Appropriation Account. This has two columns, named appropriation_account and appropriation_account_description.

The lonely fact is the column named _ordinance_amount_. It's not clear why this name has extra punctuation, but that's what's in the fieldName attribute of the metadata, and it's an acceptable, legal Python name, so we'll use it.

As with many real-world problems, we don't know precisely what all of these columns mean. We don't need to know the nuances. What we have here is enough to create a usable data model for this dataset.

Here's a collection of four Python class definitions that captures the dimensions of the star schema:

class Fund(Dimension):
    __slots__ = ('fund_type', 'fund_code', 'fund_description')

class Department(Dimension):
    __slots__ = ('department_number', 'department_description')

class Authority(Dimension):
    __slots__ = ('appropriation_authority', 'appropriation_authority_description')

class Account(Dimension):
    __slots__ = ('appropriation_account', 'appropriation_account_description')

Each of these dimensions is a simple extension to the Dimension base class. We've provided the attribute names in the __slots__ definition. These names are required to match the source field names.

Each Appropriation has four dimensions and one fact. Here's the definition for an overall Appropriation fact object:

class Appropriation:
    def __init__(self, fund, department, authority, account, amount):
        self.fund = fund
        self.department = department
        self.authority = authority
        self.account = account
        self.amount = amount
    def __repr__(self):
        return ("{fund} {department} {authority} "
                "{account} {amount:n}".format_map(vars(self)))

Unlike the dimension constructors, which use a source row, this constructor accepts the specific dimension objects, plus a usable numeric amount. Since we're working with currency, the amount must be of the decimal.Decimal class.
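The choice of Decimal matters for currency. Binary floats accumulate representation error that Decimal avoids:

```python
from decimal import Decimal

# Floats drift on simple currency sums; Decimal does not.
print(0.1 + 0.2)                        # 0.30000000000000004
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```

This is why the loader converts the raw amount strings with Decimal(row._ordinance_amount_) rather than float().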

This allows us to build the overall star schema "fact table" as a class that is an extension to a list:

from decimal import Decimal

class Appropriations(list):
    def load(self, reader):
        for row in reader:
            fund = Fund.get(fund_code=row.fund_code) or Fund(row)
            dept = (Department.get(department_number=row.department_number)
                    or Department(row))
            app_auth = (Authority.get(
                            appropriation_authority=row.appropriation_authority)
                        or Authority(row))
            acct = (Account.get(
                        appropriation_account=row.appropriation_account)
                    or Account(row))
            appropriation = Appropriation(
                fund, dept, app_auth, acct,
                Decimal(row._ordinance_amount_))
            self.append(appropriation)
        return self

This class definition builds the Appropriation instances from a source namespace reader and returns itself as a proper list object. Each dimension is built using our canonical Dim.get() or Dim(row) pattern. The fact is built from each dimension object plus the amount converted to a Decimal value.

We can gather the data with a simple function like the following:

def load(facts=Appropriations(), filename="budget_appropriations.json"):
    with open(filename) as source:
        fact_table = facts.load(ns_reader(source))
    return fact_table

The two arguments are the fact table to build and the source file to build it from. We've opened the file, loaded the fact table, and then returned the resulting list.

The elapsed run time for this load? The 2014 budget data has about 3600 appropriation facts; this loads in 0.53 seconds on a MacBook laptop.

Reporting data in groups

We can use the list object created by our Appropriations.load() method to do a variety of summary queries. We can use a defaultdict or Counter collection to gather summaries from the fact collection. We can filter the rows to pass or reject facts, or we can split the data into subsets.

Here's a function to produce a simple report with totals grouped by department.

from collections import Counter

def report(appropriations):
    by_department = Counter()
    for app in appropriations:
        by_department[app.department] += app.amount
    for dept, amount in by_department.most_common():
        print("{0:35s} ${1:13n}".format(dept.department_description, amount))
    grand = sum(a.amount for a in appropriations)
    print("{0:35s} ${1:13n}".format("", grand))

We've built a Counter instance to accumulate totals. For each appropriation, we've added the amount into the Counter object, using the Department dimension object as the key. Finally, we've used the Counter.most_common() method to report the totals in descending order.

This function also computes a grand total from the original appropriations. We could also have (slightly more quickly) computed a grand total from sum(by_department.values()).

This is a kind of "slice and dice" report: we've sliced the data along the department dimension. Slicing by other dimensions is a matter of changing the dimension object used and changing the label(s) used for those dimension objects in the output.

We can easily add an if statement inside the for loop to act as a filter which passes or rejects specific departments. We could also accumulate multiple group-level totals concurrently, if that were necessary. We would do this by creating more Counter instances so that one trip through the data accumulates into several Counter objects.
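Here's a sketch of both ideas combined: one pass, two concurrent Counter accumulators, with an if statement acting as a filter. The department names, fund codes, and amounts below are hypothetical, and plain strings stand in for the Dimension objects of the real model:

```python
from collections import Counter
from decimal import Decimal
from types import SimpleNamespace

def two_way_totals(appropriations):
    """One trip through the facts, accumulating two group-level totals."""
    by_department = Counter()
    by_fund = Counter()
    for app in appropriations:
        if app.department.startswith("CITY"):
            continue  # the filter: reject these facts
        by_department[app.department] += app.amount
        by_fund[app.fund] += app.amount
    return by_department, by_fund

# Tiny synthetic facts (hypothetical values, not from the real dataset).
facts = [
    SimpleNamespace(department="POLICE", fund="0100", amount=Decimal("100")),
    SimpleNamespace(department="CITY COUNCIL", fund="0100", amount=Decimal("50")),
    SimpleNamespace(department="POLICE", fund="0200", amount=Decimal("25")),
]
by_dept, by_fund = two_way_totals(facts)
print(by_dept["POLICE"])  # 125
```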

Slicing and dicing data in a two-dimensional table

Another common requirement is a report that's two dimensional. Perhaps we want to see department vs. account as a tabular display. This is a relatively minor change to the accumulation of the data. It's a relatively complex change to the display of the output, however, since the number of columns being displayed may be rather large.

Here's a function to produce a cross-tab report on the appropriations data:

def cross_tab(appropriations):
    totals = Counter()
    for app in appropriations:
        totals[app.department, app.authority] += app.amount
    dept = list(sorted(set(d for d, a in totals.keys()),
                       key=lambda d: d.department_description))
    auth = list(sorted(set(a for d, a in totals.keys()),
                       key=lambda a: a.appropriation_authority))
    print(",".join(["department"]
                   + [a.appropriation_authority for a in auth]))
    for d in dept:
        print(",".join([d.department_description]
                       + [str(totals[d, a]) for a in auth]))

This function begins with a structure that's nearly identical to the previous report. We've created a Counter collection. We've accumulated totals based on two dimensions instead of one; in this case the Department dimension and the Authority dimension are used to create a two-tuple as a key to the counter's mapping.

To determine the row and column headings of the report, we need to determine the domain for departments and authorities. We can gather this information from the keys in the Counter collection. Each key is a two-tuple; the expression (d for d,a in totals.keys()) will be the list of departments. Similarly, (a for d,a in totals.keys()) will be the list of authorities. Note that the Counter collection may not contain all possible combinations; the actual data is quite sparse.

To produce user-friendly output, we've sorted the departments into ascending order by the department_description attribute. We've also sorted the authorities into ascending order by their cryptic appropriation_authority code. Once we have both of these dimensions in their proper order, we can iterate through all combinations of Department object, d, and Authority object, a.

For this output, we've stuffed in "," strings, as though we're creating a shabby CSV output. Creating a proper CSV file is a small expansion to this function. We can create a csv.writer object and use its writerow() method instead of the print() function.
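The upgrade is small. In the sketch below (the headings and numbers are hypothetical), the writer handles quoting of values that contain commas, which the hand-joined version would mangle:

```python
import csv
import io

buffer = io.StringIO()  # a file opened with newline="" would also work
writer = csv.writer(buffer)
writer.writerow(["department", "A01", "A02"])           # hypothetical codes
writer.writerow(["FINANCE, GENERAL", "1000", "2500"])   # note embedded comma

print(buffer.getvalue())
```

The department description containing a comma is automatically quoted in the output, so downstream tools parse it as one field.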

Indexing the facts by dimension

We can design certain kinds of high-performance slicing and dicing operations by creating dimension-specific indices. The principal design pattern here is to add an attribute to each dimension object that holds a list of related facts.

As we build each Appropriation fact, we'll keep two references to it.

  • One reference is in the Appropriations "fact table": the master list of all facts.
  • The other reference is in the dimension's list of associated facts.

We can build this additional reference during the load with something like fund.facts.append(appropriation). If we provide the list of related facts with each dimension, we can filter a dimension-related subset without having to examine all of the facts.
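A stripped-down sketch of the idea (a hypothetical Fund class and plain strings standing in for Appropriation objects):

```python
# Each dimension object carries a facts list, populated as facts load.
class Fund:
    def __init__(self, code):
        self.code = code
        self.facts = []   # dimension-specific index of related facts

fund = Fund("0100")
for appropriation in ("fact-1", "fact-2"):   # stand-ins for Appropriation objects
    fund.facts.append(appropriation)         # the second reference

# A fund-related subset, without scanning the whole fact table:
print(len(fund.facts))  # 2
```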

Expanding to include the salary star schema

If we look at the salary data, we see that it's a bit more complex. It appears to have as many as seven dimensions and two measures for each fact. Here are the various dimensions that appear at first blush.

  • A Fund as shown above. This is a common dimension between the two facts.
  • A City Department, also seen before; also a common dimension. However, in this dataset, the column names don't precisely match the other dataset.
  • A City Organization, which appears to be a hierarchy with both an Organization and a Division. The Organization is described by organization_code and organization_description. The Division has division_code and division_description.
  • A Section and a Sub-Section defined by section_code, section_description, sub_section_code, and sub_section_description. This might be a separate hierarchy, orthogonal to organization-division.
  • Schedule / Grade and Bargaining Unit. These are defined by schedule_grade and bargaining_unit columns. These appear to be part of collective bargaining or union agreements.
  • The job title has two parts: title_code and title_description.

There may be a proper hierarchical relationship between Organization, Division, Section, and Subsection, but the data shows 33 duplicated uses of a given Section within the Divisions. Clearly, we'd need some inside information to interpret this data. With Python, we can model them separately and modify the model as we learn more.

The following items are part of the measurements in the fact table that goes with these dimensions.

  • The column budgeted_unit has a domain of values of "annual", "monthly", and "hourly".
  • The column total_budgeted_unit is a number; it appears to be the number of people multiplied by the number of budgeted units in the budget cycle. For example, a unit of "hourly" and a total of 2080 appears to be one person times 40 hours per week times 52 weeks.
  • The position_control column has just two values: 0 and 1. The value of zero seems to indicate a position that's not one person filling a job: hourly or temporary workers can be used. The value of 1 seems to show that this is a "position" for a permanent employee.
  • The two measures seem to be budgeted_pay_rate and total_budgeted_amount. Both of these seem to be derived from the other three values.

We can see two typical cases for the salary calculation.

  • The title code of 463F has the following budget fact data: ['Hourly', '2080', '0', '45.84', '95347']. We can interpret this as follows. It's not a whole position; the position control is zero. The total budgeted units are 2080, the budgeted pay rate is $45.84: 2080×$45.84=$95347.20.
  • The title code of 4223 has the following budget fact data: ['Hourly', '18', '1', '12.5', '468000']. We can interpret this as follows. This describes a number of whole positions; position control is one. The total budgeted units value is 18, that is, 18 full-time positions; the pay rate is $12.50; 18×$12.5×2080=$468000.00. The 2080 is the conversion factor between full-time positions and hourly positions.
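As a quick sanity check, the arithmetic in both worked cases can be reproduced with Decimal:

```python
from decimal import Decimal

# Title 463F: non-position hourly work, 2080 units at $45.84.
hourly_non_position = Decimal("2080") * Decimal("45.84")
# Title 4223: 18 full-time positions at $12.50, 2080 hours each.
full_time_positions = Decimal("18") * Decimal("12.5") * Decimal("2080")

print(hourly_non_position)   # 95347.20
print(full_time_positions)   # 468000.0
```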

Both of these rules unify to the following method function of a Salary class:

import decimal

DOLLAR = decimal.Decimal('1')   # quantization target: whole dollars

def total(self):
    """Rounded to nearest whole."""
    with decimal.localcontext() as ctx:
        ctx.rounding = decimal.ROUND_UP
        factor = {'Annual': 1, 'Monthly': 12,
                  'Hourly': 2080}[self.budgeted_unit]
        units = (factor * self.total_budgeted_unit
                 if self.position_control == 1
                 else (self.total_budgeted_unit or 1))
        return (units * self.budgeted_pay_rate).quantize(DOLLAR)

This method function will properly compute the total salary from the details on each fact record. Interestingly, there are a few discrepancies between the total as supplied in the official budget dataset and the totals computed by this method.

What does it mean when there are discrepancies between the raw data and the pre-calculated results? What if the discrepancies amount to tens of millions of dollars? It's hard to say, but with these analytical tools, we're in a position to ask more penetrating questions. Because we're using Python, we can easily modify the calculation as we learn more about the exceptions and special cases in the data.

To incorporate both salary and appropriation datasets, we'll need to expand our definition of Department. This means that our process for loading salary data must provide the expanded definition of Department. It also means that our reporting may need to reflect multiple-level hierarchies.


Our goal in using Python is to work with sets of data that are large or complex, doing an analysis that's also rather complex. In this article, we've looked at loading some data that can be conformed to a star schema design pattern. We've also looked at producing grouped summaries and cross-tabulated summaries.

These kinds of analysis are just the tip of the iceberg. Considerably more complex processing is available in the Python standard libraries as well as the numerous add-on libraries. We're not constrained by a limited expression language in a spreadsheet. Nor are we constrained by the limited row-and-column model of spreadsheets and relational databases.

In order to show some of this analysis, we've created simple functions, generator functions, definitions for metaclasses as well as definitions of more ordinary classes. We've used a number of built-in data types like the set, list, dict, and tuple classes. We've also used some standard library components like the collections, json, and decimal modules.

We've created an extremely high-performance star-schema data model with only a few lines of code. We can retain large amounts of data in most modern computers; this means we can do complex analysis very quickly. With such powerful tools at our disposal, we can consider doing even more sophisticated and useful analysis of available data.

You've been reading an excerpt of Mastering Object-oriented Python.