Advanced Output Formats in Python 2.6 Text Processing

Jeff McNeil

December 2010


Python 2.6 Text Processing: Beginners Guide

We won't dive into too much detail with any single approach. Rather, the goal of this article is to teach you the basics so that you can get started and explore the details on your own. Also, remember that our goal isn't to be pretty; it's to present a usable subset of functionality. In other words, our PDF layouts are ugly!

Unfortunately, the third-party packages used in this article are not yet compatible with Python 3. Therefore, the examples listed here will only work with Python 2.6 and 2.7.

Dealing with PDF files using PLATYPUS

The ReportLab framework provides an easy mechanism for dealing with PDF files. It provides a low-level interface, known as pdfgen, as well as a higher-level interface, known as PLATYPUS. PLATYPUS is an acronym, which stands for Page Layout and Typography Using Scripts. While the pdfgen framework is incredibly powerful, we'll focus on the PLATYPUS system here as it's slightly easier to deal with. We'll still use some of the lower-level primitives as we create and modify our PLATYPUS rendered styles.

The ReportLab Toolkit is not entirely Open Source. While the pieces we use here are indeed free to use, other portions of the library fall under a commercial license. We'll not be looking at any of those components here. For more information, see the ReportLab website.

Time for action – installing ReportLab

Like all of the other third-party packages we've installed thus far, the ReportLab Toolkit can be installed using the setuptools easy_install command. Go ahead and do that now from your virtual environment. We've truncated the output in order to save space; only the last lines are shown.

(text_processing)$ easy_install reportlab

What just happened?

The ReportLab package was downloaded and installed locally. Note that some platforms may require a C compiler in order to complete the installation process. To verify that the packages have been installed correctly, let's simply display the version tag.

(text_processing)$ python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import reportlab
>>> reportlab.Version

Generating PDF documents

In order to build a PDF document using PLATYPUS, we'll arrange elements onto a document template via a flow. The flow is simply a list element that contains our individual document components. When we finally ask the toolkit to generate our output file, it will merge all of our individual components together and produce a PDF.

Time for action – writing PDF with basic layout and style

In this example, we'll generate a PDF that contains a set of basic layout and style mechanisms. First, we'll create a cover page for our document. In a lot of situations, we want our first page to differ from the remainder of our output. We'll then use a different format for the remainder of our document.

  1. Create a new Python file and enter the following code:

    import sys
    from reportlab.platypus import SimpleDocTemplate, Paragraph
    from reportlab.platypus import Spacer, PageBreak
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.rl_config import defaultPageSize
    from reportlab.lib.units import inch

    from reportlab.lib import colors

    class PDFBuilder(object):
        HEIGHT = defaultPageSize[1]
        WIDTH = defaultPageSize[0]

        def _intro_style(self):
            """Introduction Specific Style"""
            style = getSampleStyleSheet()['Normal']
            style.fontName = 'Helvetica-Oblique'
            style.leftIndent = 64
            style.rightIndent = 64
            style.borderWidth = 1
            # The color value is missing from this copy of the
            # listing; black is assumed here.
            style.borderColor = colors.black
            style.borderPadding = 10
            return style

        def __init__(self, filename, title, intro):
            self._filename = filename
            self._title = title
            self._intro = intro
            self._style = getSampleStyleSheet()['Normal']
            self._style.fontName = 'Helvetica'

        def title_page(self, canvas, doc):
            """
            Write our title page.

            Generates the top page of the deck,
            using some special styling.
            """
            canvas.saveState()
            canvas.setFont('Helvetica-Bold', 18)
            canvas.drawCentredString(
                self.WIDTH/2.0, self.HEIGHT-180, self._title)
            canvas.setFont('Helvetica', 12)
            canvas.restoreState()

        def std_page(self, canvas, doc):
            """
            Write our standard pages.
            """
            canvas.saveState()
            canvas.setFont('Helvetica', 9)
            canvas.drawString(inch, 0.75*inch, "%d" % doc.page)
            canvas.restoreState()

        def create(self, content):
            """
            Creates a PDF.

            Saves the PDF named in self._filename.
            The content parameter is an iterable; each
            line is treated as a standard paragraph.
            """
            document = SimpleDocTemplate(self._filename)
            flow = [Spacer(1, 2*inch)]

            # Set our font and print the intro
            # paragraph on the first page.
            flow.append(
                Paragraph(self._intro, self._intro_style()))
            flow.append(PageBreak())

            # Additional content
            for para in content:
                flow.append(
                    Paragraph(para, self._style))
                # Space between paragraphs.
                flow.append(Spacer(1, 0.2*inch))

            document.build(
                flow, onFirstPage=self.title_page,
                onLaterPages=self.std_page)

    if __name__ == '__main__':
        if len(sys.argv) != 5:
            print "Usage: %s <output> <title> <intro file> <content file>" % \
                sys.argv[0]
            sys.exit(-1)

        # Do Stuff
        builder = PDFBuilder(
            sys.argv[1], sys.argv[2], open(sys.argv[3]).read())

        # Generate the rest of the content from a text file
        # containing our paragraphs.
        builder.create(open(sys.argv[4]))

  2. Next, we'll create a text file that will contain the introductory paragraph. We've placed it in a separate file so it's easier to manipulate. Enter the following into a text file named intro.txt.
    This is an example document that we've created from scratch; it has no story to tell. Its purpose? To serve as an example.
  3. Now, we need to create our PDF content. Let's add one more text file and name it paragraphs.txt. Feel free to create your own content here. Each new line will start a new paragraph in the resulting PDF. Our test data is as follows:
    This is the first paragraph in our document and it really serves no meaning other than example text.
    This is the second paragraph in our document and it really serves no meaning other than example text.
    This is the third paragraph in our document and it really serves no meaning other than example text.
    This is the fourth paragraph in our document and it really serves no meaning other than example text.
    This is the final paragraph in our document and it really serves no meaning other than example text.
  4. Now, let's run the PDF generation script:

    (text_processing)$ python <script> output.pdf "Example Document" intro.txt paragraphs.txt

  5. If you view the generated document in a reader, the generated pages should resemble the following screenshots:

The preceding screenshot displays the clean title page, which we derive from the command-line arguments and the contents of the introduction file. The next screenshot contains document copy, which we also read from a file.

What just happened?

We used the ReportLab Toolkit to generate a basic PDF. In the process, you created two different layouts: one for the initial page and one for subsequent pages. The first page serves as our title page. We printed the document title and a summary paragraph. The second (and third, and so on) pages simply contain text data.

At the top of our code, as always, we import the modules and classes that we'll need to run our script. We import SimpleDocTemplate, Paragraph, Spacer, and PageBreak from the reportlab.platypus module. These are items that will be added to our document flow.

Next, we bring in getSampleStyleSheet. We use this method to generate a sample, or template, stylesheet that we can then change as we need. Stylesheets are used to provide appearance instructions to Paragraph objects here, much like they would be used in an HTML document.

The last two lines import the inch size as well as some page size defaults. We'll use these to better lay out our content on the page. Note that everything here outside of the first line is part of the more general-purpose portion of the toolkit.

The bulk of our work is handled in the PDFBuilder class we've defined. Here, we manage our styles and hide the PDF generation logic. The first thing we do here is assign the default document height and width to class variables named HEIGHT and WIDTH, respectively. This is done to make our code easier to work with and to make for easier inheritance down the road.
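To make the inheritance point concrete, here is a minimal sketch of how class-level page dimensions feed the coordinate arithmetic used later by the page callbacks. It does not import ReportLab: the Layout and LandscapeLayout classes are hypothetical, the page size is an assumed US Letter page measured in points, and INCH stands in for the toolkit's inch unit (72 points).

```python
INCH = 72.0  # one inch expressed in PDF points


class Layout(object):
    # Assumed US Letter dimensions in points; the real script reads
    # these from the toolkit's default page size instead.
    WIDTH, HEIGHT = 612.0, 792.0

    def title_position(self):
        """Horizontal center of the page, 180 points below the top edge."""
        return (self.WIDTH / 2.0, self.HEIGHT - 180)

    def page_number_position(self):
        """One inch from the left edge, three quarters of an inch up."""
        return (INCH, 0.75 * INCH)


class LandscapeLayout(Layout):
    # A subclass only has to override the class variables; the
    # position methods pick up the new values automatically.
    WIDTH, HEIGHT = 792.0, 612.0


portrait = Layout().title_position()
landscape = LandscapeLayout().title_position()
```

Because the methods read self.WIDTH and self.HEIGHT rather than hard-coded numbers, a subclass changes the layout without touching any drawing code.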

The _intro_style method is responsible for generating the paragraph style information that we use for the introductory paragraph that appears in the box. First, we create a new stylesheet by calling getSampleStyleSheet. Next, we simply change the attributes that we wish to modify from default.

The values in the preceding table define the style used for the introductory paragraph, which is different from the standard style. Note that this is not an exhaustive list; this simply details the variables that we've changed.

Next we have our __init__ method. In addition to setting variables corresponding to the arguments passed, we also create a new stylesheet. This time, we simply change the font used to Helvetica (default is Times New Roman). This will be the style we use for default text.

The next two methods, title_page and std_page, define layout functions that are called when the PDF engine generates both the first and subsequent pages. Let's walk through the title_page method in order to understand what exactly is happening.

First, we save the current state of the canvas. This is a lower-level concept that is used throughout the ReportLab Toolkit. We then change the active font to a bold sans serif at 18 point. Next, we draw a string at a specific location in the center of the document. Lastly, we restore our state as it was before the method was executed.

If you take a quick look at std_page, you'll see that we're actually deciding how to write the page number. The library isn't taking care of that for us. However, it does help us out by giving us the current page number in the doc object.

Neither the std_page nor the title_page methods actually lay the text out. They're called when the pages are rendered to perform annotations. This means that they can do things such as write page numbers, draw logos, or insert callout information. The actual text formatting is done via the document flow.

The last method we define is create, which is responsible for driving title page creation and feeding the rest of our data into the toolkit. Here, we create a basic document template via SimpleDocTemplate. We'll flow all of our components onto this template as we define them.

Next, we create a list named flow that contains a Spacer instance. The Spacer ensures we do not begin writing at the top of the PDF document.

We then build a Paragraph containing our introductory text, using the style built in the self._intro_style method. We append the Paragraph object to our flow and then force a page break by also appending a PageBreak object.

Next, we iterate through all of the lines passed into the method as content. Each generates a new Paragraph object with our default style.

Finally, we call the build method of the document template object. We pass it our flow and two different methods to be called - one when building the first page and one when building subsequent pages.

Our __main__ section simply sets up calls to our PDFBuilder class and reads in our text files for processing.

The ReportLab Toolkit is very heavily documented and is quite easy to work with. For more information, see the documentation available on the ReportLab website. There is also a code snippets library that contains some common PDF recipes.

Have a go hero – drawing a logo

The toolkit provides easy mechanisms for including graphics directly into a PDF document. JPEG images can be included without any additional library support. Using the documentation referenced earlier, alter our title_page method such that you include a logo image below the introductory paragraph.

Writing native Excel data

Here, we'll look at an advanced technique that allows us to write native Excel data (without requiring Microsoft Windows). To do this, we'll be using the xlwt package.

Time for action – installing xlwt

Again, like the other third-party modules we've installed thus far, xlwt can be downloaded and installed via the easy_install system. Activate your virtual environment and install it now. Your output should resemble the following:

(text_processing)$ easy_install xlwt

What just happened?

We installed the xlwt package from the Python Package Index. To ensure your install worked correctly, start up Python and display the current version of the xlwt library.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import xlwt
>>> xlwt.__VERSION__

At the time of this writing, the xlwt module supports the generation of Excel xls format files, which are compatible with Excel 95 – 2003 (and later). MS Office 2007 and later use Office Open XML (OOXML).

Building XLS documents

In this example, we'll build on our CSV examples (out of the scope of this article). If you'll recall, the first example from that chapter read in a CSV file containing revenue and cost numbers. The script output was simply the profit for each set of inputs. Here, we'll update our approach and generate a spreadsheet using formulas directly.

Time for action – generating XLS data

In this example, we'll reuse the Worksheet1.csv file. Copy the file over to your current directory now.

  1. Create a new Python file and enter the following code:

    import csv
    import sys
    import xlwt
    from xlwt.Utils import rowcol_to_cell

    from optparse import OptionParser

    def render_header(ws, fields, first_row=0):
        """
        Generate an Excel Header.

        Builds a header line using different
        fonts from the default.
        """
        header_style = xlwt.easyxf(
            'font: name Helvetica, bold on')
        col = 0
        for hdr in fields:
            ws.write(first_row, col, hdr, header_style)
            col += 1
        return first_row + 2

    if __name__ == '__main__':
        parser = OptionParser()
        parser.add_option('-f', '--file', help='CSV Data File')
        parser.add_option('-o', '--output', help='Output XLS File')
        opts, args = parser.parse_args()

        if not opts.file or not opts.output:
            parser.error('Input source and output XLS required')

        # Create a dict reader from an open file
        # handle and iterate through rows.
        reader = csv.DictReader(open(opts.file, 'rU'))
        headers = [field for field in reader.fieldnames if field]
        headers.append('Profit')

        workbook = xlwt.Workbook()
        sheet = workbook.add_sheet('Cost Analysis')

        # Returns the row that we'll start at
        # going forward.
        row = render_header(sheet, headers)

        for day in reader:
            sheet.write(row, 0, day['Date'])
            sheet.write(row, 1, day['Revenue'])
            sheet.write(row, 2, day['Cost'])
            sheet.write(row, 3,
                xlwt.Formula('%s-%s' % (rowcol_to_cell(row, 1),
                    rowcol_to_cell(row, 2))))
            row += 1

        # Save workbook
        workbook.save(opts.output)

  2. Now, run the command with the following options. It should generate a profit.xls in your current working directory.

    (text_processing)$ python ./<script> -f Workbook1.csv -o profit.xls

  3. Open the newly created profit.xls file. It should resemble the following screenshot. Yes, there is a problem with the rendered data. We'll clean that up in just a little bit.
  4. Now, select a revenue value or a cost value and update. Take note of the profit column and see how it changes as we update our values.

What just happened?

We just updated our example so that it outputs Excel data rather than printing plain text to standard output! Additionally, we incorporated the generation of Excel formulas such that our resulting spreadsheet supports dynamic profit calculation. We were able to do all of this with just a few trivial changes to our existing script.

Let's take a look at exactly how we did it.

First of all, we imported the required modules. In this case, we brought in the xlwt package as well as xlwt.Utils.rowcol_to_cell. The former provides the majority of the functionality, while the latter allows us to translate numeric row and column coordinates into Excel-friendly letter + number locations.

Now, let's skip down to the __main__ section and follow our application's execution path. We added an additional option, -o or --output, which contains the destination filename for our new Excel file. We then updated our parameter checking to ensure both options are passed on the command line.

The next relevant changes occur with the following line of code.

headers = [field for field in reader.fieldnames if field]

Here, we pull all of our headers from the CSV data and strip out anything that doesn't evaluate to True. Why did we do this? Simple. If any empty cells made their way into our CSV data, we wouldn't want to include them as empty column headings in our output document.

Note that we also append the string Profit to our header list. We'll be adding the corresponding values in just a bit.
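The header handling can be tried in isolation with nothing but the standard csv module. The sketch below uses made-up sample rows (an assumption, not the contents of the real CSV file); the trailing comma on each line produces the kind of empty column name that the list comprehension filters out.

```python
import csv

# Hypothetical sample data. csv.DictReader accepts any iterable of
# lines, so a plain list is enough for a quick experiment.
lines = [
    'Date,Revenue,Cost,',
    '2010-01-01,100,60,',
]

reader = csv.DictReader(lines)

# Keep only field names that evaluate to True, dropping the empty
# name created by the trailing comma.
headers = [field for field in reader.fieldnames if field]

# Add the extra column that the script fills in with a formula.
headers.append('Profit')
```

After this runs, headers holds the three real column names followed by Profit, and the blank fourth column is gone.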

Next, we build our workbook. The xlwt package makes this quite easy. It only takes two lines to create a workbook and assign the first worksheet:

workbook = xlwt.Workbook()
sheet = workbook.add_sheet('Cost Analysis')

Here, we're creating a new workbook and then adding a sheet named Cost Analysis to it. If you look at the screenshot earlier in this article, you'll see that this is the name given to the tab at the bottom of the spreadsheet.

Now, we call a function we've defined named render_header and pass our sheet object to it with the list of headers we want to create. Looking at render_header, you'll notice that we first create a specific header style using xlwt.easyxf. This factory function takes a string definition of the style to be associated with a cell and returns an appropriate styling object.

Next, we simply iterate through all of our header columns and add them to the document using ws.write. Here, ws is the worksheet object we passed in to render_header.

One thing to note here is that the write method doesn't accept standard Excel cell names. Here, we need to pass in integer coordinates. Additionally, the data type of the inserted cell corresponds to the Python data type written. In this case, each value of hdr is a string. The result? These are all string columns in the final document.

We return the position of the first row with two added. This gives us a good logical place to start inserting our real data. We allowed the caller to pass in a starting row in order to provide just a bit more flexibility and reuse.

After our header has been rendered, we iterate through each row in our parsed CSV data and write the values to the sheet verbatim, in order. There's no data translation happening at all.
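Since the cell type follows the Python type and the csv module hands back every value as a string, numeric columns reach the sheet as text. One possible way to pass real numbers instead is sketched below. The to_number helper is hypothetical: it is not part of xlwt, and it is not necessarily the cleanup the author has in mind.

```python
def to_number(value):
    """Return value as a float when it parses as one, else unchanged."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # Dates, empty cells, and other non-numeric text fall through
        # untouched so they are still written as strings.
        return value


revenue = to_number('100')        # numeric text becomes a float
date = to_number('2010-01-01')    # non-numeric text is left alone
```

A call such as sheet.write(row, 1, to_number(day['Revenue'])) would then store a number rather than a string in that cell.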

Note the xlwt.Formula call. This is where we insert an Excel formula directly into our generated content. As the formula will be embedded, we need to translate from our numeric row and column syntax to the Excel syntax. This is done via our call to rowcol_to_cell. The following snippet shows how this is done:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> from xlwt.Utils import rowcol_to_cell
>>> rowcol_to_cell(1,6)
'G2'
>>> rowcol_to_cell(0,6)
'G1'
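To see what the translation involves, here is a minimal pure-Python sketch of the same zero-based row and column to cell-name mapping. The cell_name function is a hypothetical stand-in written for illustration; it is not xlwt's implementation, and real code should keep calling rowcol_to_cell.

```python
def cell_name(row, col):
    """Translate zero-based (row, col) into an A1-style cell reference."""
    letters = ''
    col += 1  # work one-based so that 1 -> A, 26 -> Z, 27 -> AA
    while col:
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord('A') + remainder) + letters
    # Excel rows are one-based, so shift the row index by one.
    return '%s%d' % (letters, row + 1)


# Build the same kind of formula text the script passes to
# xlwt.Formula: revenue column minus cost column for one row.
row = 2
formula = '%s-%s' % (cell_name(row, 1), cell_name(row, 2))
```

For the third spreadsheet row (index 2), the formula text refers to columns B and C of that row, which is why editing a revenue or cost cell updates the profit next to it.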

Finally, we save our new spreadsheet, embedded formulas included, under the name passed on the command line.

For more information, see the xlwt documentation. This documentation isn't entirely complete, so it's probably a good exercise to spend some time reading the source code if you intend to use xlwt in a production scenario. Also, note that xlwt has a read-focused counterpart, xlrd.

Working with OpenDocument files

OpenDocument files are generally just ZIP bundles that contain a collection of XML files, which define the document. At the lowest level, it's possible to parse and edit the XML data directly; however, that requires an intricate knowledge of the relevant schemas and XML elements. A couple of packages exist that abstract out the implementation details. Here, we'll look at the odfpy package.

If you need to define and generate a large number of ODF files, I also suggest that you look at the appy.pod framework. It provides an OpenDocument-based templating system that allows you to embed Python code. It's a little advanced for our purposes, though.

OpenDocument files are understood by the OpenOffice package, as well as Microsoft Office 2007 and later. However, it's important to understand that OpenDocument is different from Microsoft's OOXML formats (docx, xlsx).

Time for action – installing ODFPy

Again, we'll simply be using easy_install to add this third-party package to our virtual environment. Go ahead and do this now.

(text_processing)$ easy_install odfpy

What just happened?

As we did with xlwt and ReportLab, we installed a third-party module. Take a minute to ensure the ODF libraries are installed correctly. We'll just start up Python and make sure we can import the top-level package.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import odf
>>> dir(odf)
['__builtins__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__']

Building an ODT generator

The odfpy package that we're using is a fairly low-level package. It's possible to access the XML data directly if we so choose, though we won't be doing much of that here.

Here, we'll look at how you programmatically build and style an OpenDocument Text file, or an ODT, for short.

Time for action – generating ODT data

In this example, we'll build a self-documenting Python module. In fact, we'll use Python's powerful introspection functionality and the odfpy package to generate a formatted OpenDocument file that serves as API documentation!

If you haven't done so yet, take a couple of minutes and ensure you have OpenOffice installed. It is freely available. OpenOffice also handles common Microsoft Office formats.

  1. First, create a new Python file.
  2. Next, either copy over or enter the code as it appears in the ZIP bundle available for download on the Packt Publishing FTP site. We've left it out here in order to save space.
  3. Run the code listing at the command line as follows. A new file named __main__.odt should appear in the current working directory.

    (text_processing)$ python <script>

  4. Open the new document file in OpenOffice Writer. The contents should resemble the following screenshot:

What just happened?

We used the inspect module to generate a snapshot of the running file. We then used that information to generate an OpenDocument text file documenting the example. Let's step through and look at the relevant parts.

The first thing we did was import the required objects. All style information comes from the odf.style module. Here, we imported Style, TextProperties, and ParagraphProperties. A little bit more on these in a minute. Next, we imported OpenDocumentText from odf.opendocument. We're dealing with an OpenDocument text file, so this is all we'll need.

Lastly, we bring in P and Span. These are much like their HTML counterparts. The P function defines a paragraph class that acts as a single block of content, whereas the Span function defines an inline text snippet that can become part of a larger paragraph.

Next, we define three styles. Each style is then referenced later when we generate our document content. As stated earlier, the odfpy module is generally a wrapper around ElementTree objects, so this approach should feel somewhat familiar to you.

Let's take a closer look at one of the style definitions.

DOC_STYLE = Style(name='DOC_STYLE', family='paragraph')
DOC_STYLE.addElement(TextProperties(
    color='#000000', fontsize='12pt', fontfamily='Helvetica'))
DOC_STYLE.addElement(ParagraphProperties(
    marginbottom='16pt', marginleft='14pt'))

Here, we create a Style element and name it DOC_STYLE. It has a family of paragraph. When we want to apply the style later, we'll need to refer to it by this name. The family attribute categorizes which type of element it will apply to. If you attempt to apply a style in a text family to an object created with P, it simply won't apply.

Next, we call addElement twice, each time passing in a new instance of a properties class. The TextProperties call sets the display information for the text rendered within an element implementing this style. The ParagraphProperties call sets properties that are unique to Paragraph generated elements.

The following table outlines the style options we used for paragraphs and text elements. This isn't an exhaustive list. For more information, see the odfpy documentation.

Just in case you're slightly confused, each of the imported odfpy objects is a function. P, Style, ParagraphProperties: all functions. Calling them simply returns an instance of odf.element.Element, which is a lower-level XML construct.

Now, let's take a very brief tour of our module_members function. There's a little bit of magic going on here. In short, we introspect a Python module and yield information regarding each top-level function and class that it defines. The information yielded is contained in a namedtuple we defined previously. Python has some very powerful introspection abilities. For more information, see the documentation for the inspect module.
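Since the full listing is not reproduced here, the following is only a sketch of what such a generator might look like, not the author's code. It assumes the namedtuple is called ObjectDesc with the fields name, type, and doc (suggested by the docstring XML shown later in this article); the 'Class' and 'Function' labels and the filtering rules are assumptions made for illustration.

```python
import inspect
import textwrap  # only used as a convenient module to introspect
from collections import namedtuple

# Assumed shape of the record yielded for each documented object.
ObjectDesc = namedtuple('ObjectDesc', 'name type doc')


def module_members(module):
    """Yield an ObjectDesc for each class and function a module defines."""
    for name, obj in inspect.getmembers(module):
        if inspect.isclass(obj):
            kind = 'Class'
        elif inspect.isfunction(obj):
            kind = 'Function'
        else:
            continue  # skip constants, submodules, and so on
        # Ignore objects that were merely imported into the module.
        if getattr(obj, '__module__', None) != module.__name__:
            continue
        yield ObjectDesc(name, kind, inspect.getdoc(obj) or '')


# Try it against a small standard library module.
members = dict((m.name, m) for m in module_members(textwrap))
```

Passing sys.modules['__main__'] instead of textwrap would describe the running script itself, which is the trick the article's example relies on.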

Our ModuleDocumenter class does all of the ODT file generation. In the __init__ method, we set the output filename, create an empty document object, and call self._add_styles. If we look at self._add_styles, we see the following three lines:


In this step, we're adding the global styles we created earlier to our new document object so they can be referenced by content. Technically, we're simply adding the XML data defined in the style objects to the generated ODT XML data.

Now, skip on down to the __main__ section. We create an instance of ModuleDocumenter and pass it sys.modules['__main__'] and the string __main__. What does this mean? We're passing in an instance of the currently running module.

The build method of ModuleDocumenter is fairly simple. We iterate through all of the results yielded by the module_members generator and build our documentation. As you can see, we call self.doc.text.addElement twice, once with the results of self._create_header and once with the results of the P function. Again, the addElement approach should remind you of some of the XML processing code we examined much earlier.

The _create_header method first creates a new paragraph element by calling P. Then, it concatenates two Span elements using two different style names: TYPE_STYLE and NAME_STYLE. This gives our paragraph headings the look seen in the text document screenshot. We then return the new paragraph.

The underlying XML generated is as follows (though this is not important, it may help with your overall understanding):

<text:span text:style-name="TYPE_STYLE">Type: </text:span>
<text:span text:style-name="NAME_STYLE">ModuleDocumenter</text:span>
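To get a feel for where markup like this comes from, the sketch below builds an equivalent paragraph with two styled spans using the standard library's ElementTree rather than odfpy. It is an illustration of the XML shape only: the qname helper is hypothetical, and odfpy's own classes handle namespaces for you.

```python
import xml.etree.ElementTree as ET

# The OpenDocument "text" namespace, bound to the usual prefix so the
# serialized output reads text:p and text:span.
TEXT_NS = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0'
ET.register_namespace('text', TEXT_NS)


def qname(tag):
    """Return tag qualified with the text namespace (hypothetical helper)."""
    return '{%s}%s' % (TEXT_NS, tag)


# A paragraph holding two inline spans, each with its own style name.
para = ET.Element(qname('p'))
for style, content in (('TYPE_STYLE', 'Type: '),
                       ('NAME_STYLE', 'ModuleDocumenter')):
    span = ET.SubElement(para, qname('span'), {qname('style-name'): style})
    span.text = content

xml = ET.tostring(para).decode('ascii')
```

The serialized string contains both span elements shown above, wrapped in a single text:p element that also carries the namespace declaration.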

After generating the section header, we build a standalone paragraph, which contains the contents of each docstring. We simply use a different style. In all cases, the text content was passed into the factory function as the value to the text keyword argument.

The XML generated for each docstring resembles the following.

<text:p text:style-name="DOC_STYLE">ObjectDesc(name, type, doc)</text:p>

Finally, we save our new ODT document by calling the document's save method. Note that we don't include a file extension. The save method automatically adds one for us based on the document type if the second argument is True.

The odfpy package can be somewhat confusing, and the documentation is slightly lacking. For more information, see the odfpy site. If you're attempting to write more serious OpenDocument files, then reading the OpenDocument standard is very much recommended. It is available online.

Have a go hero – understanding ODF XML files

As we've said, the OpenDocument standard is XML-based. Each ODF file is nothing more than a ZIP-compressed set of XML files (more accurately, a JAR file). Take a break from the Python code for a minute and uncompress the example file we created in this article. You'll learn a lot from wading through the contents.
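If you would rather stay in Python for this exercise, the standard zipfile module can open the bundle directly. The two helpers below are hypothetical names chosen for this sketch, and it assumes the members you read are UTF-8 encoded XML.

```python
import zipfile


def list_members(path):
    """Return the names of the files stored inside an ODF bundle."""
    with zipfile.ZipFile(path) as bundle:
        return bundle.namelist()


def read_member(path, name):
    """Return one member of the bundle as text (assumes UTF-8)."""
    with zipfile.ZipFile(path) as bundle:
        return bundle.read(name).decode('utf-8')
```

For example, list_members('__main__.odt') lists the files inside the document generated earlier, and read_member can then pull out any one of the XML files for a closer look.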


Summary

In this article, we took a broad survey of three popular advanced output options. We also pointed out external resources should you need to do any in-depth work with these file types.

Specifically, we touched on the building and styling of simple PDF files using the ReportLab Toolkit, managing native Excel documents as opposed to CSV, and managing and manipulating OpenDocument files like the ones used by OpenOffice.

As a bit of a bonus, you also learned a bit more about Python's powerful introspection abilities as we built a self-documenting application.

You've been reading an excerpt of:

Python 2.6 Text Processing: Beginners Guide
