Python 2.6 Text Processing: Beginner's Guide

More Information
  • Know the options available for processing text in Python
  • Parse JSON data that is often used as a data delivery mechanism on the Internet
  • Organize a log-processing application via modules and packages to make it more extensible
  • Perform conditional matches via look-ahead and look-behind assertions by using basic regular expressions
  • Process XML and HTML documents in a variety of ways based on the needs of your application
  • Implement callback methods to perform SAX processing and walk in-memory DOM structures
  • Understand Unicode, character encoding, internationalization, and localization
  • Lay out a Mako template-based project by using techniques such as template inheritance, additional tags, and custom filters
  • Install and use the Mako templating system to create your own Mako templates
  • Process a large number of e-mail messages using the Python standard library and index them with Nucular for fast searching
  • Fix common exceptions that occur while dealing with different types of text encoding
  • Build simple PDF output using the ReportLab toolkit's high-level PLATYPUS framework
  • Generate Microsoft Excel output using the xlwt module
  • Open and edit existing Open Document files to use them as template sources
  • Understand supporting functions and classes, such as the Python IO system and packaging components

For programmers, working with text is not about reading their newspaper on a break; it's about taking textual data in one form and doing something to it. Extract, decrypt, parse, restructure – these are just some of the text tasks that can occupy much of a programmer's life. If this is your life, this book will make it better – a practical guide on how to do what you want with textual data in Python.

Python 2.6 Text Processing Beginner's Guide is the easiest way to learn how to manipulate text with Python. Packed with examples, it will teach you text processing techniques and give you the skills to work with the most popular Python libraries for transforming text from one form to another.

The book gets you going with a quick look at some data formats and the installation of supporting libraries and components, so that you're ready to get started. You move on to extracting text from a collection of sources and handling it using Python's built-in string functions and regular expressions. You look into processing structured text formats such as XML, HTML, JSON, and CSV. Then you progress to generating documents and creating templates. Finally, you look at ways to enhance text output via a collection of third-party packages such as Nucular, PyParsing, NLTK, and Mako.

  • The easiest way to learn text processing with Python
  • Deals with the most important textual data formats you will encounter
  • Learn to use the most popular text processing libraries available for Python
  • Packed with examples to guide you through

Page Count: 380
Course Length: 11 hours 24 minutes
Date of Publication: 14 Dec 2010

Categorizing types of text data
Ensuring you have Python installed
Implementing a simple cipher
Time for action – implementing a ROT13 encoder
Time for action – processing as a filter
Time for action – skipping over markup tags
Supporting third-party modules
Time for action – installing SetupTools
Running a virtual environment
Time for action – configuring a virtual environment
Where to get help?
Parsing web server logs
Time for action – generating transfer statistics
Using objects interchangeably
Time for action – introducing a new log format
Accessing files directly
Time for action – accessing files directly
Time for action – handling compressed files
Accessing multiple files
Time for action – spell-checking HTML content
Accessing remote files
Time for action – spell-checking live HTML pages
Time for action – handling urllib2 errors
Handling string IO instances
Understanding IO in Python 3
Understanding the basics of string objects
Time for action – employee management
String formatting
Time for action – customizing log processor output
Time for action – adding status code data
Creating templates
Time for action – displaying warnings on malformed lines
Calling string object methods
Time for action – simple manipulation with string methods
Reading CSV data
Time for action – processing Excel formats
Time for action – CSV and formulas
Time for action – processing custom CSV formats
Writing CSV data
Time for action – creating a spreadsheet of UNIX users
Modifying application configuration files
Time for action – adding basic configuration read support
Time for action – relying on configuration value interpolation
Time for action – configuration defaults
Writing configuration data
Time for action – generating a configuration file
Reconfiguring our source
Time for action – creating an egg-based package
Working with JSON
Time for action – writing JSON data
Simple string matching
Time for action – testing an HTTP URL
Advanced pattern matching
Time for action – regular expression grouping
Implementing Python-specific elements
Time for action – reading DNS records
XML data
SAX processing
Time for action – event-driven processing
Time for action – driving incremental processing
Time for action – creating a dungeon adventure game
The Document Object Model
Time for action – updating our game to use DOM processing
Time for action – using XPath in our adventure
Reading HTML
Time for action – displaying links in an HTML page
Time for action – installing Mako
Basic Mako usage
Time for action – loading a simple Mako template
Time for action – reformatting the date with Python code
Time for action – defining Mako def tags
Time for action – converting mail message to use namespaces
Inheriting from base templates
Time for action – updating base template
Time for action – adding another inheritance layer
Time for action – creating custom Mako tags
Overviewing alternative approaches
Understanding basic character encodings
Encodings in Python
Time for action – manually decoding
Time for action – copying Unicode data
Time for action – fixing our copy application
The codecs module
Time for action – changing encodings
Adopting good practices
Internationalization and Localization
Time for action – preparing for multiple languages
Time for action – providing translations
Dealing with PDF files using PLATYPUS
Time for action – installing ReportLab
Time for action – writing PDF with basic layout and style
Writing native Excel data
Time for action – installing xlwt
Time for action – generating XLS data
Working with OpenDocument files
Time for action – installing ODFPy
Time for action – generating ODT data
Defining a language syntax
Time for action – installing PyParsing
Time for action – implementing a calculator
Time for action – handling type translations
Time for action – suppressing portions of a match
Processing data using the Natural Language Toolkit
Time for action – installing NLTK
Understanding search complexity
Time for action – implementing a linear search
Text indexing
Time for action – installing Nucular
Time for action – full text indexing
Time for action – measuring index benefit
Time for action – field-qualified indexes
Time for action – performing advanced Nucular queries
Indexing and searching other data
Time for action – indexing Open Office documents
Other index systems


Jeff McNeil

Jeff McNeil has been working in the Internet Services industry for over 10 years. He cut his teeth during the late-1990s Internet boom and has been developing software for Unix and Unix-flavored systems ever since. Jeff has been a full-time Python developer for the better part of that time and has professional experience with a collection of other languages, including C, Java, and Perl. He takes an interest in systems administration and server automation problems. Jeff recently joined Google and has had the pleasure of working with some very talented individuals.