Python 2.6 Text Processing: Beginner's Guide

With a basic knowledge of Python, you can undertake time-saving text processing. This book introduces the various techniques and teaches them through practical examples and clear explanations.

Jeff McNeil

Book Details

ISBN-13: 9781849512121
Paperback: 380 pages

About This Book

  • The easiest way to learn text processing with Python
  • Deals with the most important textual data formats you will encounter
  • Learn to use the most popular text processing libraries available for Python
  • Packed with examples to guide you through

Who This Book Is For

This book is for people who have text in one format, and need it in another, as quickly as possible. You don't need any experience with text processing, but you will need some basic knowledge of Python.

Table of Contents

Chapter 1: Getting Started
Categorizing types of text data
Ensuring you have Python installed
Implementing a simple cipher
Time for action – implementing a ROT13 encoder
Time for action – processing as a filter
Time for action – skipping over markup tags
Supporting third-party modules
Time for action – installing SetupTools
Running a virtual environment
Time for action – configuring a virtual environment
Where to get help?
Summary
Chapter 2: Working with the IO System
Parsing web server logs
Time for action – generating transfer statistics
Using objects interchangeably
Time for action – introducing a new log format
Accessing files directly
Time for action – accessing files directly
Time for action – handling compressed files
Accessing multiple files
Time for action – spell-checking HTML content
Accessing remote files
Time for action – spell-checking live HTML pages
Time for action – handling urllib2 errors
Handling string IO instances
Understanding IO in Python 3
Summary
Chapter 3: Python String Services
Understanding the basics of string object
Time for action – employee management
String formatting
Time for action – customizing log processor output
Time for action – adding status code data
Creating templates
Time for action – displaying warnings on malformed lines
Calling string object methods
Time for action – simple manipulation with string methods
Summary
Chapter 4: Text Processing Using the Standard Library
Reading CSV data
Time for action – processing Excel formats
Time for action – CSV and formulas
Time for action – processing custom CSV formats
Writing CSV data
Time for action – creating a spreadsheet of UNIX users
Modifying application configuration files
Time for action – adding basic configuration read support
Time for action – relying on configuration value interpolation
Time for action – configuration defaults
Writing configuration data
Time for action – generating a configuration file
Reconfiguring our source
Time for action – creating an egg-based package
Working with JSON
Time for action – writing JSON data
Summary
Chapter 5: Regular Expressions
Simple string matching
Time for action – testing an HTTP URL
Advanced pattern matching
Time for action – regular expression grouping
Implementing Python-specific elements
Time for action – reading DNS records
Summary
Chapter 6: Structured Markup
XML data
SAX processing
Time for action – event-driven processing
Time for action – driving incremental processing
Time for action – creating a dungeon adventure game
The Document Object Model
Time for action – updating our game to use DOM processing
XPath
Time for action – using XPath in our adventure
Reading HTML
Time for action – displaying links in an HTML page
Summary
Chapter 7: Creating Templates
Time for action – installing Mako
Basic Mako usage
Time for action – loading a simple Mako template
Time for action – reformatting the date with Python code
Time for action – defining Mako def tags
Time for action – converting mail message to use namespaces
Inheriting from base templates
Time for action – updating base template
Time for action – adding another inheritance layer
Customizing
Time for action – creating custom Mako tags
Overviewing alternative approaches
Summary
Chapter 8: Understanding Encodings and i18n
Understanding basic character encodings
Unicode
Encodings in Python
Time for action – manually decoding
Time for action – copying Unicode data
Time for action – fixing our copy application
The codecs module
Time for action – changing encodings
Adopting good practices
Internationalization and Localization
Time for action – preparing for multiple languages
Time for action – providing translations
Summary
Chapter 9: Advanced Output Formats
Dealing with PDF files using PLATYPUS
Time for action – installing ReportLab
Time for action – writing PDF with basic layout and style
Writing native Excel data
Time for action – installing xlwt
Time for action – generating XLS data
Working with OpenDocument files
Time for action – installing ODFPy
Time for action – generating ODT data
Summary
Chapter 10: Advanced Parsing and Grammars
Defining a language syntax
PyParsing
Time for action – installing PyParsing
Time for action – implementing a calculator
Time for action – handling type translations
Time for action – suppressing portions of a match
Processing data using the Natural Language Toolkit
Time for action – installing NLTK
Summary
Chapter 11: Searching and Indexing
Understanding search complexity
Time for action – implementing a linear search
Text indexing
Time for action – installing Nucular
Time for action – full text indexing
Time for action – measuring index benefit
Time for action – field-qualified indexes
Time for action – performing advanced Nucular queries
Indexing and searching other data
Time for action – indexing Open Office documents
Other index systems
Summary

What You Will Learn

  • Know the options available for processing text in Python
  • Parse JSON data that is often used as a data delivery mechanism on the Internet
  • Organize a log-processing application via modules and packages to make it more extensible
  • Perform conditional matches via look-ahead and look-behind assertions by using basic regular expressions
  • Process XML and HTML documents in a variety of ways based on the needs of your application
  • Implement callback methods to perform SAX processing and walk in-memory DOM structures
  • Understand Unicode, character encoding, internationalization, and localization
  • Lay out a Mako template-based project by using techniques such as template inheritance, additional tags, and custom filters
  • Install and use the Mako templating system to create your own Mako templates
  • Process a large number of e-mail messages using the Python standard library and index them with Nucular for fast searching
  • Fix common exceptions that occur while dealing with different types of text encoding
  • Build simple PDF output using the ReportLab toolkit's high-level PLATYPUS framework
  • Generate Microsoft Excel output using the xlwt module
  • Open and edit existing OpenDocument files to use them as template sources
  • Understand supporting functions and classes, such as the Python IO system and packaging components
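
As a rough illustration of the look-ahead and look-behind assertions mentioned in the list above, here is a minimal sketch using the standard `re` module. It is not taken from the book: the sample text and variable names are invented for this example, and the code is written for Python 3 rather than the Python 2.6 the book targets.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "price: $30, cost: 45 EUR, fee: $7"

# Look-behind: match digits only when a dollar sign comes immediately before.
dollar_amounts = re.findall(r"(?<=\$)\d+", text)

# Look-ahead: match digits only when " EUR" comes immediately after.
euro_amounts = re.findall(r"\d+(?= EUR)", text)

print(dollar_amounts)
print(euro_amounts)
```

In both patterns the assertion constrains where a match may occur without becoming part of the matched text, which is why only the digits are returned.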

In Detail

For programmers, working with text is not about reading their newspaper on a break; it's about taking textual data in one form and doing something to it. Extract, decrypt, parse, restructure – these are just some of the text tasks that can occupy much of a programmer's life. If this is your life, this book will make it better – a practical guide on how to do what you want with textual data in Python.

Python 2.6 Text Processing Beginner's Guide is the easiest way to learn how to manipulate text with Python. Packed with examples, it will teach you text processing techniques and give you the skills to work with the most popular Python libraries for transforming text from one form to another.

The book gets you going with a quick look at some data formats and at installing the supporting libraries and components, so that you're ready to get started. You move on to extracting text from a collection of sources and handling it using Python's built-in string functions and regular expressions. You look into processing structured text documents such as XML, HTML, JSON, and CSV. Then you progress to generating documents and creating templates. Finally, you look at ways to enhance text output via a collection of third-party packages such as Nucular, PyParsing, NLTK, and Mako.
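
To give a flavor of "taking textual data in one form" and producing another, the sketch below converts a small JSON document into CSV using only the standard library. It is an assumption-laden illustration rather than an excerpt from the book: the `json_to_csv` helper, the field names, and the sample records are all hypothetical, and the code targets Python 3 (under Python 2.6 the in-memory buffer would come from the `StringIO` module instead).

```python
import csv
import io
import json

# Hypothetical input, invented for this illustration.
json_text = (
    '[{"user": "alice", "shell": "/bin/bash"},'
    ' {"user": "bob", "shell": "/bin/sh"}]'
)


def json_to_csv(source):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(source)
    buffer = io.StringIO()
    # The field names are assumed to be known in advance for this sketch.
    writer = csv.DictWriter(
        buffer, fieldnames=["user", "shell"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = json_to_csv(json_text)
print(csv_text)
```

Writing to an in-memory buffer keeps the example self-contained; the same writer could be pointed at an open file object instead.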
