Python 2.6 Text Processing: Beginner's Guide

eBook: $26.99
Formats: PDF, PacktLib, ePub, and Mobi
save 15%!
Print + free eBook + free PacktLib access to the book: $71.98    Print cover: $44.99
save 37%!
Free Shipping!
UK, US, Europe and selected countries in Asia.
  • The easiest way to learn text processing with Python
  • Deals with the most important textual data formats you will encounter
  • Learn to use the most popular text processing libraries available for Python
  • Packed with examples to guide you through

Book Details

Language : English
Paperback : 380 pages [ 235mm x 191mm ]
Release Date : December 2010
ISBN : 1849512124
ISBN 13 : 9781849512121
Author(s) : Jeff McNeil
Topics and Technologies : All Books, Application Development, Beginner's Guides, Open Source, Python

Table of Contents

Chapter 1: Getting Started
Chapter 2: Working with the IO System
Chapter 3: Python String Services
Chapter 4: Text Processing Using the Standard Library
Chapter 5: Regular Expressions
Chapter 6: Structured Markup
Chapter 7: Creating Templates
Chapter 8: Understanding Encodings and i18n
Chapter 9: Advanced Output Formats
Chapter 10: Advanced Parsing and Grammars
Chapter 11: Searching and Indexing
Appendix A: Looking for Additional Resources
Appendix B: Pop Quiz Answers
  • Chapter 1: Getting Started
    • Categorizing types of text data
      • Providing information through markup
      • Meaning through structured formats
      • Understanding freeform content
    • Ensuring you have Python installed
      • Providing support for Python 3
    • Implementing a simple cipher
    • Time for action – implementing a ROT13 encoder
      • Processing structured markup with a filter
    • Time for action – processing as a filter
    • Time for action – skipping over markup tags
      • State machines
    • Supporting third-party modules
      • Packaging in a nutshell
    • Time for action – installing SetupTools
    • Running a virtual environment
      • Configuring virtualenv
    • Time for action – configuring a virtual environment
    • Where to get help?
    • Summary
  • Chapter 2: Working with the IO System
    • Parsing web server logs
    • Time for action – generating transfer statistics
    • Using objects interchangeably
    • Time for action – introducing a new log format
    • Accessing files directly
    • Time for action – accessing files directly
      • Context managers
      • Handling other file types
    • Time for action – handling compressed files
      • Implementing file-like objects
        • File object methods
        • Enabling universal newlines
    • Accessing multiple files
    • Time for action – spell-checking HTML content
      • Simplifying multiple file access
        • Inplace filtering
    • Accessing remote files
    • Time for action – spell-checking live HTML pages
      • Error handling
    • Time for action – handling urllib2 errors
    • Handling string IO instances
    • Understanding IO in Python 3
    • Summary
  • Chapter 3: Python String Services
    • Understanding the basics of string object
      • Defining strings
    • Time for action – employee management
      • Building non-literal strings
    • String formatting
    • Time for action – customizing log processor output
      • Percent (modulo) formatting
        • Mapping key
        • Conversion flags
        • Minimum width
        • Precision
        • Width
        • Conversion type
      • Using the format method approach
    • Time for action – adding status code data
      • Making use of conversion specifiers
    • Creating templates
    • Time for action – displaying warnings on malformed lines
      • Template syntax
      • Rendering a template
    • Calling string object methods
    • Time for action – simple manipulation with string methods
      • Aligning text
      • Detecting character classes
      • Casing
      • Searching strings
      • Dealing with lists of strings
        • Treating strings as sequences
    • Summary
  • Chapter 4: Text Processing Using the Standard Library
    • Reading CSV data
    • Time for action – processing Excel formats
    • Time for action – CSV and formulas
      • Reading non-Excel data
    • Time for action – processing custom CSV formats
    • Writing CSV data
    • Time for action – creating a spreadsheet of UNIX users
    • Modifying application configuration files
    • Time for action – adding basic configuration read support
      • Using value interpolation
    • Time for action – relying on configuration value interpolation
      • Handling default options
    • Time for action – configuration defaults
    • Writing configuration data
    • Time for action – generating a configuration file
    • Reconfiguring our source
      • A note on Python 3
    • Time for action – creating an egg-based package
      • Understanding the file
    • Working with JSON
    • Time for action – writing JSON data
      • Encoding data
      • Decoding data
    • Summary
  • Chapter 5: Regular Expressions
    • Simple string matching
    • Time for action – testing an HTTP URL
      • Understanding the match function
      • Learning basic syntax
        • Detecting repetition
        • Specifying character sets and classes
        • Applying anchors to restrict matches
      • Wrapping it up
    • Advanced pattern matching
      • Grouping
    • Time for action – regular expression grouping
      • Using greedy versus non-greedy operators
      • Assertions
        • Performing an 'or' operation
    • Implementing Python-specific elements
      • Other search functions
        • search
        • findall and finditer
        • split
        • sub
      • Compiled expression objects
        • Dealing with performance issues
      • Parser flags
      • Unicode regular expressions
      • The match object
        • Processing bind zone files
    • Time for action – reading DNS records
    • Summary
  • Chapter 6: Structured Markup
    • XML data
    • SAX processing
    • Time for action – event-driven processing
      • Incremental processing
    • Time for action – driving incremental processing
      • Building an application
    • Time for action – creating a dungeon adventure game
    • The Document Object Model
      • xml.dom.minidom
    • Time for action – updating our game to use DOM processing
      • Creating and modifying documents programmatically
    • XPath
      • Accessing XML data using ElementTree
    • Time for action – using XPath in our adventure
    • Reading HTML
    • Time for action – displaying links in an HTML page
      • BeautifulSoup
    • Summary
  • Chapter 7: Creating Templates
    • Time for action – installing Mako
    • Basic Mako usage
    • Time for action – loading a simple Mako template
      • Generating a template context
      • Managing execution with control structures
      • Including Python code
    • Time for action – reformatting the date with Python code
      • Adding functionality with tags
        • Rendering files with %include
        • Generating multiline comments with %doc
        • Documenting Mako with %text
        • Defining functions with %def
    • Time for action – defining Mako def tags
      • Importing %def sections using %namespace
    • Time for action – converting mail message to use namespaces
      • Filtering output
        • Expression filters
        • Filtering the output of %def blocks
        • Setting default filters
    • Inheriting from base templates
    • Time for action – updating base template
      • Growing the inheritance chain
    • Time for action – adding another inheritance layer
      • Inheriting attributes
    • Customizing
      • Custom tags
    • Time for action – creating custom Mako tags
      • Customizing filters
    • Overviewing alternative approaches
    • Summary
  • Chapter 8: Understanding Encodings and i18n
    • Understanding basic character encodings
      • ASCII
        • Limitations of ASCII
      • KOI8-R
    • Unicode
      • Using Unicode with Python 3
      • Understanding Unicode
        • Design goals
      • Organizational structure
      • Backwards compatibility
      • Encoding
        • UTF-32
        • UTF-8
    • Encodings in Python
    • Time for action – manually decoding
      • Reading Unicode
      • Writing Unicode strings
    • Time for action – copying Unicode data
    • Time for action – fixing our copy application
    • The codecs module
    • Time for action – changing encodings
    • Adopting good practices
    • Internationalization and Localization
      • Preparing an application for translation
    • Time for action – preparing for multiple languages
    • Time for action – providing translations
      • Looking for more information on internationalization
    • Summary
  • Chapter 9: Advanced Output Formats
    • Dealing with PDF files using PLATYPUS
    • Time for action – installing ReportLab
      • Generating PDF documents
    • Time for action – writing PDF with basic layout and style
    • Writing native Excel data
    • Time for action – installing xlwt
      • Building XLS documents
    • Time for action – generating XLS data
    • Working with OpenDocument files
    • Time for action – installing ODFPy
      • Building an ODT generator
    • Time for action – generating ODT data
    • Summary
  • Chapter 10: Advanced Parsing and Grammars
    • Defining a language syntax
      • Specifying grammar with Backus-Naur Form
      • Grammar-driven parsing
    • PyParsing
    • Time for action – installing PyParsing
    • Time for action – implementing a calculator
      • Parse actions
    • Time for action – handling type translations
      • Suppressing parts of a match
    • Time for action – suppressing portions of a match
    • Processing data using the Natural Language Toolkit
    • Time for action – installing NLTK
      • NLTK processing examples
        • Removing stems
        • Discovering collocations
    • Summary
  • Chapter 11: Searching and Indexing
    • Understanding search complexity
    • Time for action – implementing a linear search
    • Text indexing
    • Time for action – installing Nucular
      • An introduction to Nucular
    • Time for action – full text indexing
    • Time for action – measuring index benefit
      • Scripts provided by Nucular
      • Using XML files
      • Advanced Nucular features
    • Time for action – field-qualified indexes
      • Performing an enhanced search
    • Time for action – performing advanced Nucular queries
    • Indexing and searching other data
    • Time for action – indexing Open Office documents
    • Other index systems
      • Apache Lucene
      • ZODB and zc.catalog
      • SQL text indexing
    • Summary
  • Appendix A: Looking for Additional Resources
    • Python resources
      • Unofficial documentation
      • Python enhancement proposals
      • Self-documenting
        • Using other documentation tools
      • Community resources
        • Following groups and mailing lists
        • Finding a users' group
        • Attending a local Python conference
    • Honorable mention
      • Lucene and Solr
      • Generating C-based parsers with GNU Bison
      • Apache Tika
    • Getting started with Python 3
      • Major language changes
        • Print is now a function
        • Catching exceptions
        • Using metaclasses
        • New reserved words
        • Major library changes
        • Changes to list comprehensions
      • Migrating to Python 3
    • Time for action – using 2to3 to move to Python 3
    • Summary
  • Appendix B: Pop Quiz Answers
    • Chapter 1: Getting Started
      • ROT13 Processing Answers
    • Chapter 2: Working with the IO System
      • File-like objects
    • Chapter 3: Python String Services
      • String literals
      • String formatting
    • Chapter 4: Text Processing Using the Standard Library
      • CSV handling
      • JSON formatting
    • Chapter 5: Regular Expressions
      • Regular expressions
      • Understanding the Pythonisms
    • Chapter 6: Structured Markup
      • SAX processing
    • Chapter 7: Creating Templates
      • Template inheritance
    • Chapter 8: Understanding Encodings and i18n
      • Character encodings
      • Python encodings
      • Internationalization
    • Chapter 9: Advanced Output Formats
      • Creating XLS documents
    • Chapter 11: Searching and Indexing
      • Introduction to Nucular

Jeff McNeil

Jeff McNeil has been working in the Internet services industry for over 10 years. He cut his teeth during the Internet boom of the late '90s and has been developing software for Unix and Unix-flavored systems ever since. Jeff has been a full-time Python developer for the better half of that time and has professional experience with a collection of other languages, including C, Java, and Perl. He takes an interest in systems administration and server automation problems. Jeff recently joined Google and has had the pleasure of working with some very talented individuals.


What you will learn from this book

  • Know the options available for processing text in Python
  • Parse JSON data that is often used as a data delivery mechanism on the Internet
  • Organize a log-processing application via modules and packages to make it more extensible
  • Perform conditional matches via look-ahead and look-behind assertions by using basic regular expressions
  • Process XML and HTML documents in a variety of ways based on the needs of your application
  • Implement callback methods to perform SAX processing and walk in-memory DOM structures
  • Understand Unicode, character encoding, internationalization, and localization
  • Lay out a Mako template-based project by using techniques such as template inheritance, additional tags, and custom filters
  • Install and use the Mako templating system to create your own Mako templates
  • Process a large number of e-mail messages using the Python standard library and index them with Nucular for fast searching
  • Fix common exceptions that occur while dealing with different types of text encoding
  • Build simple PDF output using the ReportLab toolkit's high-level PLATYPUS framework
  • Generate Microsoft Excel output using the xlwt module
  • Open and edit existing Open Document files to use them as template sources
  • Understand supporting functions and classes, such as the Python IO system and packaging components
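
To give a flavor of two items in the list above (parsing JSON and matching with look-ahead and look-behind assertions), here is a minimal sketch using only the standard library. It is not taken from the book: the sample payload, its field names, and the patterns are assumptions made for illustration, and the code is written for a current Python 3 interpreter rather than the 2.6 release the book targets.

```python
import json
import re

# Hypothetical JSON payload; the field names are illustrative only.
payload = '{"title": "Text Processing", "tags": ["python", "text"], "pages": 380}'
record = json.loads(payload)  # decode the JSON text into a Python dict

# Look-ahead: match "Python" only when a space and a digit follow it.
lookahead = re.compile(r"Python(?= \d)")

# Look-behind: match a price only when a dollar sign precedes it.
lookbehind = re.compile(r"(?<=\$)\d+\.\d{2}")

versions = lookahead.findall("Python 2.6 and Python libraries")
prices = lookbehind.findall("eBook: $26.99, print: $44.99")

print(record["tags"])
print(versions)
print(prices)
```

Neither assertion consumes the text it checks, which is why the dollar sign and the version number stay out of the matched strings.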

In Detail

For programmers, working with text is not about reading their newspaper on a break; it's about taking textual data in one form and doing something to it. Extract, decrypt, parse, restructure – these are just some of the text tasks that can occupy much of a programmer's life. If this is your life, this book will make it better – a practical guide on how to do what you want with textual data in Python.

Python 2.6 Text Processing Beginner's Guide is the easiest way to learn how to manipulate text with Python. Packed with examples, it will teach you text processing techniques and give you the skills to work with the most popular Python libraries for transforming text from one form to another.

The book gets you going with a quick look at some data formats and at installing the supporting libraries and components, so that you're ready to get started. You move on to extracting text from a collection of sources and handling it using Python's built-in string functions and regular expressions. You look into processing structured text documents such as XML, HTML, JSON, and CSV. Then you progress to generating documents and creating templates. Finally, you look at ways to enhance text output via a collection of third-party packages such as Nucular, PyParsing, NLTK, and Mako.
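
As a rough illustration of the kind of task described here, reading structured text and reshaping it with built-in string handling, the following sketch parses CSV data with the standard csv module and formats each row into an aligned column layout. The sample data and column names are hypothetical, and the code assumes Python 3 rather than Python 2.6.

```python
import csv
import io

# Hypothetical CSV data standing in for a real file on disk.
raw = io.StringIO("name,shell\nroot,/bin/bash\ndaemon,/usr/sbin/nologin\n")

# DictReader uses the first row as field names for every later row.
rows = list(csv.DictReader(raw))

# Left-justify the name in a 10-character column, then append the shell.
lines = ["{0:<10}{1}".format(row["name"], row["shell"]) for row in rows]
report = "\n".join(lines)

print(report)
```

The same loop would work on any file-like object, so swapping the in-memory buffer for an opened file should require no other changes.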

This book is part of the Beginner's Guide series. Each chapter walks through the steps for various data-processing tasks, followed by a brief explanation of what is happening in each task. The explanation is followed by a few questions on the topic under discussion that serve as a refresher.

Who this book is for

This book is for people who have text in one format and need it in another as quickly as possible. You don't need any experience with text processing, but you will need some basic knowledge of Python.
