Chapter 2. Python Tips for Text Analysis
We mentioned in Chapter 1, What is Text Analysis, that we will be using Python throughout the book because it is an easy-to-use and powerful language. In this chapter, we will substantiate these claims, while also providing a revision course in basic Python for text analysis.
Why is this important? While we expect readers of the book to have a background in Python and high-school level math, it may have been a while since you last wrote Python code. Even if you write Python regularly, the code you write for text analysis and string manipulation is quite different from, say, building a website using the web framework Django. The following topics are covered in this chapter:
- Why Python?
- Text manipulation in Python
In Python, we represent text in the form of strings [1], which are objects of the str [2] class. They are immutable sequences of Unicode code points, or characters. It is important to make a careful distinction here, though: in Python 3, all strings are Unicode by default, but in Python 2, the str class holds byte strings (treated as ASCII by default), and there is a separate unicode class to deal with Unicode text.
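To make the idea of immutability concrete, here is a minimal sketch, assuming Python 3. The variable names (greeting, mutated, lowered) are illustrative only. It shows that a string cannot be changed in place, and that string methods hand back a new string instead.

```python
# Strings are immutable: a character cannot be replaced in place.
greeting = "Bonjour"

try:
    greeting[0] = "b"   # attempt to overwrite the first character
    mutated = True
except TypeError:
    mutated = False     # str does not support item assignment

# Methods such as lower() return a new string; the original is untouched.
lowered = greeting.lower()

print(mutated)    # False
print(greeting)   # Bonjour
print(lowered)    # bonjour
```

This is why string operations in the rest of the chapter assign their result to a variable rather than expecting the original string to change.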
Unicode is a standard that assigns a unique code point to each character; for example, the Unicode code point for the letter Z is U+005A. Encodings such as UTF-8 then specify how those code points are stored as bytes. There are many encodings, and historically in Python, developers were expected to deal with different encodings on their own, with all the low-level action happening in bytes. In fact, the shift in the way Python handles Unicode has led to a lot of discussions [3], criticism [4], and praise [5] within the community. It also remains an important point of contention when we are porting code from Python 2 to Python 3.
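The relationship between characters, code points, and bytes can be sketched as follows, assuming Python 3. The sample text and variable names are our own choices for illustration. The built-in ord function returns a character's code point, and str.encode turns a string into bytes under a given encoding.

```python
# Code point of a single character.
letter = "Z"
code_point = ord(letter)                  # integer code point
as_hex = f"U+{code_point:04X}"            # formatted in the usual U+XXXX style

# The same text takes a different number of bytes in different encodings.
text = "caf\u00e9"                        # "café", written with an escape for the accented e
utf8_bytes = text.encode("utf-8")         # the accented e takes two bytes in UTF-8
latin1_bytes = text.encode("latin-1")     # and one byte in Latin-1

print(as_hex)                             # U+005A
print(len(text), len(utf8_bytes), len(latin1_bytes))   # 4 5 4
print(utf8_bytes.decode("utf-8") == text)              # True
```

Note that the length of the string counts code points, while the length of the encoded value counts bytes, so the two can differ.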
We said earlier on that the low-level action was going on in...
Text manipulation in Python
We mentioned earlier in the chapter that the way we represent text in Python is through strings. So how do we specify that an object is a string?
word = "Bonjour World!"
Now the word variable contains the text Bonjour World!. Note how we used double quotes around the text. Single quotes also work, but if the string itself contains a single quote (an apostrophe), enclosing it in double quotes saves us from having to escape it. Printing our word is straightforward: all we need to do is use the print function. Remember to use parentheses if we are coding in Python 3!
print(word)
Bonjour World!
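The earlier note about single and double quotes can be illustrated with a short sketch. The example sentence and variable names here are our own and are not taken from the chapter.

```python
# Single and double quotes create the same kind of string.
single = 'Bonjour World!'
double = "Bonjour World!"
same = single == double

# A string containing an apostrophe is easiest to write with double quotes...
apostrophe = "It's a small world"
# ...although escaping the quote inside single quotes also works.
escaped = 'It\'s a small world'

print(same)                    # True
print(apostrophe == escaped)   # True
```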
We don't have to use variables to be able to print a string, though; we can also just do:
print("Bonjour World!")
Bonjour World!
Be careful not to enclose your variable name in quotation marks, though! Consider this example:
print("word")
word
This just prints the literal text word, not the contents of the word variable.
We mentioned before in the chapter that a string is a sequence of characters; how do we then access the first character of a...
The functions and strategies we have discussed will help our text analysis; in large-scale text analysis, a small error can often lead to complete nonsense results (remember garbage in, garbage out from Chapter 1, What is Text Analysis?).
We finish this mini-chapter with a few useful links on basic text manipulation:
- Printing and Manipulating Text [9]: Basic manipulation and printing of text, recommended if interested in how to display text in different ways.
- Manipulating Strings [10]: Basic string functions as well as exercises, useful for further practice of string manipulation.
- Manipulating Strings in Python [11]: Similar to the two preceding links, but includes a section on escape sequences as well.
- Text Processing in Python (book) [12]: Unlike the other links, this is a whole book. It covers the very fundamentals of text and string manipulation in Python and includes useful material on some topics we have not covered, such as regular expressions...