Working with files and folders
Our computer is full of files. One of the most important features of our operating system is the way it handles files and devices. Python gives us an outstanding level of access to various kinds of files.
However, we've got to draw a few lines. All files consist of bytes. This is a reductionist view that's not always helpful. Sometimes those bytes represent Unicode characters, which makes reading the file relatively easy. Sometimes those bytes represent more complex objects, which may make reading the file quite difficult.
Pragmatically, files come in a wide variety of physical formats. Our various desktop applications (word processors, spreadsheets, and so on) all have unique formats for the data. Some of those physical formats are proprietary products, and this makes them exceptionally difficult to work with. The contents are obscure (not secure) and the cost of ferreting out the information can be extraordinary. As a last resort, we can always examine the low-level bytes and recover information that way.
Many applications work with files in widely standardized formats. This makes our life much simpler. The format may be complex, but the fact that it conforms to a standard means that we can recover all of the data. We'll look at a number of standardized formats for subsequent missions. For now, we need to get the basics under our belts.
Creating a file
We'll start by creating a text file that we can work with. There are several interesting aspects to working with files. We'll focus on the following two aspects:
Creating a file object. The file object is the Python view of an operating system resource. It's actually rather complex, but we can access it very easily.
Using the file context. A file has a particular life: open, read or write, and then close. To be sure that we close the file and properly disentangle the OS resources from the Python object, we're usually happiest using a file as a context manager. Using a with statement guarantees that the file is properly closed.
Our general template for creating a file looks like this:
    with open("message1.txt", "w") as target:
        print("Message to HQ", file=target)
        print("Device Size 10 31/32", file=target)
We'll open the file with the open() function. In this case, the file is opened in write mode. We've used the print() function to write some data into the file.
Once the program finishes the indented context of the with statement, the file is properly closed and the OS resources are released. We don't need to explicitly close the file object.
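To see that the with statement really does close the file, we can check the closed attribute of the file object before and after the indented block ends. This is a minimal sketch; it writes into a temporary directory (an assumption made here so that the example cleans up after itself) rather than the current working directory, and the names open_inside and closed_after are purely illustrative.

```python
import os
import tempfile

# Work in a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "message1.txt")

    with open(path, "w") as target:
        print("Message to HQ", file=target)
        print("Device Size 10 31/32", file=target)
        open_inside = not target.closed   # still open inside the context

    closed_after = target.closed          # closed once the context ends

print(open_inside, closed_after)
```

Both values should be True: the file is open while the indented block runs and closed as soon as the block is left.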
We can also use something like this to create our file:
    text = """Message to HQ\nDevice Size 10 31/32\n"""
    with open("message1.txt", "w") as target:
        target.write(text)
Note the important difference here. The print() function automatically ends each line with a \n character. The write() method of a file object doesn't add anything.
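The difference between print() and write() is easy to see side by side. The following sketch uses io.StringIO, an in-memory stand-in for a text file, so that nothing is written to disk; that substitution is an assumption made for the example, and a real file opened in write mode should behave the same way with respect to line endings.

```python
import io

# io.StringIO behaves like a text file opened for writing, but lives in memory.
with io.StringIO() as printed:
    print("Message to HQ", file=printed)
    print("Device Size 10 31/32", file=printed)
    printed_text = printed.getvalue()

with io.StringIO() as written:
    written.write("Message to HQ")          # no newline is added
    written.write("Device Size 10 31/32")   # runs on from the previous text
    written_text = written.getvalue()

print(repr(printed_text))
print(repr(written_text))
```

The printed version contains two lines, each ending in \n. The written version is a single run of text, because write() emits only what it's given.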
In many cases, we may have more complex physical formatting for a file. We'll look at JSON or CSV files in a later section. We'll also look at reading and writing image files in Chapter 3, Encoding Secret Messages with Steganography.
Reading a file
Our general template for reading a file looks like this:
    with open("message1.txt", "r") as source:
        text = source.read()
    print(text)
This will create the file object, but it will be in read mode. If the file doesn't exist, we'll get an exception (a FileNotFoundError). The read() method will slurp the entire file into a single block of text. Once we're done reading the content of the file, we're also done with the with context. The file can be closed and the resources can be released. The text variable that we created will have the file's contents ready for further processing.
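Here is a small sketch of both outcomes: reading a file that exists and trying to read one that doesn't. It works in a temporary directory so that it cleans up after itself; the name no_such_file.txt and the outcome variable are hypothetical, chosen only for this illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a file so there is something to read.
    path = os.path.join(folder, "message1.txt")
    with open(path, "w") as target:
        target.write("Message to HQ\nDevice Size 10 31/32\n")

    # read() returns the whole file as one string.
    with open(path, "r") as source:
        text = source.read()

    # Opening a file that isn't there raises FileNotFoundError.
    missing = os.path.join(folder, "no_such_file.txt")
    try:
        with open(missing, "r") as source:
            source.read()
        outcome = "opened"
    except FileNotFoundError:
        outcome = "missing"

print(repr(text))
print(outcome)
```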
In many cases, we want to process the lines of the text separately. For this, Python gives us the for loop. This statement interacts with files to iterate through each line of the file, as shown in the following code:
    with open("message1.txt", "r") as source:
        for line in source:
            print(line)
The output looks a bit odd, doesn't it?
It's double-spaced because each line read from the file contains a \n character at the end. The print() function automatically includes a \n character. This leads to double-spaced output.
We have two candidate fixes. We can tell the print() function not to include a \n character. For example, print(line, end="") does this.
A slightly better fix is to use the rstrip() method to remove the trailing whitespace from the right-hand end of the line, as in print(line.rstrip()). This is slightly better because it's something we'll do often in a number of contexts. Attempting to suppress the output of the extra \n character in the print() function is too specialized to this one situation.
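The following sketch compares the lines as they come from the file with the lines after rstrip(). It uses io.StringIO as an in-memory stand-in for the open file, and the sample text has two extra trailing spaces on its second line; both are assumptions made for this illustration.

```python
import io

# Sample text; the second line deliberately ends with two spaces.
sample = "Message to HQ\nDevice Size 10 31/32  \n"

# Lines exactly as iteration delivers them: each keeps its trailing \n.
raw_lines = [line for line in io.StringIO(sample)]

# rstrip() removes the \n and any other trailing whitespace.
clean_lines = [line.rstrip() for line in io.StringIO(sample)]

for line in clean_lines:
    print(line)   # single-spaced, because print() supplies the only \n
```

Note that rstrip() also removes the stray trailing spaces, which print(line, end="") would leave in place.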
In some cases, we may have files where we need to filter the lines, looking for particular patterns. We might have a loop that includes conditional processing via the if statement, as shown in the following code:
    with open("message1.txt", "r") as source:
        for line in source:
            junk1, keyword, size = line.rstrip().partition("Size")
            if keyword != '':
                print(size)
This shows a typical structure for text processing programs. First, we open the file via a with statement context; this assures us that the file will be closed properly no matter what happens.
We use the for statement to iterate through all lines of the file. Each line goes through a two-step process: the rstrip() method removes trailing whitespace, and then the partition() method breaks the line around the keyword Size.
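A quick way to see what partition() hands back is to try it on one line that contains the keyword and one that doesn't. The variable names found and not_found are only for illustration.

```python
# partition() always returns three strings: before, separator, after.
found = "Device Size 10 31/32".partition("Size")

# When the separator is absent, the whole string comes back first,
# followed by two empty strings.
not_found = "Message to HQ".partition("Size")

print(found)      # ('Device ', 'Size', ' 10 31/32')
print(not_found)  # ('Message to HQ', '', '')
```

This is why testing the middle value against '' is enough to tell whether the keyword was present.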
The if statement defines a condition (keyword != '') and some processing that's done only if the condition is True. If the condition is False (the value of keyword is ''), the indented body of the if statement is silently skipped.
The assignment and if statements form the body of the for statement. These two statements are executed once for every line in the file. When we get to the end of the for statement, we can be assured that all lines were processed.
We have to note that there is an exception to the usual "for all lines" assumption about processing files with the for statement. We can use the break statement to exit early from the loop, breaking the usual assumption. We'd prefer to avoid the break statement, making it easy to see that a for statement works for all lines of a file.
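To make the effect of break concrete, here is a sketch that stops at the first line containing Size. The sample text is hypothetical (it adds a third line that isn't in our message file) and io.StringIO stands in for the open file.

```python
import io

# Hypothetical sample with two matching lines.
sample = "Message to HQ\nDevice Size 10 31/32\nDevice Size 11 1/2\n"

seen = []          # every line the loop actually read
first_size = None
for line in io.StringIO(sample):
    seen.append(line.rstrip())
    junk, keyword, size = line.rstrip().partition("Size")
    if keyword != '':
        first_size = size
        break      # leave the loop; the remaining lines are never read

print(first_size, len(seen))
```

Only two of the three lines are read. A reader of this loop can no longer assume that every line was examined, which is the reason for preferring loops without break.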
At the end of the for statement, we're done processing the file. We're done with the with context, too. The file will be closed.
Defining more complex logical conditions
What if we're looking for more than one pattern? What if we're processing more complex data?
Let's say we've got something like this in a file:
    Message to Field Agent 006 1/2
    Proceed to Rendezvous FM16uu62
    Authorization to Pay $250 USD
We're looking for two keywords: Rendezvous and Pay. Python gives us the elif clause as part of the if statement. This clause provides a tidy way to handle multiple conditions gracefully. Here's a script to parse a message to us from the headquarters:
    amount = None
    location = None
    with open("message2.txt", "r") as source:
        for line in source:
            clean = line.lower().rstrip()
            junk, pay, pay_data = clean.partition("pay")
            junk, meet, meet_data = clean.partition("rendezvous")
            if pay != '':
                amount = pay_data
            elif meet != '':
                location = meet_data
            else:
                pass  # ignore this line
    print("Budget", amount, "Meet", location)
We're searching the contents in the file for two pieces of information: the rendezvous location and the amount we can use to bribe our contact. In effect, we're going to summarize this file down to two short facts, discarding the parts we don't care about.
As with the previous examples, we're using a with statement to create a processing context. We're also using the for statement to iterate through all lines of the file.
We've used a two-step process to clean each line. First, we used the lower() method to create a string in lowercase. Then we used the rstrip() method to remove any trailing whitespace from the line.
We applied the partition() method to the cleaned line twice. One partition looked for pay and the other partition looked for rendezvous. If the line could be partitioned on pay, the pay variable would hold the matched text rather than a zero-length string, and pay_data would hold whatever followed it on the line. If the line could be partitioned on rendezvous, then the meet variable would hold the matched text, and meet_data would hold whatever followed it. The "else, if" combination is abbreviated elif in Python.
If none of the previous conditions are true, we don't need to do anything. We don't need an else: clause. But we decided to include the else: clause in case we later needed to add some processing. For now, there's nothing more to do. In Python, the pass statement does nothing. It's a syntactic placeholder; a thing to write when we must write something.
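As a way of trying this logic without a message2.txt file on disk, the following sketch wraps the same loop in a function and feeds it the sample message from an in-memory io.StringIO object. The function name summarize() is hypothetical, and the calls to strip() on the extracted data are an addition made here to tidy the leading space that partition() leaves behind; neither is part of the script above.

```python
import io

def summarize(lines):
    """Return (amount, location) found in an iterable of text lines."""
    amount = None
    location = None
    for line in lines:
        clean = line.lower().rstrip()
        junk, pay, pay_data = clean.partition("pay")
        junk, meet, meet_data = clean.partition("rendezvous")
        if pay != '':
            amount = pay_data.strip()      # added: drop the leading space
        elif meet != '':
            location = meet_data.strip()   # added: drop the leading space
        # No else clause here: lines with neither keyword are ignored.
    return amount, location

message = io.StringIO(
    "Message to Field Agent 006 1/2\n"
    "Proceed to Rendezvous FM16uu62\n"
    "Authorization to Pay $250 USD\n"
)
amount, location = summarize(message)
print("Budget", amount, "Meet", location)
```

Because each line is converted to lowercase before it's partitioned, the extracted values come back in lowercase too: $250 usd and fm16uu62. If the original capitalization matters, the keyword search and the data extraction would need to be handled separately.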