Writing code isn't easy. Even the best programmer in the world can't foresee every possible alternative and flow of the code. This means that executing our code will always produce surprises and unexpected behavior. Some will be very evident and others will be very subtle, but the ability to identify and remove these defects in the code is critical to building solid software.
These defects in software are known as bugs, and removing them is called debugging. Reading the code alone is rarely enough to find them: there are always surprises, and complex code is difficult to follow. That's why the ability to debug by stopping execution and inspecting the current state of the program is important.
This article is an excerpt from a book written by Jaime Buelta titled Python Automation Cookbook. The Python Automation Cookbook helps you develop a clear understanding of how to automate your business processes using Python, including detecting opportunities by scraping the web, analyzing information to generate automatic spreadsheet reports with graphs, and communicating with automatically generated emails. To follow along with the examples implemented in the article, you can find the code on the book's GitHub repository.
In this article, we will see some of the tools and techniques for debugging, and apply them specifically to Python scripts. The scripts will have some bugs that we will fix as part of the recipe.
A simple, yet very effective, debugging approach is to output variables and other information at strategic parts of your code to follow the flow of the program. The simplest form of this approach is called print debugging: inserting print statements at certain points to print the values of variables or to mark that a point in the code was reached.
Taking this technique a little further and combining it with logging allows us to create a semi-permanent trace of the execution of the program, which can be really useful when detecting issues in a running program.
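As a minimal sketch of the idea, the snippet below replaces print calls with logging calls at two levels. The average function, the logger name, and the in-memory stream are hypothetical and are used only so that the output can be inspected; they are not part of the book's scripts.

```python
import io
import logging

# Send log lines to an in-memory stream so they can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s:%(message)s'))

logger = logging.getLogger('debug_sketch')
logger.addHandler(handler)
logger.propagate = False  # keep the sketch independent of the root logger
logger.setLevel(logging.INFO)  # DEBUG messages are hidden at this level


def average(values):
    logger.info(f'Averaging {len(values)} values')
    total = 0
    for value in values:
        total += value
        # Trace of the intermediate state, shown only at DEBUG level
        logger.debug(f'running total: {total}')
    return total / len(values)


average([1, 2, 3])
info_only = stream.getvalue()

logger.setLevel(logging.DEBUG)  # now the intermediate trace is shown too
average([1, 2, 3])
with_debug = stream.getvalue()[len(info_only):]
print(with_debug)
```

Unlike print statements, the debug lines can stay in the code: they cost little when the level is INFO and can be switched on when a problem needs investigating.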
Download the debug_logging.py file from GitHub. It contains an implementation of the bubble sort algorithm, which is one of the simplest ways to sort a list of elements. It iterates several times over the list, and on each iteration, two adjacent values are checked and interchanged, so the bigger one is after the smaller. This makes the bigger values ascend like bubbles in the list.
When run, it checks the following list to verify that it is correct:
assert [1, 2, 3, 4, 7, 10] == bubble_sort([3, 7, 10, 2, 4, 1])
Run the script. The list is not sorted correctly and the assertion fails:
$ python debug_logging.py
INFO:Sorting the list: [3, 7, 10, 2, 4, 1]
INFO:Sorted list: [2, 3, 4, 7, 10, 1]
Traceback (most recent call last):
  File "debug_logging.py", line 17, in <module>
    assert [1, 2, 3, 4, 7, 10] == bubble_sort([3, 7, 10, 2, 4, 1])
AssertionError
To display the debug logs, find the following line in the script:
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
Change the preceding line to the following one:
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)
Note the different level.
Run the script again to display the intermediate steps:
$ python debug_logging.py
INFO:Sorting the list: [3, 7, 10, 2, 4, 1]
DEBUG:alist: [3, 7, 10, 2, 4, 1]
DEBUG:alist: [3, 7, 10, 2, 4, 1]
DEBUG:alist: [3, 7, 2, 10, 4, 1]
DEBUG:alist: [3, 7, 2, 4, 10, 1]
DEBUG:alist: [3, 7, 2, 4, 10, 1]
DEBUG:alist: [3, 2, 7, 4, 10, 1]
DEBUG:alist: [3, 2, 4, 7, 10, 1]
DEBUG:alist: [2, 3, 4, 7, 10, 1]
DEBUG:alist: [2, 3, 4, 7, 10, 1]
DEBUG:alist: [2, 3, 4, 7, 10, 1]
INFO:Sorted list : [2, 3, 4, 7, 10, 1]
Traceback (most recent call last):
  File "debug_logging.py", line 17, in <module>
    assert [1, 2, 3, 4, 7, 10] == bubble_sort([3, 7, 10, 2, 4, 1])
AssertionError
The output shows that the last element of the list is not sorted. Find the following line:
for passnum in reversed(range(len(alist) - 1)):
Change the preceding line to the following one:
for passnum in reversed(range(len(alist))):
(Notice the removal of the -1 operation.)
Run the script again to check that the list is now sorted:
$ python debug_logging.py
INFO:Sorting the list: [3, 7, 10, 2, 4, 1]
...
INFO:Sorted list : [1, 2, 3, 4, 7, 10]
Step 1 presents the script and shows that the code is faulty, as it's not properly sorting the list. The script already has some logs to show the start and end result, as well as some debug logs that show each intermediate step.
In step 2, we activate the display of the DEBUG logs, as in step 1 only the INFO ones were shown.
Step 3 runs the script again, this time displaying extra information, showing that the last element in the list is not sorted.
The bug is an off-by-one error, a very common kind of mistake: the outer loop stops one element short, when it should cover the whole size of the list. This is fixed in step 4.
Step 5 shows that the fixed script runs correctly.
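The off-by-one fix can be sketched as follows. This is a reconstruction that assumes debug_logging.py is structured with an outer loop over passnum and an inner loop comparing adjacent elements, which is consistent with the debug output shown earlier; the actual file on GitHub may differ in its details.

```python
import logging

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)


def bubble_sort(alist):
    logging.info(f'Sorting the list: {alist}')
    # Fixed: range(len(alist)) instead of range(len(alist) - 1), so the
    # first pass reaches the last element of the list.
    for passnum in reversed(range(len(alist))):
        for i in range(passnum):
            if alist[i] > alist[i + 1]:
                # Swap adjacent values so the bigger one moves up
                alist[i], alist[i + 1] = alist[i + 1], alist[i]
            logging.debug(f'alist: {alist}')
    logging.info(f'Sorted list : {alist}')
    return alist


assert [1, 2, 3, 4, 7, 10] == bubble_sort([3, 7, 10, 2, 4, 1])
```

With len(alist) - 1, the first pass in this sketch stops before comparing the last pair of elements, which is why the 1 stayed at the end of the list.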
Python has a ready-to-go debugger called pdb. Because Python code is interpreted, execution can be stopped at any point by setting a breakpoint, which drops into an interactive command line where any code can be run to analyze the situation and execute any number of instructions.
Let's see how to do it.
Download the debug_algorithm.py script, available from GitHub. The code checks whether numbers follow certain properties:
def valid(candidate):
    if candidate <= 1:
        return False

    lower = candidate - 1
    while lower > 1:
        if candidate / lower == candidate // lower:
            return False
        lower -= 1

    return True

assert not valid(1)
assert valid(3)
assert not valid(15)
assert not valid(18)
assert not valid(50)
assert valid(53)
You may recognize what the code is doing, but bear with me so that we can analyze it interactively.
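Run the script to check that it executes without errors:
$ python debug_algorithm.py
Add a breakpoint() call right after the while line:
    while lower > 1:
        breakpoint()
        if candidate / lower == candidate // lower:
Run the script again. It stops at the breakpoint and enters the interactive Pdb mode:
$ python debug_algorithm.py
> .../debug_algorithm.py(8)valid()
-> if candidate / lower == candidate // lower:
(Pdb)
Check the value of candidate and the two operations compared in the condition:
(Pdb) candidate
3
(Pdb) candidate / lower
1.5
(Pdb) candidate // lower
1
Move to the next line of code with n, several times:
(Pdb) n
> ...debug_algorithm.py(10)valid()
-> lower -= 1
(Pdb) n
> ...debug_algorithm.py(6)valid()
-> while lower > 1:
(Pdb) n
> ...debug_algorithm.py(12)valid()
-> return True
(Pdb) n
--Return--
> ...debug_algorithm.py(12)valid()->True
-> return True
Continue with c until the next breakpoint, and check the values again:
(Pdb) c
> ...debug_algorithm.py(8)valid()
-> if candidate / lower == candidate // lower:
(Pdb) candidate
15
(Pdb) lower
14
Stop the execution with q:
(Pdb) q
...
bdb.BdbQuit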
The code is, as you probably know already, checking whether a number is a prime number. It tries to divide the number by all integers lower than it. If at any point the number is divisible, it returns a False result, because it's not a prime.
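The divisibility check compares true division with floor division, which are equal only when the division is exact. A more common way to write the same test is the modulo operator. The sketch below, with the hypothetical name valid_modulo, is one possible equivalent and not code from the book:

```python
def valid(candidate):
    """Primality check as written in the recipe."""
    if candidate <= 1:
        return False

    lower = candidate - 1
    while lower > 1:
        if candidate / lower == candidate // lower:
            return False
        lower -= 1

    return True


def valid_modulo(candidate):
    """Same check, testing divisibility with the modulo operator."""
    if candidate <= 1:
        return False
    # any() stops at the first divisor found, like the early return above
    return not any(candidate % lower == 0 for lower in range(2, candidate))


# Both versions agree on a range of small numbers
assert all(valid(n) == valid_modulo(n) for n in range(200))
```

The modulo version stays in integer arithmetic, so it does not depend on float precision for very large candidates.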
After checking the general execution in step 1, in step 2, we introduced a breakpoint in the code.
When the code is executed in step 3, it stops at the breakpoint position, entering into an interactive mode. In the interactive mode, we can inspect the values of any variable as well as perform any kind of operation.
As demonstrated in step 4, sometimes, a line of code can be better analyzed by reproducing its parts. The code can be inspected and regular operations can be executed in the command line.
The next line of code can be executed by calling n(ext), as done in step 5 several times, to see the flow of the code.
Step 6 shows how to resume the execution with the c(ontinue) command, in order to stop at the next breakpoint. All these operations can be iterated to see the flow and values, and to understand what the code is doing at any point.
The execution can be stopped with q(uit), as demonstrated in step 7.
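The same commands can also be fed to pdb from a script, which is handy for trying them out without typing. The sketch below is an illustration under stated assumptions rather than part of the recipe: it uses pdb.Pdb with in-memory stdin and stdout, and runcall, which stops at the first line of the function instead of at a breakpoint() call.

```python
import io
import pdb


def valid(candidate):
    if candidate <= 1:
        return False

    lower = candidate - 1
    while lower > 1:
        if candidate / lower == candidate // lower:
            return False
        lower -= 1

    return True


# Commands that would normally be typed at the (Pdb) prompt
commands = io.StringIO('p candidate\nn\nn\np lower\nc\n')
transcript = io.StringIO()

debugger = pdb.Pdb(stdin=commands, stdout=transcript,
                   nosigint=True, readrc=False)
# runcall() debugs a single call and returns the function's result
result = debugger.runcall(valid, 3)

print(transcript.getvalue())
print(result)
```

The transcript contains the prompts, the locations where execution stopped, and the values printed by the p commands, much like the interactive session above.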
In this recipe, we will analyze a small script that replicates a call to an external service, analyzing it and fixing some bugs. We will show different techniques to improve the debugging.
The script will send some personal names to an internet server (httpbin.org, a test site) to get them back, simulating their retrieval from an external server. It will then split them into first and last name and prepare them to be sorted by surname. Finally, it will sort them.
The script contains several bugs that we will detect and fix.
For this recipe, we will use the requests and parse modules and include them in our virtual environment:
$ echo "requests==2.18.3" >> requirements.txt
$ echo "parse==1.8.2" >> requirements.txt
$ pip install -r requirements.txt
The debug_skills.py script is available from GitHub. Note that it contains bugs that we will fix as part of this recipe.
Run the script, which produces an error:
$ python debug_skills.py
Traceback (most recent call last):
  File "debug_skills.py", line 26, in <module>
    raise Exception(f'Error accessing server: {result}')
Exception: Error accessing server: <Response [405]>
Fix the call to the server, replacing .get with .post:
# ERROR Step 2. Using .get when it should be .post
# (old) result = requests.get('http://httpbin.org/post', json=data)
result = requests.post('http://httpbin.org/post', json=data)
We keep the old buggy code commented with (old) for clarity of changes.
Run the script again to find the next error:
$ python debug_skills.py
Traceback (most recent call last):
  File "debug_skills_solved.py", line 34, in <module>
    first_name, last_name = full_name.split()
ValueError: too many values to unpack (expected 2)
Add a breakpoint() call on the line before the split and run the script, stepping with n and c:
$ python debug_skills_solved.py
..debug_skills.py(35)<module>()
-> first_name, last_name = full_name.split()
(Pdb) n
> ...debug_skills.py(36)<module>()
-> ready_name = f'{last_name}, {first_name}'
(Pdb) c
> ...debug_skills.py(34)<module>()
-> breakpoint()
Running n does not produce an error, meaning that the first value is not the problem. After continuing with c a few times, we realize that this is not the correct approach, as we don't know which input generates the error.
Instead, wrap the line in a try...except block and trigger the breakpoint only when the ValueError is raised:
try:
    first_name, last_name = full_name.split()
except ValueError:
    breakpoint()
Run the script again and check the value of full_name:
$ python debug_skills.py
> ...debug_skills.py(38)<module>()
-> ready_name = f'{last_name}, {first_name}'
(Pdb) full_name
'John Paul Smith'
Change the code to split only at the last space:
# ERROR Step 6 split only two words. Some names have middle names
# (old) first_name, last_name = full_name.split()
first_name, last_name = full_name.rsplit(maxsplit=1)
Run the script again. It no longer fails, but some of the names look wrong:
$ python debug_skills_solved.py
['Berg, Keagan', 'Cordova, Mai', 'Craig, Michael', 'Garc\\u00eda, Roc\\u00edo', 'Mccabe, Fathima', "O'Carroll, S\\u00e9amus", 'Pate, Poppy-Mae', 'Rennie, Vivienne', 'Smith, John Paul', 'Smyth, John', 'Sullivan, Roman']
Who's called O'Carroll, S\\u00e9amus?
Add a breakpoint that is triggered only for that name:
full_name = parse.search('"custname": "{name}"', raw_result)['name']
if "O'Carroll" in full_name:
    breakpoint()
Run the script, which stops at the conditional breakpoint:
$ python debug_skills.py
> debug_skills.py(38)<module>()
-> first_name, last_name = full_name.rsplit(maxsplit=1)
(Pdb) full_name
"S\\u00e9amus O'Carroll"
Work backward through the values that led to full_name:
(Pdb) full_name
"S\\u00e9amus O'Carroll"
(Pdb) raw_result
'{"custname": "S\\u00e9amus O\'Carroll"}'
(Pdb) result.json()
{'args': {}, 'data': '{"custname": "S\\u00e9amus O\'Carroll"}', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Content-Length': '37', 'Content-Type': 'application/json', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.3'}, 'json': {'custname': "Séamus O'Carroll"}, 'origin': '89.100.17.159', 'url': 'http://httpbin.org/post'}
Inspect the json field of the result:
(Pdb) result.json()['json']
{'custname': "Séamus O'Carroll"}
(Pdb) type(result.json()['json'])
<class 'dict'>
Use the decoded JSON value instead of parsing the raw data:
# ERROR Step 11. Obtain the value from a raw value. Use
# the decoded JSON instead
# raw_result = result.json()['data']
# Extract the name from the result
# full_name = parse.search('"custname": "{name}"', raw_result)['name']
raw_result = result.json()['json']
full_name = raw_result['custname']
Run the script one more time:
$ python debug_skills.py
['Berg, Keagan', 'Cordova, Mai', 'Craig, Michael', 'García, Rocío', 'Mccabe, Fathima', "O'Carroll, Séamus", 'Pate, Poppy-Mae', 'Rennie, Vivienne', 'Smith, John Paul', 'Smyth, John', 'Sullivan, Roman']
This time, it's all correct! You have successfully debugged the program!
The structure of the recipe is divided into three different problems. Let's analyze it in small blocks:
First error: wrong call to the external service
After the first error appears in step 1, we carefully read the error message, which says that the server is returning a 405 status code. This corresponds to Method Not Allowed, indicating that the HTTP method used in our call is not correct.
Inspect the following line:
result = requests.get('http://httpbin.org/post', json=data)
It shows that we are making a GET call to a URL that's defined for POST, so we make the change in step 2.
We run the code in step 3 to find the next problem.
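The meaning of a status code can be checked from Python itself with the http module of the standard library. A small sketch, independent of the recipe's script:

```python
from http import HTTPStatus

status = HTTPStatus(405)
print(status.value, status.phrase)  # 405 Method Not Allowed

# 4xx codes are client errors: the request itself has to change
is_client_error = 400 <= status.value < 500
```

Because the code is in the 4xx range, the problem is in how the request was made, which pointed to the wrong HTTP method rather than to a server failure.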
Second error: wrong handling of middle names
In step 3, we get an error of too many values to unpack. We create a breakpoint in step 4 to analyze the data at this point, but discover that not all the data produces this error. The analysis done in step 4 shows that it may be very confusing to stop the execution when no error is produced, having to continue until one is. We know that the error is produced at this point, but only for certain kinds of data.
As we know that the error is being produced at some point, we capture it in a try..except block in step 5. When the exception is produced, we trigger the breakpoint.
This makes the execution of the script in step 6 stop when the full_name is 'John Paul Smith'. This produces an error because the unpacking expects two elements, not three.
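The idea of reacting only to the failing input can be sketched without a debugger by recording the offending values. The names list below is made up for illustration (only 'John Paul Smith' comes from the session above) and, instead of calling breakpoint(), the except block stores the input that raised the error:

```python
names = ['John Smyth', 'John Paul Smith', 'Mai Cordova']

failing = []
for full_name in names:
    try:
        first_name, last_name = full_name.split()
    except ValueError:
        # In the recipe, breakpoint() goes here to stop on the bad input
        failing.append(full_name)

print(failing)  # ['John Paul Smith']
```

Catching ValueError specifically, rather than every exception, keeps unrelated errors visible while the failing input is being tracked down.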
This is fixed in step 7, allowing everything except the last word to be part of the first name, grouping any middle name(s) into the first element. This fits our purpose for this program, to sort by the last name.
The following line does that with rsplit:
first_name, last_name = full_name.rsplit(maxsplit=1)
It divides the text by words, starting from the right and making a maximum of one split, guaranteeing that only two elements will be returned.
When the code is changed, step 8 runs the code again to discover the next error.
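A short sketch of how rsplit behaves compared with split, including the "Last, First" formatting used for sorting. The sample names are taken from the output of the script; the helper function ready_name is hypothetical, with its name borrowed from the variable seen in the debugger output.

```python
full_name = 'John Paul Smith'

print(full_name.split())             # ['John', 'Paul', 'Smith']
print(full_name.rsplit(maxsplit=1))  # ['John Paul', 'Smith']


def ready_name(full_name):
    # One split from the right: everything before the last word
    # is treated as the first name
    first_name, last_name = full_name.rsplit(maxsplit=1)
    return f'{last_name}, {first_name}'


names = ['John Paul Smith', 'Keagan Berg', 'John Smyth']
print(sorted(ready_name(name) for name in names))
# ['Berg, Keagan', 'Smith, John Paul', 'Smyth, John']
```

Note that a name consisting of a single word would still fail to unpack, since rsplit would return only one element.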
Third error: using the wrong value returned by the external service
Running the code in step 8 displays the list and does not produce any errors. But, examining the results, we can see that some of the names are incorrectly processed.
We pick one of the examples in step 9 and create a conditional breakpoint. We only activate the breakpoint if the data fulfills the if condition.
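A conditional breakpoint is an ordinary if statement guarding the breakpoint() call. To show it without an interactive session, the sketch below replaces sys.breakpointhook, the function that breakpoint() calls, with a hypothetical recorder; the process function and the names list are sample code and data, not the recipe's script.

```python
import sys

stops = []


def recording_hook(*args, **kwargs):
    # Stand-in for pdb: note the value of full_name in the calling frame
    caller = sys._getframe(1)
    stops.append(caller.f_locals['full_name'])


def process(names):
    for full_name in names:
        if "O'Carroll" in full_name:
            breakpoint()  # reached only for the matching name


sys.breakpointhook = recording_hook
try:
    process(['John Smyth', "S\\u00e9amus O'Carroll", 'Mai Cordova'])
finally:
    # Restore the default hook, which starts pdb
    sys.breakpointhook = sys.__breakpointhook__

print(stops)
```

With the default hook in place, the same if statement would open the Pdb prompt only for the name under investigation, skipping all the inputs that behave correctly.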
The code is run again in step 10. From there, once we have validated that the data is as expected, we work backward to find the root of the problem. Step 11 analyzes previous values and the code up to that point, trying to find out what led to the incorrect value.
We then discover that we used the wrong field of the result returned by the server. The value in the json field is better for this task and it's already parsed for us. Step 12 checks the value and sees how it should be used.
In step 13, we change the code to adjust. Notice that the parse module is no longer needed and that the code is actually cleaner using the json field.
Once this is fixed, the code is run again in step 14. Finally, the code is doing what's expected, sorting the names alphabetically by surname. Notice that the other name that contained strange characters is fixed as well.
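The difference between the two fields can be reproduced with the standard library alone. In this sketch, raw_data mimics the data field shown in the debugger, a regular expression stands in for the parse.search call (an assumption made to avoid the third-party dependency), and json.loads does the decoding that result.json()['json'] already provides:

```python
import json
import re

# Raw body as seen in the 'data' field: the é is still a \u00e9 escape
raw_data = '{"custname": "S\\u00e9amus O\'Carroll"}'

# Text extraction keeps the escape sequence untouched
from_text = re.search(r'"custname": "(.*)"', raw_data).group(1)
print(from_text)   # S\u00e9amus O'Carroll

# Decoding the JSON turns the escape into the real character
from_json = json.loads(raw_data)['custname']
print(from_json)   # Séamus O'Carroll
```

Searching the raw text treats the JSON body as plain characters, so escape sequences survive; decoding it first is what restores the original name.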
To summarize, this article discussed different methods and tips to help in the debugging process and ensure the quality of your software. It leverages the great introspection capabilities of Python and its out-of-the-box debugging tools for fixing problems and producing solid automated software.
If you found this post useful, do check out the book, Python Automation Cookbook.