Extending Excel with Python and R

Reading Excel Spreadsheets

In the wide landscape of data analysis, Excel is a trusted companion that simplifies organizing, calculating, and presenting information. Its intuitive interface and widespread use have cemented its position as a staple of the business world. However, as the volume and complexity of data continue to grow, Excel’s capabilities may start to feel constrained. It is precisely at this point that the worlds of Excel, R, and Python converge. Extending Excel with Python and R shows how these programming languages work together with Excel to expand what it can do. In this book, we will delve into how to integrate Excel with R and Python so that you can extract valuable insights, automate processes, and get more out of your data analysis.

Microsoft Excel came to market in 1985 and has remained a popular spreadsheet software choice. Its predecessor at Microsoft was a spreadsheet program called Multiplan. Microsoft Excel and databases in general share some similarities in terms of organizing and managing data, although they serve different purposes. Excel is a spreadsheet program that allows users to store and manipulate data in a tabular format. It consists of rows and columns, where each cell can contain text, numbers, or formulas. Similarly, a database is a structured collection of data stored in tables, consisting of rows and columns.

Both Excel and databases provide a way to store and retrieve data. In Excel, you can enter data, perform calculations, and create charts and graphs. Similarly, databases store and manage large amounts of structured data and enable querying, sorting, and filtering. Excel and databases also support the concept of relationships. In Excel, you can link cells or ranges across different sheets, creating connections between data. Databases use relationships to link tables based on common fields, allowing you to retrieve related data from multiple tables.

This chapter aims to familiarize you with reading Excel files into the R and Python environments and performing some manipulation on them. Specifically, in this chapter, we’re going to cover the following main topics:

R packages for Excel manipulation
Reading Excel files to manipulate with R
Reading multiple Excel sheets with a custom R function
Python packages for Excel manipulation
Opening an Excel sheet from Python and reading the data

Technical requirements

At the time of writing, we are using the following:

R 4.2.1
The RStudio 2023.03.1+446 “Cherry Blossom” release for Windows

For this chapter, you will need to install the following packages:

readxl
openxlsx
xlsx

To run the Python code in this chapter, we will be using the following:

Python 3.11
pandas
openpyxl
The iris_data.xlsx Excel file available in this book’s GitHub repository

Setting up a Python environment is outside the scope of this book, but it is easy to do. Once you have one, the necessary packages can be installed by running the following commands:

python -m pip install pandas==2.0.1
python -m pip install openpyxl==3.1.2

Note that these commands have to be run from a terminal and not from within a Python script.

This book’s GitHub repository also contains a requirements.txt file that you can use to install all dependencies. You can do this by running the following command:

python -m pip install -r requirements.txt

This command needs to be run in the folder where requirements.txt resides; otherwise, the full path to the requirements.txt file has to be included. It installs all the packages that will be used in this chapter so that you don’t have to install them one by one. It also helps ensure that the whole dependency tree (including the dependencies of the dependencies) is the same as what this book’s authors have used.

Alternatively, when using Jupyter Notebooks, you can use the following magic commands:

%pip install pandas==2.0.1
%pip install openpyxl==3.1.2
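
After installing, it can be useful to confirm which versions actually ended up in the active environment. The following is a minimal sketch that uses only the standard library; the helper name installed_versions is our own and is not part of the book’s repository:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the two packages this chapter relies on
for name, ver in installed_versions(["pandas", "openpyxl"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If a package is reported as not installed, the terminal or notebook is most likely using a different Python environment than the one the packages were installed into.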

All of the code examples in this book are available in its GitHub repository: https://github.com/PacktPublishing/Extending-Excel-with-Python-and-R. Each chapter has its own folder, with the current one being Chapter01.

Note

Technical requirements for Python throughout the book are compiled in the requirements.txt file, available in the GitHub repository at https://github.com/PacktPublishing/Extending-Excel-with-Python-and-R/blob/main/requirements.txt. Installing these dependencies will streamline your coding experience and ensure smooth progression through the book. Be sure to install them all before diving into the exercises.

Reading Excel files to R

In this section, we are going to read data from Excel with a few different R libraries. We need to do this before we can even consider performing any type of manipulation or analysis on the data contained in the sheets of the Excel files.

As mentioned in the Technical requirements section, we are going to be using the readxl, openxlsx, and xlsx packages to read data into R.

Installing and loading libraries

In this section, we are going to install the necessary libraries, if you do not have them yet, and load them. We are going to use the openxlsx, xlsx, and readxl libraries. To install and load them, run the following code block:

pkgs <- c("openxlsx", "xlsx", "readxl")
install.packages(pkgs, dependencies = TRUE)
lapply(pkgs, library, character.only = TRUE)

The lapply() function in R applies a function to each element of a list, vector, or data frame and returns the results as a list. Its two main arguments are X and FUN, where X is the object to iterate over and FUN is the function that is applied to each element of X. Here, it calls library() on each package name in pkgs.

Now that the libraries have been installed, we can get to work. We are going to read a spreadsheet created from the Iris dataset that ships with base R. We will read the file with three different libraries, and then create a custom function, read_excel_sheets(), that uses the readxl library to read all the sheets of an Excel file.

Let’s start reading the files. The first library we will use to open an Excel file is openxlsx. To create the Excel file we are working with, you can run the ch1_create_iris_dataset.R script in the Chapter01 folder of this book’s GitHub repository. Refer to the following screenshot to see how to read the file into R.

You will notice a variable called f_pat. This is the path to where the Iris dataset was saved as an Excel file – for example, C:/User/UserName/Documents/iris_data.xlsx:

Figure 1.1 – Using the openxlsx package to read the Excel file

The preceding screenshot shows how to read an Excel file. This example assumes that you have used the ch1_create_iris_dataset.R file to create the example Excel file. In reality, you can read in any Excel file that you would like or need.

Now, we will perform the same type of operation, but this time with the xlsx library. Refer to the following screenshot, which uses the same methodology as with the openxlsx package:

Figure 1.2 – Using the xlsx library and the read.xlsx() function to open the Excel file we’ve created

Finally, we will use the readxl library, which is part of the tidyverse:

Figure 1.3 – Using the readxl library and the read_excel() function to read the Excel file into memory

In this section, we learned how to read in an Excel file with a few different packages. While these packages can do more than simply read in an Excel file, that is what we needed to focus on in this section. You should now be familiar with how to use the readxl::read_excel(), xlsx::read.xlsx(), and openxlsx::read.xlsx() functions.

Building upon our expertise in reading Excel files into R, we’ll now embark on the next phase of our journey: unraveling the secrets of efficiently extracting data from multiple sheets within an Excel file.

Reading multiple sheets with readxl and a custom function

In Excel, we often encounter workbooks that have multiple sheets in them. These could be stats for different months of the year, table data that follows a specific format month over month, or some other period. The point is that we may want to read all the sheets in a file for one reason or another, and we should not call the read function from a particular package for each sheet. Instead, we should use the power of R to loop through this with purrr.

Let’s build a customized function. To do this, we are going to load the readxl library. If it is already loaded, this step is not necessary. If it is installed and you do not wish to load the library into memory, you can instead call the excel_sheets() function as readxl::excel_sheets():

Figure 1.4 – Creating a function to read all the sheets of an Excel file at once – read_excel_sheets()

The new code can be broken down as follows:

 read_excel_sheets <- function(filename, single_tbl) {

This line defines a function called read_excel_sheets that takes two arguments: filename (the name of the Excel file to be read) and single_tbl (a logical value indicating whether the function should return a single table or a list of tables).

Next, we have the following line:

sheets <- readxl::excel_sheets(filename)

This line uses the readxl package to extract the names of all the sheets in the Excel file specified by filename. The sheet names are stored in the sheets variable.

Here’s the next line:

if (single_tbl) {

This line starts an if statement that checks the value of the single_tbl argument.

Now, we have the following:

x <- purrr::map_df(sheets, read_excel, path = filename)

If single_tbl is TRUE, this line uses the purrr package’s map_df function to iterate over each sheet name in sheets and read the corresponding sheet using the read_excel function from the readxl package. Because path is supplied by name, each sheet name is passed to read_excel as its next argument, sheet. The resulting data frames are combined into a single table, which is assigned to the x variable.

Now, we have the following line:

} else {

This line indicates the start of the else block of the if statement. If single_tbl is FALSE, the code in this block will be executed.

Here’s the next line:

 x <- purrr::map(sheets, ~ readxl::read_excel(filename, sheet = .x))

In this line, the purrr package’s map function is used to iterate over each sheet name in sheets. For each sheet, the read_excel function from the readxl package is called to read the corresponding sheet from the Excel file specified by filename. The resulting data frames are stored in a list assigned to the x variable.

Now, we have the following:

 purrr::set_names(x, sheets)

This line uses the set_names function from the purrr package to set the names of the elements in the x list to the sheet names in sheets.

Finally, the last line of the function returns the value of x, which will be either a single table (data.frame) if single_tbl is TRUE, or a list of tables (data.frame) if single_tbl is FALSE.

In summary, the read_excel_sheets function takes an Excel filename and a logical value indicating whether to return a single table or a list of tables. It uses the readxl package to extract the sheet names from the Excel file, and then reads the corresponding sheets either into a single table (if single_tbl is TRUE) or into a list of tables (if single_tbl is FALSE). The resulting data is returned as the output of the function. To see how this works, let’s look at the following example.

We have a spreadsheet that has four tabs in it – one for each species in the famous Iris dataset and then one sheet called iris, which is the full dataset.

As shown in Figure 1.5, the read_excel_sheets() function has read all four sheets of the Excel file. We can also see that the function has imported the sheets as a list object and has named each item in the list after the name of the corresponding tab in the Excel file. It is also important to note that the sheets must all have the same column names and structure for this to work:

Figure 1.5 – Excel file read by read_excel_sheets()

In this section, we learned how to write a function that will read all of the sheets in any Excel file. This function can also return them as a named list, where the names are the names of the tabs in the file itself.

Now that we have learned how to read Excel sheets in R, in the next section, we will cover Python, where we will revisit the same concepts but from the perspective of the Python language.

Python packages for Excel manipulation

In this section, we will explore how to read Excel spreadsheets using Python. One of the key aspects of working with Excel files in Python is having the right set of packages that provide the necessary functionality. We will discuss some commonly used Python packages for Excel manipulation and highlight their advantages and considerations.

Python packages for Excel manipulation

When it comes to interacting with Excel files in Python, several packages offer a range of features and capabilities. These packages allow you to extract data from Excel files, manipulate the data, and write it back to Excel files. Let’s take a look at some popular Python packages for Excel manipulation.

pandas

pandas is a powerful data manipulation library that can read Excel files using the read_excel function. The advantage of using pandas is that it provides a DataFrame object, which allows you to manipulate the data in a tabular form. This makes it easy to perform data analysis and manipulation. pandas excels in handling large datasets efficiently and provides flexible options for data filtering, transformation, and aggregation.

openpyxl

openpyxl is a widely used library specifically designed for working with Excel files. It provides a comprehensive set of features for reading and writing Excel spreadsheets, including support for various Excel file formats and compatibility with different versions of Excel. In addition, openpyxl allows fine-grained control over the structure and content of Excel files, enabling tasks such as accessing individual cells, creating new worksheets, and applying formatting.

xlrd and xlwt

xlrd and xlwt are older libraries that are still in use for reading and writing Excel files, particularly with legacy formats such as .xls. xlrd enables reading data from Excel files, while xlwt facilitates writing data to Excel files. These libraries are lightweight and straightforward to use, but they lack some of the advanced features provided by pandas and openpyxl.

Considerations

When choosing a Python package for Excel manipulation, it’s essential to consider the specific requirements of your project. Here are a few factors to keep in mind:

Functionality: Evaluate the package’s capabilities and ensure it meets your needs for reading Excel files. Consider whether you require advanced data manipulation features or if a simpler package will suffice.
Performance: If you’re working with large datasets or need efficient processing, packages such as pandas, which have optimized algorithms, can offer significant performance advantages.
Compatibility: Check the compatibility of the package with different Excel file formats and versions. Ensure that it supports the specific format you are working with to avoid any compatibility issues.
Learning curve: Consider the learning curve associated with each package. Some packages, such as pandas, have a more extensive range of functionality, but they may require additional time and effort to master.

Each package offers unique features and has its strengths and weaknesses, allowing you to read Excel spreadsheets effectively in Python. For example, if you need to read and manipulate large amounts of data, pandas may be the better choice. However, if you need fine-grained control over the Excel file, openpyxl will likely fit your needs better.

Consider the specific requirements of your project, such as data size, functionality, and compatibility, to choose the most suitable package for your needs. In the following sections, we will delve deeper into how to utilize these packages to read and extract data from Excel files using Python.
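
The compatibility consideration can be made concrete with a small lookup from file extension to the engine name that pandas accepts in the engine argument of read_excel. This is only a sketch: the helper name pick_engine and the mapping are our own illustration, and pandas can usually work out the engine by itself when none is given:

```python
from pathlib import Path

# File extension -> value for the `engine` argument of pandas.read_excel
ENGINE_BY_SUFFIX = {
    ".xlsx": "openpyxl",  # modern workbook
    ".xlsm": "openpyxl",  # modern macro-enabled workbook
    ".xls": "xlrd",       # legacy binary workbook
}


def pick_engine(file_path):
    """Return the engine name to use for the given Excel file path."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in ENGINE_BY_SUFFIX:
        raise ValueError(f"No engine configured for {suffix!r} files")
    return ENGINE_BY_SUFFIX[suffix]


print(pick_engine("iris_data.xlsx"))
```

The result can then be passed on explicitly, for example as pd.read_excel(path, engine=pick_engine(path)), which makes an unsupported format fail early with a clear message.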

Opening an Excel sheet from Python and reading the data

When working with Excel files in Python, it’s common to need to open a specific sheet and read the data into Python for further analysis. This can be achieved using popular libraries such as pandas and openpyxl, as discussed in the previous section.

You can most likely use other Python and package versions, but the code in this section has not been tested with anything other than what we’ve stated here.

Using pandas

pandas is a powerful data manipulation library that simplifies the process of working with structured data, including Excel spreadsheets. To read an Excel sheet using pandas, you can use the read_excel function. Let’s consider an example of using the iris_data.xlsx file with a sheet named setosa:

import pandas as pd
# Read the Excel file
df = pd.read_excel('iris_data.xlsx', sheet_name='setosa')
# Display the first few rows of the DataFrame
print(df.head())

Run this code with the Python working directory set to the folder that contains the Excel file, or provide the full path to the file in the read_excel() call:

Figure 1.6 – Using the pandas package to read the Excel file

In the preceding code snippet, we imported the pandas library and utilized the read_excel function to read setosa from the iris_data.xlsx file. The resulting data is stored in a pandas DataFrame, which provides a tabular representation of the data. By calling head() on the DataFrame, we displayed the first few rows of the data, giving us a quick preview.
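
Beyond a single named sheet, read_excel can also limit what is read or return every sheet at once. The sketch below is self-contained: it first writes a tiny two-sheet workbook to a temporary folder as a stand-in for iris_data.xlsx, so the file name sample_iris.xlsx, the column names, and the values are assumptions made for illustration only:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in data for two of the sheets in iris_data.xlsx
sample_sheets = {
    "setosa": pd.DataFrame(
        {"Sepal.Length": [5.1, 4.9], "Species": ["setosa", "setosa"]}
    ),
    "versicolor": pd.DataFrame(
        {"Sepal.Length": [7.0, 6.4], "Species": ["versicolor", "versicolor"]}
    ),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_iris.xlsx"

    # Write one worksheet per DataFrame
    with pd.ExcelWriter(sample_path, engine="openpyxl") as writer:
        for sheet_name, frame in sample_sheets.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)

    # Read one sheet, keeping only one column and the first data row
    setosa = pd.read_excel(
        sample_path, sheet_name="setosa", usecols=["Sepal.Length"], nrows=1
    )

    # sheet_name=None returns a dict that maps sheet names to DataFrames
    all_sheets = pd.read_excel(sample_path, sheet_name=None)

print(setosa)
print(list(all_sheets))
```

The dictionary form is a convenient starting point when every sheet of a workbook is needed, since each DataFrame stays addressable by its sheet name.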

Using openpyxl

openpyxl is a powerful library for working with Excel files, offering more granular control over individual cells and sheets. To open an Excel sheet and access its data using openpyxl, we can utilize the load_workbook function. Please note that openpyxl cannot handle .xls files, only the more modern .xlsx and .xlsm versions.

Let’s consider an example of using the iris_data.xlsx file with a sheet named versicolor:

import openpyxl
import pandas as pd
# Load the workbook
wb = openpyxl.load_workbook('iris_data.xlsx')
# Select the sheet
sheet = wb['versicolor']
# Extract the values (including header)
sheet_data_raw = sheet.values
# Separate the headers into a variable
header = next(sheet_data_raw)
# Create a DataFrame based on the second and subsequent lines of data with the header as column names
sheet_data = pd.DataFrame(sheet_data_raw, columns=header)
print(sheet_data.head())

The preceding code results in the following output:

Figure 1.7 – Using the openpyxl package to read the Excel file

In this code snippet, we import the openpyxl library along with pandas. Then, we load the workbook by passing the iris_data.xlsx filename to the load_workbook function. Next, we select the desired sheet by accessing it using its name – in this case, this is versicolor. Once we’ve done this, we read the raw data using the values property of the loaded sheet object. This is a generator and can be consumed with a for loop or by converting it into a list or a DataFrame, for example. In this example, we take the header row off the generator with next() and convert the remaining rows into a pandas DataFrame because it is the format that is the most comfortable to work with later.

Both pandas and openpyxl offer valuable features for working with Excel files in Python. While pandas simplifies data manipulation with its DataFrame structure, openpyxl provides more fine-grained control over individual cells and sheets. Depending on your specific requirements, you can choose the library that best suits your needs.

By mastering the techniques of opening Excel sheets and reading data into Python, you will be able to extract valuable insights from your Excel data, perform various data transformations, and prepare it for further analysis or visualization. These skills are essential for anyone seeking to leverage the power of Python and Excel in their data-driven workflows.

Reading in multiple sheets with Python (openpyxl and custom functions)

In many Excel files, it’s common to have multiple sheets containing different sets of data. Being able to read in multiple sheets and consolidate the data into a single data structure can be highly valuable for analysis and processing. In this section, we will explore how to achieve this using the openpyxl library and a custom function.

The importance of reading multiple sheets

When working with complex Excel files, it’s not uncommon to encounter scenarios where related data is spread across different sheets. For example, you may have one sheet for sales data, another for customer information, and yet another for product inventory. By reading in multiple sheets and consolidating the data, you can gain a holistic view and perform a comprehensive analysis.

Let’s start by examining the basic steps involved in reading in multiple sheets:

Load the workbook: Before accessing the sheets, we need to load the Excel workbook using the load_workbook function provided by openpyxl.
Get the sheet names: We can obtain the names of all the sheets in the workbook using the sheetnames attribute. This allows us to identify the sheets we want to read.
Read data from each sheet: By iterating over the sheet names, we can access each sheet individually and read the data. openpyxl provides methods such as iter_rows or iter_cols to traverse the cells of each sheet and retrieve the desired data.
Store the data: To consolidate the data from multiple sheets, we can use a suitable data structure, such as a pandas DataFrame or a Python list. As we read the data from each sheet, we concatenate or merge it into the consolidated data structure:
- If the data in all sheets follows the same format (as is the case in the example used in this chapter), we can simply concatenate the datasets
- However, if the datasets have different structures because they describe different aspects of a dataset (for example, one sheet contains product information, the next contains customer data, and the third contains the sales of the products to the customers), then we can merge these datasets based on unique identifiers to create a comprehensive dataset
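
The two consolidation strategies in the last step can be sketched with small in-memory DataFrames instead of real sheets. All table contents, column names, and identifiers below are made up for illustration:

```python
import pandas as pd

# Case 1: every sheet has the same structure, so the rows are stacked
setosa = pd.DataFrame({"Species": ["setosa"], "Sepal.Length": [5.1]})
virginica = pd.DataFrame({"Species": ["virginica"], "Sepal.Length": [6.3]})
stacked = pd.concat([setosa, virginica], ignore_index=True)

# Case 2: each sheet describes a different aspect, so they are joined on identifiers
products = pd.DataFrame({"product_id": [1, 2], "product": ["Widget", "Gadget"]})
customers = pd.DataFrame({"customer_id": [10, 11], "customer": ["Ada", "Grace"]})
sales = pd.DataFrame(
    {"product_id": [1, 2, 1], "customer_id": [10, 10, 11], "quantity": [3, 1, 5]}
)

# Left joins keep every sales row and attach the matching product and customer
combined = sales.merge(products, on="product_id", how="left").merge(
    customers, on="customer_id", how="left"
)

print(stacked)
print(combined)
```

Concatenation only makes sense when the columns line up, whereas merging requires a shared key column in each pair of tables.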

Using openpyxl to access sheets

openpyxl is a powerful library that allows us to interact with Excel files using Python. It provides a wide range of functionalities, including accessing and manipulating multiple sheets. Before we dive into the details, let’s take a moment to understand why openpyxl is a popular choice for this task.

One of the primary advantages of openpyxl is its support for the modern Excel file formats, such as .xlsx and .xlsm, so workbooks saved by recent versions of Excel can be read without conversion (legacy .xls files are the exception, as noted earlier). Additionally, openpyxl provides a straightforward and intuitive interface to access sheet data, making it easier for us to retrieve the desired information.

Reading data from each sheet

To begin reading in multiple sheets, we need to load the Excel workbook using the load_workbook function provided by openpyxl. This function takes the file path as input and returns a workbook object that represents the entire Excel file.

Once we have loaded the workbook, we can retrieve the names of all the sheets using the sheetnames attribute. This gives us a list of sheet names present in the Excel file. We can then iterate over these sheet names to read the data from each sheet individually.
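
To see the sheetnames attribute in action without needing a file on disk, the following sketch builds a small workbook in memory. The Workbook object here is a stand-in for the result of load_workbook, and the sheet names mirror the example file described earlier:

```python
from openpyxl import Workbook

# In-memory stand-in for load_workbook("iris_data.xlsx")
wb = Workbook()
wb.active.title = "iris"
for name in ("setosa", "versicolor", "virginica"):
    wb.create_sheet(title=name)

# sheetnames lists the sheets in workbook order
print(wb.sheetnames)

# Each name can be used as a key to fetch the worksheet object
for name in wb.sheetnames:
    sheet = wb[name]
    print(name, type(sheet).__name__)
```

With a workbook loaded from disk, the loop body is where the data of each sheet would be read.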

Retrieving sheet data with openpyxl

openpyxl provides various methods to access the data within a sheet.

Two commonly used methods are iter_rows and iter_cols. These methods allow us to iterate over the rows or columns of a sheet and retrieve the cell values.

Let’s have a look at how iter_rows can be used:

# Assuming wb is the workbook loaded earlier, select the versicolor sheet
sheet = wb['versicolor']
# Iterate over rows and print raw values
for row in sheet.iter_rows(min_row=1, max_row=5, values_only=True):
    print(row)

Similarly, iter_cols can be used like this:

# Iterate over columns and print raw values
for column in sheet.iter_cols(min_col=1, max_col=5, values_only=True):
    print(column)

When using iter_rows or iter_cols, the values_only argument controls what is returned. With values_only=True, as in the preceding examples, each row or column is yielded as a tuple of the values stored in the cells. With the default, values_only=False, the methods yield Cell objects instead, which expose the value together with cell-level details such as the coordinate and the number format applied to the cell.
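
The effect of the values_only argument is easiest to see side by side. This sketch uses a tiny in-memory worksheet with made-up contents, so it runs without an Excel file:

```python
from openpyxl import Workbook

# Tiny in-memory worksheet with a header row and one data row
wb = Workbook()
sheet = wb.active
sheet.append(["Sepal.Length", "Species"])
sheet.append([7.0, "versicolor"])
sheet["A2"].number_format = "0.00"  # display format only, the stored value is unchanged

# values_only=True: each row is a tuple of plain Python values
value_rows = list(sheet.iter_rows(min_row=1, max_row=2, values_only=True))

# values_only=False (the default): each row is a tuple of Cell objects
cell_rows = list(sheet.iter_rows(min_row=2, max_row=2))
first_cell = cell_rows[0][0]

print(value_rows)
print(first_cell.coordinate, first_cell.value, first_cell.number_format)
```

Plain values are usually enough when the data is headed for a DataFrame, while Cell objects are the better fit when formatting or cell positions matter.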

By iterating over the rows or columns of a sheet, we can retrieve the desired data and store it in a suitable data structure. One popular choice is to use pandas DataFrames, which provide a tabular representation of the data and offer convenient methods for data manipulation and analysis.

An alternative solution is using the values attribute of the sheet object. This provides a generator that yields every row of the sheet as a tuple of cell values (much like iter_rows with values_only=True). While a generator cannot be indexed directly, it can be used in a for loop to iterate over the rows. The pandas DataFrame constructor also accepts a suitable generator object and converts it directly into a DataFrame.
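
Because the values attribute hands back a generator, rows are consumed as they are read and cannot be revisited. The sketch below illustrates this with an in-memory worksheet whose contents are made up for the example:

```python
from openpyxl import Workbook
import pandas as pd

wb = Workbook()
sheet = wb.active
sheet.title = "versicolor"
sheet.append(["Sepal.Length", "Species"])
sheet.append([7.0, "versicolor"])
sheet.append([6.4, "versicolor"])

rows = sheet.values   # generator that yields one tuple per row
header = next(rows)   # consumes the first row, which holds the column names

# The DataFrame constructor consumes the remaining rows of the generator
frame = pd.DataFrame(rows, columns=list(header))

# Nothing is left to iterate over afterwards
leftover = list(rows)

print(frame)
print(leftover)
```

If the rows are needed more than once, convert the generator to a list first or simply access sheet.values again to get a fresh generator.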

Combining data from multiple sheets

As we read the data from each sheet, we can store it in a list or dictionary, depending on our needs. Once we have retrieved the data from all the sheets, we can combine it into a single consolidated data structure. This step is crucial for further analysis and processing.

To combine the data, we can use pandas DataFrames. By creating an individual DataFrame for each sheet’s data and then concatenating or merging them into a single DataFrame, we can obtain a comprehensive dataset that includes all the information from multiple sheets.

Custom function for reading multiple sheets

To simplify the process of reading in multiple sheets and consolidating the data, we can create custom functions tailored to our specific requirements. These functions encapsulate the necessary steps and allow us to reuse the code efficiently.

In our example, we define a function called read_multiple_sheets that takes the file path as input. Inside the function, we load the workbook using load_workbook and iterate over the sheet names retrieved with the sheetnames attribute.

For each sheet, we access it using the workbook object and retrieve the data using the custom read_single_sheet function. We then store the retrieved data in a list. Finally, we combine the data from all the sheets into a single pandas DataFrame using the appropriate concatenation method from pandas.

By using these custom functions, we can easily read in multiple sheets from an Excel file and obtain a consolidated dataset that’s ready for analysis. The function provides a reusable and efficient solution, saving us time and effort in dealing with complex Excel files.

Customizing the code

The provided example is a starting point that you can customize based on your specific requirements. Here are a few considerations for customizing the code:

Filtering columns: If you only need specific columns from each sheet, you can modify the code to extract only the desired columns during the data retrieval step. You can do this by using the iter_cols method instead of the values attribute and using a filtered list in a for loop, or by filtering the resulting pandas DataFrame object(s).
Handling missing data: If the sheets contain missing data, you can incorporate appropriate handling techniques, such as filling in missing values or excluding incomplete rows.
Applying transformations: Depending on the nature of your data, you might need to apply transformations or calculations to the consolidated dataset. The custom function can be expanded to accommodate these transformations.

Remember, the goal is to tailor the code to suit your unique needs and ensure it aligns with your data processing requirements.
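
The three customizations listed above can each be expressed in a line or two of pandas once the data is in a DataFrame. The following sketch uses a small made-up DataFrame as a stand-in for the consolidated data; the column names and the derived sepal_ratio column are assumptions chosen for illustration:

```python
import pandas as pd

# Stand-in for a consolidated DataFrame with some gaps in it
consolidated = pd.DataFrame(
    {
        "Sepal.Length": [5.1, None, 6.3],
        "Sepal.Width": [3.5, 3.2, None],
        "Species": ["setosa", "versicolor", "virginica"],
    }
)

# Filtering columns: keep only the columns of interest
selected = consolidated[["Sepal.Length", "Species"]]

# Handling missing data: either drop incomplete rows...
complete_rows = consolidated.dropna()
# ...or fill the gaps, here with the column mean for one column
filled = consolidated.fillna(
    {"Sepal.Length": consolidated["Sepal.Length"].mean()}
)

# Applying transformations: add a derived column
transformed = consolidated.assign(
    sepal_ratio=consolidated["Sepal.Length"] / consolidated["Sepal.Width"]
)

print(selected)
print(complete_rows)
print(filled)
print(transformed)
```

Whether to drop or fill missing values depends on the analysis, so treat the choices here as placeholders rather than recommendations.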

By leveraging the power of openpyxl and creating custom functions, you can efficiently read in multiple sheets from Excel files, consolidate the data, and prepare it for further analysis. This capability enables you to unlock valuable insights from complex Excel files and leverage the full potential of your data.

Now, let’s dive into an example that demonstrates this process:

from openpyxl import load_workbook
import pandas as pd
def read_single_sheet(workbook, sheet_name):
    # Load the sheet from the workbook
    sheet = workbook[sheet_name]
    # Read out the raw data including headers
    sheet_data_raw = sheet.values
    # Separate the headers into a variable
    columns = next(sheet_data_raw)
    # Create a DataFrame based on the second and subsequent lines of data with the header as column names and return it
    return pd.DataFrame(sheet_data_raw, columns=columns)
def read_multiple_sheets(file_path):
    # Load the workbook
    workbook = load_workbook(file_path)
    # Get a list of all sheet names in the workbook
    sheet_names = workbook.sheetnames
    # Cycle through the sheet names, load the data for each and concatenate them into a single DataFrame
    return pd.concat(
        [read_single_sheet(workbook=workbook, sheet_name=sheet_name) for sheet_name in sheet_names],
        ignore_index=True,
    )
# Define the file path
file_path = 'iris_data.xlsx' # adjust the path as needed
# Read the data from multiple sheets
consolidated_data = read_multiple_sheets(file_path)
# Display the consolidated data
print(consolidated_data.head())

Let’s have a look at the results:

Figure 1.8 – Using the openpyxl package to read in the Excel file

In the preceding code, we define two functions:

read_single_sheet: This reads the data from a single sheet into a pandas DataFrame
read_multiple_sheets: This reads and concatenates the data from all sheets in the workbook

Within the read_multiple_sheets function, we load the workbook using load_workbook and iterate over the sheet names. For each sheet, we retrieve the data using the read_single_sheet helper function, which reads the data from a sheet and creates a pandas DataFrame for each sheet’s data, with the header row used as column names. Finally, we use pd.concat to combine all the DataFrames into a single consolidated DataFrame.

By utilizing these custom functions, we can easily read in multiple sheets from an Excel file and obtain a consolidated dataset. This allows us to perform various data manipulations, analyses, or visualizations on the combined data.

Understanding how to handle multiple sheets efficiently enhances our ability to work with complex Excel files and extract valuable insights from diverse datasets.
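
One possible variation, offered here as a sketch rather than as the book’s method, is to record which sheet each row came from and to skip sheets that would duplicate data, such as a full iris sheet that sits next to the per-species sheets. The function names, the skip parameter, and the in-memory sample workbook below are our own assumptions:

```python
from openpyxl import Workbook
import pandas as pd


def sheet_to_frame(sheet):
    """Turn a worksheet whose first row is a header into a DataFrame."""
    rows = sheet.values
    header = next(rows)
    return pd.DataFrame(rows, columns=list(header))


def read_sheets_with_origin(workbook, skip=()):
    """Concatenate all sheets except those in `skip`, tagging rows with the sheet name."""
    frames = [
        sheet_to_frame(workbook[name]).assign(sheet=name)
        for name in workbook.sheetnames
        if name not in skip
    ]
    return pd.concat(frames, ignore_index=True)


# In-memory sample: a full "iris" sheet plus one sheet per species (made-up values)
wb = Workbook()
full = wb.active
full.title = "iris"
header = ["Sepal.Length", "Species"]
full.append(header)
samples = {
    "setosa": [[5.1, "setosa"]],
    "versicolor": [[7.0, "versicolor"], [6.4, "versicolor"]],
}
for name, rows in samples.items():
    species_sheet = wb.create_sheet(title=name)
    species_sheet.append(header)
    for row in rows:
        species_sheet.append(row)
        full.append(row)

result = read_sheets_with_origin(wb, skip=("iris",))
print(result)
```

Without the skip argument, every observation in this sample would appear twice: once from the full sheet and once from its species sheet.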
