You're reading from Extending Excel with Python and R

Product typeBook

Published inApr 2024

PublisherPackt

ISBN-139781804610695

Edition1st Edition

Concepts

Data Analysis

Authors (2):

Steven Sanderson

David Kun

View More author details

Reading in multiple sheets with Python (openpyxl and custom functions)

In many Excel files, it’s common to have multiple sheets containing different sets of data. Being able to read in multiple sheets and consolidate the data into a single data structure can be highly valuable for analysis and processing. In this section, we will explore how to achieve this using the openpyxl library and a custom function.

The importance of reading multiple sheets

When working with complex Excel files, it’s not uncommon to encounter scenarios where related data is spread across different sheets. For example, you may have one sheet for sales data, another for customer information, and yet another for product inventory. By reading in multiple sheets and consolidating the data, you can gain a holistic view and perform a comprehensive analysis.

Let’s start by examining the basic steps involved in reading in multiple sheets:

Load the workbook: Before accessing the sheets, we need to load the Excel workbook using the load_workbook function provided by openpyxl.
Get the sheet names: We can obtain the names of all the sheets in the workbook using the sheetnames attribute. This allows us to identify the sheets we want to read.
Read data from each sheet: By iterating over the sheet names, we can access each sheet individually and read the data. Openpyxl provides methods such as iter_rows or iter_cols to traverse the cells of each sheet and retrieve the desired data.
Store the data: To consolidate the data from multiple sheets, we can use a suitable data structure, such as a pandas DataFrame or a Python list. As we read the data from each sheet, we concatenate or merge it into the consolidated data structure:
- If the data in all sheets follows the same format (as is the case in the example used in this chapter), we can simply concatenate the datasets
- However, if the datasets have different structures because they describe different aspects of a dataset (for example, one sheet contains product information, the next contains customer data, and the third contains the sales of the products to the customers), then we can merge these datasets based on unique identifiers to create a comprehensive dataset

Using openpyxl to access sheets

openpyxl is a powerful library that allows us to interact with Excel files using Python. It provides a wide range of functionalities, including accessing and manipulating multiple sheets. Before we dive into the details, let’s take a moment to understand why openpyxl is a popular choice for this task.

One of the primary advantages of openpyxl is its ability to handle various Excel file formats, such as .xlsx and .xlsm. This flexibility allows us to work with different versions of Excel files without compatibility issues. Additionally, openpyxl provides a straightforward and intuitive interface to access sheet data, making it easier for us to retrieve the desired information.

Reading data from each sheet

To begin reading in multiple sheets, we need to load the Excel workbook using the load_workbook function provided by openpyxl. This function takes the file path as input and returns a workbook object that represents the entire Excel file.

Once we have loaded the workbook, we can retrieve the names of all the sheets using the sheetnames attribute. This gives us a list of sheet names present in the Excel file. We can then iterate over these sheet names to read the data from each sheet individually.

Retrieving sheet data with openpyxl

openpyxl provides various methods to access the data within a sheet.

Two commonly used methods are iter_rows and iter_cols. These methods allow us to iterate over the rows or columns of a sheet and retrieve the cell values.

Let’s have a look at how iter_rows can be used:

# Assuming you are working with the first sheet
sheet = wb['versicolor']
# Iterate over rows and print raw values
for row in sheet.iter_rows(min_row=1, max_row=5, values_only=True):
    print(row)

Similarly, iter_cols can be used like this:

# Iterate over columns and print raw values
for column in sheet.iter_cols(min_col=1, max_col=5, values_only=True):
    print(column)

When using iter_rows or iter_cols, we can specify whether we want to retrieve the cell values as raw values or as formatted values. Raw values give us the actual data stored in the cells, while formatted values include any formatting applied to the cells, such as date formatting or number formatting.

By iterating over the rows or columns of a sheet, we can retrieve the desired data and store it in a suitable data structure. One popular choice is to use pandas DataFrame, which provide a tabular representation of the data and offer convenient methods for data manipulation and analysis.

An alternative solution is using the values attribute of the sheet object. This provides a generator for all values in the sheet (much like iter_rows and iter_cols do for rows and columns, respectively). While generators cannot be used directly to access the data, they can be used in for cycles to iterate over each value. The pandas library’s DataFrame function also allows direct conversion from a suitable generator object to a DataFrame.

Combining data from multiple sheets

As we read the data from each sheet, we can store it in a list or dictionary, depending on our needs. Once we have retrieved the data from all the sheets, we can combine it into a single consolidated data structure. This step is crucial for further analysis and processing.

To combine the data, we can use pandas DataFrame. By creating individual DataFrame for each sheet’s data and then concatenating or merging them into a single DataFrame, we can obtain a comprehensive dataset that includes all the information from multiple sheets.

Custom function for reading multiple sheets

To simplify the process of reading in multiple sheets and consolidating the data, we can create custom functions tailored to our specific requirements. These functions encapsulate the necessary steps and allow us to reuse the code efficiently.

In our example, we define a function called read_multiple_sheets that takes the file path as input. Inside the function, we load the workbook using load_workbook and iterate over the sheet names retrieved with the sheets attribute.

For each sheet, we access it using the workbook object and retrieve the data using the custom read_single_sheet function. We then store the retrieved data in a list. Finally, we combine the data from all the sheets into a single pandas DataFrame using the appropriate concatenation method from pandas.

By using these custom functions, we can easily read in multiple sheets from an Excel file and obtain a consolidated dataset that’s ready for analysis. The function provides a reusable and efficient solution, saving us time and effort in dealing with complex Excel files.

Customizing the code

The provided example is a starting point that you can customize based on your specific requirements. Here are a few considerations for customizing the code:

Filtering columns: If you only need specific columns from each sheet, you can modify the code to extract only the desired columns during the data retrieval step. You can do this by using the iter_cols method instead of the values attribute and using a filtered list in a for cycle or by filtering the resulting pandas DataFrame object(s).
Handling missing data: If the sheets contain missing data, you can incorporate appropriate handling techniques, such as filling in missing values or excluding incomplete rows.
Applying transformations: Depending on the nature of your data, you might need to apply transformations or calculations to the consolidated dataset. The custom function can be expanded to accommodate these transformations.

Remember, the goal is to tailor the code to suit your unique needs and ensure it aligns with your data processing requirements.

By leveraging the power of openpyxl and creating custom functions, you can efficiently read in multiple sheets from Excel files, consolidate the data, and prepare it for further analysis. This capability enables you to unlock valuable insights from complex Excel files and leverage the full potential of your data.

Now, let’s dive into an example that demonstrates this process:

from openpyxl import load_workbook
import pandas as pd
def read_single_sheet(workbook, sheet_name):
   # Load the sheet from the workbook
    sheet = workbook[sheet_name]
    # Read out the raaw data including headers
    sheet_data_raw = sheet.values
    # Separate the headers into a variable
    columns = next(sheet_data_raw)[0:]
    # Create a DataFrame based on the second and subsequent lines of data with the header as column names and return it
    return pd.DataFrame(sheet_data_raw, columns=columns)
def read_multiple_sheets(file_path):
    # Load the workbook
    workbook = load_workbook(file_path)
    # Get a list of all sheet names in the workbook
    sheet_names = workbook.sheetnames
    # Cycle through the sheet names, load the data for each and concatenate them into a single DataFrame
    return pd.concat([read_single_sheet(workbook=workbook, sheet_name=sheet_name) for sheet_name in sheet_names], ignore_index=True)
# Define the file path and sheet names
file_path = 'iris_data.xlsx' # adjust the path as needed
# Read the data from multiple sheets
consolidated_data = read_multiple_sheets(file_path)
# Display the consolidated data
print(consolidated_data.head())

Let’s have a look at the results:

Figure 1.8 – Using the openxlsx package to read in the Excel file

In the preceding code, we define two functions:

read_single_sheet: This reads the data from a single sheet into a pandas DataFrame
read_multiple_sheets: This reads and concatenates the data from all sheets in the workbook

Within the read_multiple_sheets function, we load the workbook using load_workbook and iterate over the sheet names. For each sheet, we retrieve the data using the read_single_sheet helper function, which reads the data from a sheet and creates a pandas DataFrame for each sheet’s data, with the header row used as column names. Finally, we use pd.concat to combine all the DataFrame into a single consolidated DataFrame.

By utilizing these custom functions, we can easily read in multiple sheets from an Excel file and obtain a consolidated dataset. This allows us to perform various data manipulations, analyses, or visualizations on the combined data.

Understanding how to handle multiple sheets efficiently enhances our ability to work with complex Excel files and extract valuable insights from diverse datasets.

You have been reading a chapter from

Extending Excel with Python and R

Published in: Apr 2024Publisher: PacktISBN-13: 9781804610695

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at €14.99/month. Cancel anytime

Authors (2)

Steven Sanderson

Steven Sanderson, MPH, is an applications manager for the patient accounts department at Stony Brook Medicine. He received his bachelor's degree in economics and his master's in public health from Stony Brook University. He has worked in healthcare in some capacity for just shy of 20 years. He is the author and maintainer of the healthyverse set of R packages. He likes to read material related to social and labor economics and has recently turned his efforts back to his guitar with the hope that his kids will follow suit as a hobby they can enjoy together.
Read more about Steven Sanderson

David Kun

David Kun is a mathematician and actuary who has always worked in the gray zone between quantitative teams and ICT, aiming to build a bridge. He is a co-founder and director of Functional Analytics and the creator of the ownR Infinity platform. As a data scientist, he also uses ownR for his daily work. His projects include time series analysis for demand forecasting, computer vision for design automation, and visualization.
Read more about David Kun

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages