
Extending Excel with Python and R: Unlock the potential of analytics languages for advanced data manipulation and visualization

By Steven Sanderson , David Kun
Book Apr 2024 344 pages 1st Edition

Product Details


Publication date : Apr 30, 2024
Length 344 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781804610695

Reading Excel Spreadsheets

In the broad landscape of data analysis, Excel is a trusted tool that simplifies organizing, calculating, and presenting information. Its intuitive interface and widespread usage have cemented its position as a staple in the business world. However, as the volume and complexity of data continue to grow, Excel’s capabilities may start to feel constrained. It is precisely at this point that the worlds of Excel, R, and Python converge. Extending Excel with Python and R shows how these programming languages work alongside Excel and extend what it can do. In this book, we will delve into how to integrate Excel with R and Python so that you can extract valuable insights, automate processes, and handle data challenges that are awkward in Excel alone.

Microsoft Excel came to market in 1985 and has remained a popular spreadsheet software choice. Its predecessor at Microsoft was a spreadsheet program called Multiplan. Microsoft Excel and databases in general share some similarities in terms of organizing and managing data, although they serve different purposes. Excel is a spreadsheet program that allows users to store and manipulate data in a tabular format. It consists of rows and columns, where each cell can contain text, numbers, or formulas. Similarly, a database is a structured collection of data stored in tables, consisting of rows and columns.

Both Excel and databases provide a way to store and retrieve data. In Excel, you can enter data, perform calculations, and create charts and graphs. Similarly, databases store and manage large amounts of structured data and enable querying, sorting, and filtering. Excel and databases also support the concept of relationships. In Excel, you can link cells or ranges across different sheets, creating connections between data. Databases use relationships to link tables based on common fields, allowing you to retrieve related data from multiple tables.

This chapter aims to familiarize you with reading Excel files into the R and Python environments and performing some manipulation on them. Specifically, in this chapter, we’re going to cover the following main topics:

  • R packages for Excel manipulation
  • Reading Excel files to manipulate with R
  • Reading multiple Excel sheets with a custom R function
  • Python packages for Excel manipulation
  • Opening an Excel sheet from Python and reading the data

Technical requirements

At the time of writing, we are using the following:

  • R 4.2.1
  • The RStudio 2023.03.1+446 “Cherry Blossom” release for Windows

For this chapter, you will need to install the following packages:

  • readxl
  • openxlsx
  • xlsx

To run the Python code in this chapter, we will be using the following:

  • Python 3.11
  • pandas
  • openpyxl
  • The iris_data.xlsx Excel file available in this book’s GitHub repository

While setting up a Python environment is outside the scope of this book, this is easy to do. The necessary packages can be installed by running the following commands:

python -m pip install pandas==2.0.1
python -m pip install openpyxl==3.1.2

Note that these commands have to be run from a terminal and not from within a Python script.

This book’s GitHub repository also contains a requirements.txt file that you can use to install all dependencies. You can do this by running the following command:

python -m pip install -r requirements.txt

This command also has to be run from a terminal, either in the folder where requirements.txt resides or with the full path to the requirements.txt file included. It installs all the packages that will be used in this chapter so that you don’t have to install them one by one. It also helps ensure that the whole dependency tree (including the dependencies of the dependencies) is the same as what this book’s authors have used.

Alternatively, when using Jupyter Notebooks, you can use the following magic commands:

%pip install pandas==2.0.1
%pip install openpyxl==3.1.2
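
After installing, it can be useful to confirm that the environment matches the versions pinned above. The following is a minimal sketch that uses only the Python standard library. The names installed_version, check_environment, and EXPECTED are illustrative assumptions and are not part of this book’s repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install commands above
EXPECTED = {"pandas": "2.0.1", "openpyxl": "3.1.2"}


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED):
    """Map each package name to an (installed, expected) tuple."""
    return {name: (installed_version(name), wanted) for name, wanted in expected.items()}


# Report each package; a differing version is not necessarily a problem
for name, (found, wanted) in check_environment().items():
    status = "OK" if found == wanted else "differs"
    print(f"{name}: installed={found}, expected={wanted} ({status})")
```

A status of "differs" only means the installed version is not the one the code in this chapter was tested with; the examples may still run.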

The code examples in this book are available in a GitHub repository at this link: https://github.com/PacktPublishing/Extending-Excel-with-Python-and-R. Each chapter has its own folder, with the current one as Chapter01.

Note

Technical requirements for Python throughout the book are compiled in the requirements.txt file, available in the GitHub repository at https://github.com/PacktPublishing/Extending-Excel-with-Python-and-R/blob/main/requirements.txt. Installing these dependencies will streamline your coding experience and ensure smooth progression through the book. Be sure to install them all before diving into the exercises.

Working with R packages for Excel manipulation

There are several packages available both on CRAN and on GitHub that allow for reading and manipulation of Excel files. In this section, we are specifically going to focus on the packages: readxl, openxlsx, and xlsx to read Excel files. These three packages all have their own functions to read Excel files. These functions are as follows:

  • readxl::read_excel()
  • openxlsx::read.xlsx()
  • xlsx::read.xlsx()

Each function has a set of parameters and conventions to follow. Since readxl is part of the tidyverse collection of packages, it follows its conventions and returns a tibble object upon reading the file. If you do not know what a tibble is, it is a modern version of R’s data.frame, a sort of spreadsheet in the R environment. It is the building block of most analyses. Moving on to openxlsx and xlsx, they both return a base R data.frame object, with the latter also able to return a list object. You may wonder how this relates to manipulating an actual Excel file. To manipulate something in R, the data must be in the R environment, so you cannot manipulate the file’s contents until the data has been read in. These packages provide different functions for manipulating Excel files or for reading data in ways that allow further analysis and/or manipulation. It is important to note that xlsx requires Java to be installed.

As we transition from our exploration of R packages for Excel manipulation, we’ll turn our attention to the crucial task of effectively reading Excel files into R, thereby unlocking even more possibilities for data analysis and manipulation.

Reading Excel files to R

In this section, we are going to read data from Excel with a few different R libraries. We need to do this before we can even consider performing any type of manipulation or analysis on the data contained in the sheets of the Excel files.

As mentioned in the Technical requirements section, we are going to be using the readxl, openxlsx, and xlsx packages to read data into R.

Installing and loading libraries

In this section, we are going to install the necessary libraries, if you do not yet have them, and load them. We are going to use the openxlsx, xlsx, and readxl libraries. To install and load them, run the following code block:

pkgs <- c("openxlsx", "xlsx", "readxl")
install.packages(pkgs, dependencies = TRUE)
lapply(pkgs, library, character.only = TRUE)

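The lapply() function in R is a versatile tool for applying a function to each element of a list or vector (including the columns of a data frame). Its two main arguments are X and FUN, where X is the list or vector and FUN is the function that is applied to each element of X. Any further arguments, such as character.only = TRUE here, are passed on to FUN.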

Now that the libraries have been installed, we can get to work. To do this, we are going to read a spreadsheet built from the Iris dataset that is built into base R. We are going to read the file with three different libraries, and then we are going to create a custom function to work with the readxl library that will read all the sheets of an Excel file. We will call this the read_excel_sheets() function.

Let’s start reading the files. The first library we will use to open an Excel file is openxlsx. To create the Excel file we are working with, you can run the ch1_create_iris_dataset.R script in the chapter1 folder of this book’s GitHub repository. Refer to the following screenshot to see how to read the file into R.

You will notice a variable called f_pat. This is the path to where the Iris dataset was saved as an Excel file – for example, C:/User/UserName/Documents/iris_data.xlsx:

Figure 1.1 – Using the openxlsx package to read the Excel file


The preceding screenshot shows how to read an Excel file. This example assumes that you have used the ch1_create_iris_dataset.R file to create the example Excel file. In reality, you can read in any Excel file that you would like or need.

Now, we will perform the same type of operation, but this time with the xlsx library. Refer to the following screenshot, which uses the same methodology as with the openxlsx package:

Figure 1.2 – Using the xlsx library and the read.xlsx() function to open the Excel file we’ve created


Finally, we will use the readxl library, which is part of the tidyverse:

Figure 1.3 – Using the readxl library and the read_excel() function to read the Excel file into memory


In this section, we learned how to read in an Excel file with a few different packages. While these packages can do more than simply read in an Excel file, that is what we needed to focus on in this section. You should now be familiar with how to use the readxl::read_excel(), xlsx::read.xlsx(), and openxlsx::read.xlsx() functions.

Building upon our expertise in reading Excel files into R, we’ll now embark on the next phase of our journey: unraveling the secrets of efficiently extracting data from multiple sheets within an Excel file.

Reading multiple sheets with readxl and a custom function

In Excel, we often encounter workbooks that have multiple sheets in them. These could be stats for different months of the year, table data that follows a specific format month over month, or some other period. The point is that we may want to read all the sheets in a file for one reason or another, and we should not call the read function from a particular package for each sheet. Instead, we should use the power of R to loop through this with purrr.

Let’s build a customized function. To do this, we are going to load the readxl library. If you already have it loaded, this is not necessary. Alternatively, if it is installed and you do not wish to load the library into memory, you can call the excel_sheets() function by using readxl::excel_sheets():

Figure 1.4 – Creating a function to read all the sheets in an Excel file at once – read_excel_sheets()


The new code can be broken down as follows:

 read_excel_sheets <- function(filename, single_tbl) {

This line defines a function called read_excel_sheets that takes two arguments: filename (the name of the Excel file to be read) and single_tbl (a logical value indicating whether the function should return a single table or a list of tables).

Next, we have the following line:

sheets <- readxl::excel_sheets(filename)

This line uses the readxl package to extract the names of all the sheets in the Excel file specified by filename. The sheet names are stored in the sheets variable.

Here’s the next line:

if (single_tbl) {

This line starts an if statement that checks the value of the single_tbl argument.

Now, we have the following:

x <- purrr::map_df(sheets, readxl::read_excel, path = filename)

If single_tbl is TRUE, this line uses the purrr package’s map_df function to iterate over each sheet name in sheets and read the corresponding sheet using the read_excel function from the readxl package. Because path is supplied by name, each sheet name is matched to the next positional argument of read_excel, which is sheet. The resulting data frames are combined into a single table, which is assigned to the x variable.

Now, we have the following line:

} else {

This line indicates the start of the else block of the if statement. If single_tbl is FALSE, the code in this block will be executed.

Here’s the next line:

 x <- purrr::map(sheets, ~ readxl::read_excel(filename, sheet = .x))

In this line, the purrr package’s map function is used to iterate over each sheet name in sheets. For each sheet, the read_excel function from the readxl package is called to read the corresponding sheet from the Excel file specified by filename. The resulting data frames are stored in a list assigned to the x variable.

Now, we have the following:

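x <- purrr::set_names(x, sheets)

This line uses the set_names function from the purrr package to set the names of the elements in the x list to the sheet names in sheets. The result is assigned back to x; without the assignment, the named copy would be discarded and the function would return an unnamed list.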

Finally, we have the following line:

 x

This line returns the value of x from the function, which will be either a single table (a tibble) if single_tbl is TRUE, or a list of tables (tibbles) if single_tbl is FALSE.

In summary, the read_excel_sheets function takes an Excel filename and a logical value indicating whether to return a single table or a list of tables. It uses the readxl package to extract the sheet names from the Excel file, and then reads the corresponding sheets either into a single table (if single_tbl is TRUE) or into a list of tables (if single_tbl is FALSE). The resulting data is returned as the output of the function. To see how this works, let’s look at the following example.

We have a spreadsheet that has four tabs in it – one for each species in the famous Iris dataset and then one sheet called iris, which is the full dataset.

As shown in Figure 1.5, the read_excel_sheets() function has read all four sheets of the Excel file. We can also see that the function has imported the sheets as a list object and has named each item in the list after the name of the corresponding tab in the Excel file. It is also important to note that, when single_tbl is TRUE, the sheets should all have the same column names and structure for the combined table to make sense:

Figure 1.5 – Excel file read by read_excel_sheets()


In this section, we learned how to write a function that will read all of the sheets in any Excel file. This function will also return them as a named list, where the names are the names of the tabs in the file itself.

Now that we have learned how to read Excel sheets in R, in the next section, we will cover Python, where we will revisit the same concepts but from the perspective of the Python language.

Python packages for Excel manipulation

In this section, we will explore how to read Excel spreadsheets using Python. One of the key aspects of working with Excel files in Python is having the right set of packages that provide the necessary functionality. In this section, we will discuss some commonly used Python packages for Excel manipulation and highlight their advantages and considerations.

Python packages for Excel manipulation

When it comes to interacting with Excel files in Python, several packages offer a range of features and capabilities. These packages allow you to extract data from Excel files, manipulate the data, and write it back to Excel files. Let’s take a look at some popular Python packages for Excel manipulation.

pandas

pandas is a powerful data manipulation library that can read Excel files using the read_excel function. The advantage of using pandas is that it provides a DataFrame object, which allows you to manipulate the data in a tabular form. This makes it easy to perform data analysis and manipulation. pandas excels in handling large datasets efficiently and provides flexible options for data filtering, transformation, and aggregation.

openpyxl

openpyxl is a widely used library specifically designed for working with Excel files. It provides a comprehensive set of features for reading and writing Excel spreadsheets, including support for various Excel file formats and compatibility with different versions of Excel. In addition, openpyxl allows fine-grained control over the structure and content of Excel files, enabling tasks such as accessing individual cells, creating new worksheets, and applying formatting.

xlrd and xlwt

xlrd and xlwt are older libraries that are still in use for reading and writing Excel files, particularly with legacy formats such as .xls. xlrd enables reading data from Excel files, while xlwt facilitates writing data to Excel files. Note that xlrd version 2.0 and later reads only .xls files. These libraries are lightweight and straightforward to use, but they lack some of the advanced features provided by pandas and openpyxl.

Considerations

When choosing a Python package for Excel manipulation, it’s essential to consider the specific requirements of your project. Here are a few factors to keep in mind:

  • Functionality: Evaluate the package’s capabilities and ensure it meets your needs for reading Excel files. Consider whether you require advanced data manipulation features or if a simpler package will suffice.
  • Performance: If you’re working with large datasets or need efficient processing, packages such as pandas, which have optimized algorithms, can offer significant performance advantages.
  • Compatibility: Check the compatibility of the package with different Excel file formats and versions. Ensure that it supports the specific format you are working with to avoid any compatibility issues.
  • Learning curve: Consider the learning curve associated with each package. Some packages, such as pandas, have a more extensive range of functionality, but they may require additional time and effort to master.

Each package offers unique features and has its strengths and weaknesses, allowing you to read Excel spreadsheets effectively in Python. For example, if you need to read and manipulate large amounts of data, pandas may be the better choice. However, if you need fine-grained control over the Excel file, openpyxl will likely fit your needs better.
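
To make the compatibility consideration concrete, the following minimal sketch maps a file extension to the engine name that could be passed to the engine argument of pandas.read_excel. The helper pick_engine and the ENGINE_BY_SUFFIX mapping are hypothetical names introduced for illustration. By default, pandas chooses an engine on its own, so a helper like this is only useful when you want the choice to be explicit.

```python
from pathlib import Path

# Engine names accepted by the `engine` argument of pandas.read_excel
ENGINE_BY_SUFFIX = {
    ".xlsx": "openpyxl",
    ".xlsm": "openpyxl",
    ".xls": "xlrd",
}


def pick_engine(file_path):
    """Return a pandas engine name for the file, based on its extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return ENGINE_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported Excel extension: {suffix!r}") from None


print(pick_engine("iris_data.xlsx"))     # openpyxl
print(pick_engine("legacy_report.XLS"))  # xlrd
```

The result could then be used as pd.read_excel(path, engine=pick_engine(path)), provided the corresponding package is installed.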

Consider the specific requirements of your project, such as data size, functionality, and compatibility, to choose the most suitable package for your needs. In the following sections, we will delve deeper into how to utilize these packages to read and extract data from Excel files using Python.

Opening an Excel sheet from Python and reading the data

When working with Excel files in Python, it’s common to need to open a specific sheet and read the data into Python for further analysis. This can be achieved using popular libraries such as pandas and openpyxl, as discussed in the previous section.

You can most likely use other Python and package versions, but the code in this section has not been tested with anything other than what we’ve stated here.

Using pandas

pandas is a powerful data manipulation library that simplifies the process of working with structured data, including Excel spreadsheets. To read an Excel sheet using pandas, you can use the read_excel function. Let’s consider an example of using the iris_data.xlsx file with a sheet named setosa:

import pandas as pd
# Read the Excel file
df = pd.read_excel('iris_data.xlsx', sheet_name='setosa')
# Display the first few rows of the DataFrame
print(df.head())

You will need to run this code either with the Python working directory set to the location where the Excel file is located, or you will need to provide the full path to the file in the read_excel() command:

Figure 1.6 – Using the pandas package to read the Excel file


In the preceding code snippet, we imported the pandas library and utilized the read_excel function to read the setosa sheet from the iris_data.xlsx file. The resulting data is stored in a pandas DataFrame, which provides a tabular representation of the data. By calling head() on the DataFrame, we displayed the first few rows of the data, giving us a quick preview.
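
The sheet_name argument of read_excel accepts more than a sheet name: it also takes a zero-based position, and None reads every sheet at once. The sketch below illustrates these options. To stay self-contained, it first writes a small two-sheet workbook with made-up values to a temporary folder; the demo.xlsx file and its contents are assumptions standing in for iris_data.xlsx.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained
# (the values are made up and stand in for iris_data.xlsx)
workbook_path = Path(tempfile.mkdtemp()) / "demo.xlsx"
setosa = pd.DataFrame({"Sepal.Length": [5.1, 4.9], "Species": ["setosa", "setosa"]})
virginica = pd.DataFrame({"Sepal.Length": [6.3, 5.8], "Species": ["virginica", "virginica"]})
with pd.ExcelWriter(workbook_path, engine="openpyxl") as writer:
    setosa.to_excel(writer, sheet_name="setosa", index=False)
    virginica.to_excel(writer, sheet_name="virginica", index=False)

# A sheet can be selected by name or by zero-based position
by_name = pd.read_excel(workbook_path, sheet_name="setosa")
by_position = pd.read_excel(workbook_path, sheet_name=1)

# sheet_name=None reads every sheet into a dict keyed by sheet name
all_sheets = pd.read_excel(workbook_path, sheet_name=None)

print(by_name.shape)      # (2, 2)
print(list(all_sheets))   # ['setosa', 'virginica']
```

Reading all sheets into a dictionary is often the quickest route when the sheets will be processed one by one afterward.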

Using openpyxl

openpyxl is a powerful library for working with Excel files, offering more granular control over individual cells and sheets. To open an Excel sheet and access its data using openpyxl, we can utilize the load_workbook function. Please note that openpyxl cannot handle .xls files, only the more modern .xlsx and .xlsm versions.

Let’s consider an example of using the iris_data.xlsx file with a sheet named versicolor:

import openpyxl
import pandas as pd
# Load the workbook
wb = openpyxl.load_workbook('iris_data.xlsx')
# Select the sheet
sheet = wb['versicolor']
# Extract the values (including header)
sheet_data_raw = sheet.values
# Separate the headers into a variable
header = next(sheet_data_raw)[0:]
# Create a DataFrame based on the second and subsequent lines of data with the header as column names
sheet_data = pd.DataFrame(sheet_data_raw, columns=header)
print(sheet_data.head())

The preceding code results in the following output:

Figure 1.7 – Using the openpyxl package to read the Excel file


In this code snippet, we import the openpyxl library along with pandas. Then, we load the workbook by passing the iris_data.xlsx filename to the load_workbook function. Next, we select the desired sheet by accessing it using its name – in this case, this is versicolor. Once we’ve done this, we read the raw data using the values property of the loaded sheet object. This is a generator that yields one tuple per row and can be consumed in a for loop or by converting it into a list or a DataFrame, for example. Here, we call next() once to take the header row off the generator and then convert the remaining rows into a pandas DataFrame because it is the format that is the most comfortable to work with later.
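
Because the values property returns a generator, each row can be consumed only once. The following minimal sketch shows that behavior on an in-memory workbook; the rows are made up and stand in for a sheet of iris_data.xlsx.

```python
from openpyxl import Workbook

# In-memory workbook with made-up rows, standing in for a sheet of iris_data.xlsx
wb = Workbook()
sheet = wb.active
sheet.title = "versicolor"
sheet.append(["Sepal.Length", "Species"])
sheet.append([7.0, "versicolor"])
sheet.append([6.4, "versicolor"])

rows = sheet.values      # generator that yields one tuple per row
header = next(rows)      # consumes the first row (the header)
records = list(rows)     # only the remaining data rows are left

print(header)            # ('Sepal.Length', 'Species')
print(records)           # [(7.0, 'versicolor'), (6.4, 'versicolor')]
print(list(rows))        # [] because the generator is now exhausted
```

Accessing sheet.values again creates a fresh generator, so the data can be re-read from the start if needed.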

Both pandas and openpyxl offer valuable features for working with Excel files in Python. While pandas simplifies data manipulation with its DataFrame structure, openpyxl provides more fine-grained control over individual cells and sheets. Depending on your specific requirements, you can choose the library that best suits your needs.

By mastering the techniques of opening Excel sheets and reading data into Python, you will be able to extract valuable insights from your Excel data, perform various data transformations, and prepare it for further analysis or visualization. These skills are essential for anyone seeking to leverage the power of Python and Excel in their data-driven workflows.

Reading in multiple sheets with Python (openpyxl and custom functions)

In many Excel files, it’s common to have multiple sheets containing different sets of data. Being able to read in multiple sheets and consolidate the data into a single data structure can be highly valuable for analysis and processing. In this section, we will explore how to achieve this using the openpyxl library and a custom function.

The importance of reading multiple sheets

When working with complex Excel files, it’s not uncommon to encounter scenarios where related data is spread across different sheets. For example, you may have one sheet for sales data, another for customer information, and yet another for product inventory. By reading in multiple sheets and consolidating the data, you can gain a holistic view and perform a comprehensive analysis.

Let’s start by examining the basic steps involved in reading in multiple sheets:

  1. Load the workbook: Before accessing the sheets, we need to load the Excel workbook using the load_workbook function provided by openpyxl.
  2. Get the sheet names: We can obtain the names of all the sheets in the workbook using the sheetnames attribute. This allows us to identify the sheets we want to read.
  3. Read data from each sheet: By iterating over the sheet names, we can access each sheet individually and read the data. openpyxl provides methods such as iter_rows or iter_cols to traverse the cells of each sheet and retrieve the desired data.
  4. Store the data: To consolidate the data from multiple sheets, we can use a suitable data structure, such as a pandas DataFrame or a Python list. As we read the data from each sheet, we concatenate or merge it into the consolidated data structure:
    • If the data in all sheets follows the same format (as is the case in the example used in this chapter), we can simply concatenate the datasets
    • However, if the datasets have different structures because they describe different aspects of a dataset (for example, one sheet contains product information, the next contains customer data, and the third contains the sales of the products to the customers), then we can merge these datasets based on unique identifiers to create a comprehensive dataset
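
The merging case in the last step can be sketched with pandas as follows. The three small tables and their column names (product_id, customer_id, and so on) are invented for illustration and stand in for sheets that have already been read into DataFrames.

```python
import pandas as pd

# Made-up tables standing in for three differently structured sheets
products = pd.DataFrame({"product_id": [1, 2], "product": ["Widget", "Gadget"]})
customers = pd.DataFrame({"customer_id": [10, 11], "customer": ["Ana", "Ben"]})
sales = pd.DataFrame(
    {"product_id": [1, 2, 1], "customer_id": [10, 10, 11], "quantity": [3, 1, 5]}
)

# Join the sales rows to their product and customer details via the shared keys
report = (
    sales.merge(products, on="product_id", how="left")
         .merge(customers, on="customer_id", how="left")
)

print(report)
```

A left join keeps every sales row even if a product or customer identifier has no match, which makes gaps in the reference sheets easy to spot afterward.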

Using openpyxl to access sheets

openpyxl is a powerful library that allows us to interact with Excel files using Python. It provides a wide range of functionalities, including accessing and manipulating multiple sheets. Before we dive into the details, let’s take a moment to understand why openpyxl is a popular choice for this task.

One of the primary advantages of openpyxl is its ability to handle the modern Excel file formats, such as .xlsx and .xlsm. This allows us to work with files created by current versions of Excel. Additionally, openpyxl provides a straightforward and intuitive interface to access sheet data, making it easier for us to retrieve the desired information.

Reading data from each sheet

To begin reading in multiple sheets, we need to load the Excel workbook using the load_workbook function provided by openpyxl. This function takes the file path as input and returns a workbook object that represents the entire Excel file.

Once we have loaded the workbook, we can retrieve the names of all the sheets using the sheetnames attribute. This gives us a list of sheet names present in the Excel file. We can then iterate over these sheet names to read the data from each sheet individually.

Retrieving sheet data with openpyxl

openpyxl provides various methods to access the data within a sheet.

Two commonly used methods are iter_rows and iter_cols. These methods allow us to iterate over the rows or columns of a sheet and retrieve the cell values.

Let’s have a look at how iter_rows can be used:

# Assuming wb is the workbook loaded earlier, select the versicolor sheet
sheet = wb['versicolor']
# Iterate over rows and print raw values
for row in sheet.iter_rows(min_row=1, max_row=5, values_only=True):
    print(row)

Similarly, iter_cols can be used like this:

# Iterate over columns and print raw values
for column in sheet.iter_cols(min_col=1, max_col=5, values_only=True):
    print(column)

When using iter_rows or iter_cols, the values_only argument controls what is returned. With values_only=True, we get the plain values stored in the cells. With the default, values_only=False, we get cell objects instead, which expose the value along with cell metadata such as the coordinate and the number format applied to the cell.
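
In openpyxl, the difference between the two modes of iter_rows comes down to the type of object returned. The following minimal sketch, built on an in-memory workbook with made-up values, contrasts values_only=True with the default behavior.

```python
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
sheet.append(["Sepal.Length", "Species"])
sheet.append([5.1, "setosa"])
sheet["A2"].number_format = "0.00"  # display format only; the stored value is unchanged

# values_only=True yields tuples of plain Python values
plain_rows = list(sheet.iter_rows(min_row=2, values_only=True))

# The default (values_only=False) yields tuples of Cell objects
cell_rows = list(sheet.iter_rows(min_row=2))
first_cell = cell_rows[0][0]

print(plain_rows)  # [(5.1, 'setosa')]
print(first_cell.coordinate, first_cell.value, first_cell.number_format)  # A2 5.1 0.00
```

Cell objects are the better choice when the formatting or position of a cell matters; plain values are lighter when only the data is needed.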

By iterating over the rows or columns of a sheet, we can retrieve the desired data and store it in a suitable data structure. One popular choice is to use pandas DataFrames, which provide a tabular representation of the data and offer convenient methods for data manipulation and analysis.

An alternative solution is using the values attribute of the sheet object. This provides a generator that yields every row of the sheet as a tuple of values (much like iter_rows with values_only=True). While a generator cannot be indexed directly to access the data, it can be used in a for loop to iterate over each row. The pandas library’s DataFrame function also allows direct conversion from a suitable generator object to a DataFrame.

Combining data from multiple sheets

As we read the data from each sheet, we can store it in a list or dictionary, depending on our needs. Once we have retrieved the data from all the sheets, we can combine it into a single consolidated data structure. This step is crucial for further analysis and processing.

To combine the data, we can use pandas DataFrames. By creating an individual DataFrame for each sheet’s data and then concatenating or merging them into a single DataFrame, we can obtain a comprehensive dataset that includes all the information from multiple sheets.
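
When sheets share the same columns, concatenation is usually enough. The sketch below is one possible approach that also records which sheet each row came from; the frames dictionary and the added sheet column are illustrative assumptions, with made-up values in place of real sheet data.

```python
import pandas as pd

# Made-up per-sheet frames keyed by sheet name, as a reading loop might produce
frames = {
    "setosa": pd.DataFrame({"Sepal.Length": [5.1, 4.9]}),
    "versicolor": pd.DataFrame({"Sepal.Length": [7.0]}),
}

# Tag each frame with its source sheet, then stack them row-wise
combined = pd.concat(
    [df.assign(sheet=name) for name, df in frames.items()],
    ignore_index=True,
)

print(combined)
```

Keeping the source sheet as a column makes it possible to filter or group by origin after the data has been consolidated.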

Custom function for reading multiple sheets

To simplify the process of reading in multiple sheets and consolidating the data, we can create custom functions tailored to our specific requirements. These functions encapsulate the necessary steps and allow us to reuse the code efficiently.

In our example, we define a function called read_multiple_sheets that takes the file path as input. Inside the function, we load the workbook using load_workbook and iterate over the sheet names retrieved with the sheetnames attribute.

For each sheet, we access it using the workbook object and retrieve the data using the custom read_single_sheet function. We then store the retrieved data in a list. Finally, we combine the data from all the sheets into a single pandas DataFrame using the appropriate concatenation method from pandas.

By using these custom functions, we can easily read in multiple sheets from an Excel file and obtain a consolidated dataset that’s ready for analysis. The function provides a reusable and efficient solution, saving us time and effort in dealing with complex Excel files.

Customizing the code

The provided example is a starting point that you can customize based on your specific requirements. Here are a few considerations for customizing the code:

  • Filtering columns: If you only need specific columns from each sheet, you can modify the code to extract only the desired columns during the data retrieval step. You can do this by using the iter_cols method instead of the values attribute and using a filtered list in a for loop or by filtering the resulting pandas DataFrame object(s).
  • Handling missing data: If the sheets contain missing data, you can incorporate appropriate handling techniques, such as filling in missing values or excluding incomplete rows.
  • Applying transformations: Depending on the nature of your data, you might need to apply transformations or calculations to the consolidated dataset. The custom function can be expanded to accommodate these transformations.

Remember, the goal is to tailor the code to suit your unique needs and ensure it aligns with your data processing requirements.
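
The three customizations listed above can be sketched on an already consolidated DataFrame as follows. The data is made up, and dropping incomplete rows is only one of several reasonable ways to handle missing values.

```python
import pandas as pd

# Made-up consolidated data with one missing measurement
data = pd.DataFrame(
    {
        "Sepal.Length": [5.1, None, 6.3],
        "Sepal.Width": [3.5, 3.0, 3.3],
        "Species": ["setosa", "setosa", "virginica"],
    }
)

# Filtering columns: keep only the columns of interest
subset = data[["Sepal.Length", "Species"]]

# Handling missing data: drop rows where the measurement is missing
complete = subset.dropna(subset=["Sepal.Length"])

# Applying transformations: average sepal length per species
summary = complete.groupby("Species")["Sepal.Length"].mean()

print(summary)
```

If dropping rows would lose too much data, fillna could be used in place of dropna to substitute a default value instead.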

By leveraging the power of openpyxl and creating custom functions, you can efficiently read in multiple sheets from Excel files, consolidate the data, and prepare it for further analysis. This capability enables you to unlock valuable insights from complex Excel files and leverage the full potential of your data.

Now, let’s dive into an example that demonstrates this process:

from openpyxl import load_workbook
import pandas as pd

def read_single_sheet(workbook, sheet_name):
    # Load the sheet from the workbook
    sheet = workbook[sheet_name]
    # Read out the raw data including headers
    sheet_data_raw = sheet.values
    # Separate the headers into a variable
    columns = next(sheet_data_raw)
    # Create a DataFrame based on the second and subsequent lines of data with the header as column names and return it
    return pd.DataFrame(sheet_data_raw, columns=columns)

def read_multiple_sheets(file_path):
    # Load the workbook
    workbook = load_workbook(file_path)
    # Get a list of all sheet names in the workbook
    sheet_names = workbook.sheetnames
    # Cycle through the sheet names, load the data for each and concatenate them into a single DataFrame
    return pd.concat([read_single_sheet(workbook=workbook, sheet_name=sheet_name) for sheet_name in sheet_names], ignore_index=True)

# Define the file path
file_path = 'iris_data.xlsx' # adjust the path as needed
# Read the data from multiple sheets
consolidated_data = read_multiple_sheets(file_path)
# Display the consolidated data
print(consolidated_data.head())

Let’s have a look at the results:

Figure 1.8 – Using the openpyxl package to read in the Excel file

In the preceding code, we define two functions:

  • read_single_sheet: This reads the data from a single sheet into a pandas DataFrame
  • read_multiple_sheets: This reads and concatenates the data from all sheets in the workbook

Within the read_multiple_sheets function, we load the workbook using load_workbook and iterate over the sheet names. For each sheet, we call the read_single_sheet helper function, which reads the sheet’s data into a pandas DataFrame with the header row used as column names. Finally, we use pd.concat to combine all the DataFrames into a single consolidated DataFrame.
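Once the sheets are concatenated, it is no longer obvious which sheet a row came from. The following sketch is a variation on the functions described above that adds the sheet name as an extra column. The function name read_sheets_with_origin and the source_sheet column are hypothetical, and the two-sheet workbook is built in memory so that the example runs without iris_data.xlsx.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook, load_workbook


def read_sheets_with_origin(source):
    """Read every sheet and record the sheet name in a 'source_sheet' column."""
    workbook = load_workbook(source)
    frames = []
    for sheet_name in workbook.sheetnames:
        rows = workbook[sheet_name].values
        columns = next(rows)  # the first row holds the headers
        frame = pd.DataFrame(rows, columns=columns)
        frame["source_sheet"] = sheet_name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Build a small two-sheet workbook in memory (stand-in data, not the book's file)
demo = Workbook()
first = demo.active
first.title = "setosa"
first.append(["sepal_length", "species"])
first.append([5.1, "setosa"])
second = demo.create_sheet("virginica")
second.append(["sepal_length", "species"])
second.append([6.3, "virginica"])
second.append([5.8, "virginica"])

buffer = BytesIO()
demo.save(buffer)
buffer.seek(0)

combined = read_sheets_with_origin(buffer)
print(combined)
```

Passing a file path instead of the in-memory buffer works the same way, since load_workbook accepts either.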

By utilizing these custom functions, we can easily read in multiple sheets from an Excel file and obtain a consolidated dataset. This allows us to perform various data manipulations, analyses, or visualizations on the combined data.

Understanding how to handle multiple sheets efficiently enhances our ability to work with complex Excel files and extract valuable insights from diverse datasets.
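For comparison, pandas alone can read every sheet in one call when the extra control that openpyxl offers is not needed. This is a hedged sketch rather than the approach used above: the function name read_all_sheets_with_pandas is hypothetical, the workbook is written to an in-memory buffer for demonstration, and pandas still relies on openpyxl being installed as its engine for .xlsx files.

```python
from io import BytesIO

import pandas as pd


def read_all_sheets_with_pandas(source):
    """Read all sheets with pandas and stack them into one DataFrame."""
    # sheet_name=None returns a dict of {sheet name: DataFrame}
    sheets = pd.read_excel(source, sheet_name=None, engine="openpyxl")
    return pd.concat(list(sheets.values()), ignore_index=True)


# Write two small sheets to an in-memory workbook (stand-in data)
buffer = BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame(
        {"sepal_length": [5.1], "species": ["setosa"]}
    ).to_excel(writer, sheet_name="setosa", index=False)
    pd.DataFrame(
        {"sepal_length": [6.3, 5.8], "species": ["virginica", "virginica"]}
    ).to_excel(writer, sheet_name="virginica", index=False)
buffer.seek(0)

combined_pandas = read_all_sheets_with_pandas(buffer)
print(combined_pandas)
```

The custom openpyxl functions remain the better fit when you need per-sheet logic, such as skipping sheets or reading only certain columns.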

Summary

In this chapter, we explored the process of importing data from Excel spreadsheets into our programming environments. For R users, we delved into the functionalities of libraries such as readxl, xlsx, and openxlsx, providing efficient solutions for extracting and manipulating data. We also introduced a custom function, read_excel_sheets, to streamline the process of extracting data from multiple sheets within Excel files. On the Python side, we discussed the essential pandas and openpyxl packages for Excel manipulation, demonstrating their features through practical examples. At this point, you should have a solid understanding of these tools and their capabilities for efficient Excel manipulation and data analysis.

In the next chapter, we will learn how to write the results to Excel.

Key benefits

  • Perform advanced data analysis and visualization techniques with R and Python on Excel data
  • Use exploratory data analysis and pivot table analysis for deeper insights into your data
  • Integrate R and Python code directly into Excel using VBA or API endpoints
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

For businesses, data analysis and visualization are crucial for informed decision-making; however, Excel’s limitations can make these tasks time-consuming and challenging. Extending Excel with Python and R is a game-changing resource written by experts Steven Sanderson, the author of the healthyverse suite of R packages, and David Kun, co-founder of Functional Analytics, the company behind the ownR platform engineering solution for R, Python, and other data science languages. This comprehensive guide transforms the way you work with spreadsheet-based data by integrating Python and R with Excel to automate tasks, execute statistical analysis, and create powerful visualizations. Working through the chapters, you’ll find out how to perform exploratory data analysis and time series analysis, and even how to integrate APIs for maximum efficiency. Whether you’re a beginner or an expert, this book has everything you need to unlock Excel’s full potential and take your data analysis skills to the next level. By the end of this book, you’ll be able to import data from Excel, manipulate it in R or Python, and perform the data analysis tasks in your preferred framework while pushing the results back to Excel for sharing with others as needed.

What you will learn

  • Read and write Excel files with R and Python libraries
  • Automate Excel tasks with R and Python scripts
  • Use R and Python to execute Excel VBA macros
  • Format Excel sheets using R and Python packages
  • Create graphs with ggplot2 and Matplotlib in Excel
  • Analyze Excel data with statistical methods and time series analysis
  • Explore various methods to call R and Python functions from Excel

Table of Contents

20 Chapters
Preface
Part 1: The Basics – Reading and Writing Excel Files from R and Python
Chapter 1: Reading Excel Spreadsheets
Chapter 2: Writing Excel Spreadsheets
Chapter 3: Executing VBA Code from R and Python
Chapter 4: Automating Further – Task Scheduling and Email
Part 2: Making It Pretty – Formatting, Graphs, and More
Chapter 5: Formatting Your Excel Sheet
Chapter 6: Inserting ggplot2/matplotlib Graphs
Chapter 7: Pivot Tables and Summary Tables
Part 3: EDA, Statistical Analysis, and Time Series Analysis
Chapter 8: Exploratory Data Analysis with R and Python
Chapter 9: Statistical Analysis: Linear and Logistic Regression
Chapter 10: Time Series Analysis: Statistics, Plots, and Forecasting
Part 4: The Other Way Around – Calling R and Python from Excel
Chapter 11: Calling R/Python Locally from Excel Directly or via an API
Part 5: Data Analysis and Visualization with R and Python for Excel Data – A Case Study
Chapter 12: Data Analysis and Visualization with R and Python in Excel – A Case Study
Index
Other Books You May Enjoy

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you, the reader, with our need to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title.
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal).
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats does Packt support?

Our eBooks are currently available in a variety of formats, such as PDF and ePub. In the future, this may well change with trends and developments in technology, but please note that our PDFs are not in Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower priced than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.