You're reading from Developing Kaggle Notebooks

Product type: Book
Published in: Dec 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781805128519
Edition: 1st Edition
Languages: Python
Concepts: Data Analysis
Author: Gabriel Preda

Join our book community on Discord

https://packt.link/EarlyAccessCommunity


We continue our journey around the world of data by exploring two datasets with geographically distributed data. The first dataset is Every Pub in England. It contains the unique ID, name, address, postcode, and geographical position of every pub in England. The second dataset is Starbucks Locations Worldwide. It contains the store number, name, and ownership, as well as the street address, city, and geographical information (latitude and longitude) for all Starbucks stores in the world. We will combine these two datasets and add supporting geographical data. We will learn how to work with missing data and perform data imputation where needed, how to visualize geographical data, how to clip and merge polygon data, how to generate custom maps, and how to create multiple layers over maps.

Pubs in England

The dataset contains data about 51,566 pubs in England, including the pub name, address, postal code, geographical position (both as easting and northing and as latitude and longitude), and local authority. I created a Notebook, Every Pub in England – Data Exploration, to investigate this data.

Data quality check

For the data quality check, we will use info() and describe() to get a first glimpse. Then, we can also use our custom data quality statistics functions. We saw these functions in the previous chapter, so we will not repeat them here. Because we will keep using them, we will group them in a utility script. I called this utility script data_quality_stats, and in this module I defined the functions missing_data, most_frequent_values, and unique_values. To use the functions defined in this utility script, we first need to add it to the Notebook. From the File menu, we select the Add utility script menu item. Then, we add the import in one of the first Notebook cells...
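As a rough illustration of what such a module might contain, here is a minimal sketch of three functions named after those in data_quality_stats. Only the function names come from the text; the bodies, the returned column labels, and the sample data are assumptions made for this example, not the book's actual implementation. On Kaggle, after adding the utility script, these would be imported from data_quality_stats rather than defined in the Notebook.

```python
import pandas as pd


def missing_data(df: pd.DataFrame) -> pd.DataFrame:
    """Per column: number of missing values, their percentage, and the dtype."""
    total = df.isnull().sum()
    percent = 100 * total / len(df)
    return pd.DataFrame(
        {"Total": total, "Percent": percent, "Types": df.dtypes.astype(str)}
    )


def most_frequent_values(df: pd.DataFrame) -> pd.DataFrame:
    """Per column: the most frequent value, its count, and its share of non-null rows."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts()  # NaN values are excluded by default
        non_null = int(df[col].count())
        if counts.empty:
            rows.append({"Column": col, "Most frequent item": None,
                         "Frequency": 0, "Percent": 0.0})
            continue
        rows.append({
            "Column": col,
            "Most frequent item": counts.index[0],
            "Frequency": int(counts.iloc[0]),
            "Percent": round(100 * counts.iloc[0] / non_null, 3),
        })
    return pd.DataFrame(rows).set_index("Column")


def unique_values(df: pd.DataFrame) -> pd.DataFrame:
    """Per column: number of non-null values and number of distinct values."""
    return pd.DataFrame({"Total": df.count(), "Uniques": df.nunique()})


# Tiny invented sample (not taken from the real dataset).
sample = pd.DataFrame({
    "name": ["The Crown", "The Crown", "Red Lion", None],
    "postcode": ["AA1 1AA", None, "BB2 2BB", None],
})

missing_report = missing_data(sample)
frequent_report = most_frequent_values(sample)
unique_report = unique_values(sample)
print(missing_report)
print(frequent_report)
print(unique_report)
```

Returning a DataFrame from each function keeps the reports easy to display in a Notebook cell and to compare across datasets.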

Starbucks in the World

We start the analysis of the Starbucks Locations Worldwide dataset with a detailed exploratory data analysis (EDA) in the notebook Starbucks Location Worldwide - Data Exploration. The tools used in this notebook are imported from the data_quality_stats and plot_style_utils utility scripts. Before starting our analysis, it is important to explain that the dataset used for this analysis is from Kaggle and was collected 6 years ago. In the meantime, Starbucks' business has expanded considerably, so the number of shops and their geographical distribution are no longer up to date.

Preliminary data analysis

The dataset has 25,600 rows. Only 1 row is missing the latitude and longitude values, 2 are missing the Street Address, and 15 are missing the City. The fields with the most missing data are Postcode (5.9%) and Phone Number (26.8%). Figure 4.16 shows a sample of the data.

Figure 4.16. First rows of Starbucks dataset

Looking at the most frequent values report, we can learn...

Pubs and Starbucks in London

Until now, our analysis has focused on the individual datasets Every Pub in England and Starbucks Locations Worldwide. To support some of the data analysis tasks for these two datasets, we also added two more: one with the geographical position of postal codes, used to replace missing latitude and longitude data, and one with shapefile data for the United Kingdom, used to clip the Voronoi polygons generated from pub positions so that they align with the land contour of the islands. In the following, we will combine the information from the two main data sources analysed separately and apply the methods developed during this preliminary analysis to support the objective of our study. We will focus on a smaller region, London, where we have both a high density of pubs and a concentration of Starbucks coffee shops. We can already hypothesize that the geospatial concentration of Starbucks is smaller than the concentration of pubs. We would like to see what the...
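The imputation step mentioned above, replacing missing latitude and longitude values with the position of the postal code, could look roughly like the following sketch. The function name fill_coordinates_from_postcodes, the lookup column names lat and lon, and all sample values are hypothetical and chosen for illustration; the real postcode dataset may use different column names.

```python
import pandas as pd


def fill_coordinates_from_postcodes(df: pd.DataFrame,
                                    lookup: pd.DataFrame) -> pd.DataFrame:
    """Fill missing latitude/longitude from a postcode lookup table.

    Assumes `df` has columns postcode, latitude, longitude and `lookup`
    has columns postcode, lat, lon (hypothetical names for this sketch).
    """
    # Keep one row per postcode so the left merge cannot duplicate rows of df.
    lookup = lookup.drop_duplicates(subset="postcode")
    merged = df.merge(lookup, on="postcode", how="left")
    # Existing coordinates are kept; only missing ones are taken from the lookup.
    merged["latitude"] = merged["latitude"].fillna(merged["lat"])
    merged["longitude"] = merged["longitude"].fillna(merged["lon"])
    return merged.drop(columns=["lat", "lon"])


# Invented sample data, not taken from the real datasets.
pubs = pd.DataFrame({
    "name": ["The Crown", "Red Lion", "The Anchor"],
    "postcode": ["AA1 1AA", "BB2 2BB", "CC3 3CC"],
    "latitude": [51.50, None, None],
    "longitude": [-0.12, None, None],
})
postcodes = pd.DataFrame({
    "postcode": ["AA1 1AA", "BB2 2BB"],
    "lat": [51.51, 53.48],
    "lon": [-0.13, -2.24],
})

filled = fill_coordinates_from_postcodes(pubs, postcodes)
print(filled)
```

Rows whose postcode is absent from the lookup table keep their missing coordinates, so they can still be reported or dropped explicitly in a later step.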

Summary

In this chapter, we learned how to work with geographical information and maps, how to manipulate geometry data (clip and merge polygon data, cluster data to generate less detailed maps, remove subsets of geospatial data), and how to superpose several layers of data over maps. We also learned how to modify and extract information from shapefiles using geopandas and custom code, and how to create or calculate geospatial features such as terrain area or the density of geospatial objects. Additionally, we extracted reusable functions and grouped them in two utility scripts, which is Kaggle's term for independent Python modules. These utility scripts can be imported like any other library and integrated with your Notebook code. In the next chapter, we will put some of these tools and techniques to work in a data analytics competition.

References

Every Pub in England, Kaggle Datasets, https://www.kaggle.com/datasets/rtatman/every-pub-in-england
Starbucks Locations Worldwide, Kaggle Datasets, https...


Gabriel Preda

Dr. Gabriel Preda is a Principal Data Scientist for Endava, a major software services company. He has worked on projects in various industries, including financial services, banking, portfolio management, telecom, and healthcare, developing machine learning solutions for various business problems, including risk prediction, churn analysis, anomaly detection, task recommendations, and document information extraction. In addition, he is very active in competitive machine learning, currently holding the title of a three-time Kaggle Grandmaster and is well-known for his Kaggle Notebooks.