
Applied Geospatial Data Science with Python

By David S. Jordan
  1. Free Chapter
    Chapter 1: Introducing Geographic Information Systems and Geospatial Data Science
About this book
Data scientists, when presented with a myriad of data, can often lose sight of how to present geospatial analyses in a meaningful way so that they make sense to everyone. Using Python to visualize data helps stakeholders in less technical roles to understand the problem and seek solutions. The goal of this book is to help data scientists and GIS professionals learn and implement geospatial data science workflows using Python. Throughout this book, you’ll uncover numerous geospatial Python libraries with which you can develop end-to-end spatial data science workflows. You’ll learn how to read, process, and manipulate spatial data effectively. With data in hand, you’ll move on to crafting spatial data visualizations to better understand and tell the story of your data through static and dynamic mapping applications. As you progress through the book, you’ll find yourself developing geospatial AI and ML models focused on clustering, regression, and optimization. The use cases can be leveraged as building blocks for more advanced work in a variety of industries. By the end of the book, you’ll be able to tackle random data, find meaningful correlations, and build geospatial data models.
Publication date:
February 2023
Publisher
Packt
Pages
308
ISBN
9781803238128

 

Introducing Geographic Information Systems and Geospatial Data Science

It is estimated that human society generates several quintillion bytes of data every day. The amount and speed with which our society generates data are also estimated to increase yearly as more and more devices become connected. Devices in the palms of our hands generate rich data assets ranging from detailed human movement data to data on purchasing behavior that connects online transactions with those made at physical storefronts. At the same time, remote sensing devices located outside our atmosphere are generating detailed images, known as satellite imagery, of the Earth at a 0.5-meter resolution, and this detail is improving at a breakneck pace.

Our ability to produce data is only rivaled by our ability to process that same data. Computer ecosystems are rapidly evolving, and Moore’s Law, the observation that the number of transistors on a chip, and with it computing power, doubles roughly every 2 years, is alive and well! Advances in CPUs, GPUs, and data storage components, combined with improved coding languages and analytical methods, allow us to process data and make data-informed decisions faster than ever.

With all of this data and improved technology, organizations and individuals are looking for better and more efficient ways to derive meaning from data that may have once been treated as a byproduct of a technical process. This desire to find meaning in data has led to data science being one of the most in-demand skill sets of the 21st century.

In the introductory chapter of this book, Applied Geospatial Data Science with Python, we’ll begin by defining Geographic Information Systems (GIS), data science, and geospatial data science. These definitions will lay the groundwork and begin to develop a common vernacular that will enable you, the data scientists and traditional GIS professionals, to work in harmony to solve some of the most complex and, dare we say it, fun problems of modern times.

In this chapter, we will cover the following topics:

  • What is GIS?
  • What is data science?
  • What is geospatial data science?
 

What is GIS?

GIS stands for Geographic Information Systems. GIS are computerized systems used in the creation, collection, organization, analysis, and visualization of geospatial data. Geospatial data is a representation of the real world and it is rooted in geography. Geography is the study of the physical features of the Earth and its atmosphere, as well as how human activity impacts both. Human activity is looked at through many lenses, such as population distribution and land usage.

To represent the Earth in a GIS, you will leverage one of two data formats: vectors or rasters. Figure 1.1 shows a stylized version of how real-world data can be represented in vector and raster formats. We’ll define and discuss both of these terms in more detail in Chapter 2, What Is Geospatial Data and Where Can I Find It?

Figure 1.1 – Real-world data in vector and raster format
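
To make the two formats a little more concrete before Chapter 2, here is a minimal sketch in plain Python. It is only an illustration: the coordinates, the elevation values, the cell size, and the helper `cell_value` are all made up for this example, and the vector features simply follow the GeoJSON geometry layout.

```python
# Vector: discrete features described by their coordinates.
# The dictionaries follow the GeoJSON geometry layout; all values are made up.
fire_station = {"type": "Point", "coordinates": [2.5, 2.5]}
parcel = {
    "type": "Polygon",
    # One closed ring: the first and last vertices are the same.
    "coordinates": [[
        [0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0], [0.0, 0.0],
    ]],
}

# Raster: a regular grid of cells, each holding a single value
# (here, hypothetical elevations in meters covering the same area).
elevation = [
    [12, 14, 15, 17],
    [11, 13, 16, 18],
    [10, 12, 14, 19],
]
CELL_SIZE = 1.0          # assumed width and height of one cell, in map units
UPPER_LEFT = (0.0, 3.0)  # assumed map coordinate of the grid's top-left corner


def cell_value(grid, x, y, cell_size, upper_left):
    """Return the value of the raster cell that contains map coordinate (x, y)."""
    x_min, y_max = upper_left
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)  # row 0 is the top of the grid
    return grid[row][col]


x, y = fire_station["coordinates"]
station_elevation = cell_value(elevation, x, y, CELL_SIZE, UPPER_LEFT)
print(station_elevation)  # 15
```

The vector features store exact shapes, while the raster answers the question "what is the value at this location?" for every cell in its extent.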


A typical GIS enables you to query and combine data assets in relation to the spatial relationship of each asset. This data is then visualized in the form of a static or interactive map or within a mapping application.

Geospatial data stored within a GIS comes in many different formats and from many different domains. A GIS used in local government may include information on the land parcels of local neighborhoods, the roads that run through those neighborhoods, and the location of public service infrastructure, such as hospitals and fire stations. A GIS servicing a local weather station may include some of these assets, but will likely also include other types of data, such as real-time feeds of storm paths, rainfall totals, and wind speeds at various points around an area at various times. In Chapter 2, What Is Geospatial Data and Where Can I Find It?, we will focus more on various types of spatial data, their file formats, including shapefiles and GeoJSON, and some of the public sources in which spatial data can be found.

In your day-to-day life, you’ve likely used a GIS platform or an application more frequently than you may have realized. Take, for instance, Google Maps, which is arguably the most used GIS application in the world. Google Maps allows you to search for points of interest around you, such as a coffee shop or an auto mechanic, find directions to these points of interest, and also understand adverse conditions such as rush-hour traffic or roadworks that may impact your commute. There are many other forms of GIS applications out there, including applications that trace the route of an Amazon delivery vehicle as it approaches your home, applications that help you understand where public buses and transit hubs are located, and even applications that help monitor the spread of infectious diseases, as we mentioned in the preface to this book.

In addition to web and mobile GIS systems, there are also desktop-based, point-and-click GIS platforms that allow users to perform more complex spatial operations and analyses. These platforms are typically used by specialized GIS practitioners who often have the title of geographer, GIS analyst, GIS engineer, or GIS specialist. These systems are used in a variety of different industries for different purposes. A GIS analyst in local government may use a desktop GIS platform to edit parcel boundaries within a town, while a GIS analyst for a rail operator may use it to monitor the operation status and location of each railcar. The uses of GIS and the industries in which it is used are nearly limitless.

Typically, desktop GIS systems are provided by vendors, with the most dominant vendor in the space being Esri. As the dominant player in the GIS space, Esri’s proprietary software integrates with numerous applications from other vendors, including Microsoft products and Autodesk’s AutoCAD. In more recent versions of its software, Esri has also extended its application to work with many open source data science languages, such as Python and R, and Integrated Development Environments (IDEs), such as Jupyter Notebook. This book will focus on open source Python packages that do not require licensing. In Chapter 4, Exploring Geospatial Data Science Packages, we will cover packages including GeoPandas, PySAL, and GeoViews, along with many others you’ll leverage in the case studies later in this book.

Now that you have an understanding of GIS, let’s define what data science is. As we define data science, hopefully, you’ll begin to see how GIS and data science interact.

 

What is data science?

In the simplest terms, data science is the practice of deriving new insights from raw and disparate data assets and communicating those insights to stakeholders in a way that drives impact. The domain of data science combines facets of mathematics, statistics in particular, with computer science and industry- or domain-specific knowledge. Various sources and authors will define data science and the role of the data scientist differently, with some of these sources including soft skills such as communication and consulting as the fourth pillar of data science. These four pillars of data science are represented in Figure 1.2:

Figure 1.2 – Data science pillars


Each of these components alone can be tricky to master. It is important to recognize that most data scientists are not experts in all of these areas, but have foundational knowledge in each of them. This breadth enables you to generate and communicate more robust and impactful insights with greater efficiency.

Before we dive deeper into the four components of data science, let’s briefly discuss the data science pipeline at a high level. The pipeline can be thought of as follows:

  1. Collecting
  2. Cleaning
  3. Exploring
  4. Processing
  5. Modeling
  6. Validating
  7. Storytelling

Often, the data science process is not performed in a linear fashion and instead can look something like the process displayed in Figure 1.3:

Figure 1.3 – Data science pipeline
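
One way to picture the non-linear flow is as a loop that steps back to an earlier stage whenever validation fails. The following is only a toy sketch: the stage names mirror the list above, but the rule "return to exploring after a failed validation" and the function `run_pipeline` are assumptions made for illustration, not a prescribed workflow.

```python
STAGES = [
    "collecting", "cleaning", "exploring", "processing",
    "modeling", "validating", "storytelling",
]


def run_pipeline(validation_passes_on_attempt=2):
    """Walk the stages in order, stepping back to 'exploring' when validation fails.

    `validation_passes_on_attempt` is a stand-in for a real validation check:
    the model is treated as acceptable only on that attempt.
    """
    visited = []
    attempt = 0
    i = 0
    while i < len(STAGES):
        stage = STAGES[i]
        visited.append(stage)
        if stage == "validating":
            attempt += 1
            if attempt < validation_passes_on_attempt:
                # Validation failed: revisit the data before modeling again.
                i = STAGES.index("exploring")
                continue
        i += 1
    return visited


path = run_pipeline()
print(" -> ".join(path))
```

With the default setting, the stages from exploring through validating are visited twice before the pipeline reaches storytelling.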


While these are the general steps to be completed within a data science pipeline, every pipeline looks a bit different based on the problem you’re trying to solve.

The skills needed at each of these steps also differ, which is why having knowledge across the four pillars of data science is so important. Now that we’ve talked about the data science pipeline, let’s break down the four pillars into a bit more detail.

Mathematics

To be successful as a data scientist, knowledge of mathematics is required, as it is the underpinning of data science. Data science often focuses on taking raw data from the past, identifying patterns in said data, and making a prediction about what will happen in the future based on those patterns. In order to do this, applications of calculus, linear algebra, and statistics are necessary. However, you don’t have to be an expert mathematician or statistician to understand how to identify and test the most suitable analytic method for the problem at hand.

Statistics is especially critical in the earlier stages of a data science process as you are sampling or developing a method for collecting data, as well as developing statistical hypotheses that will later be tested using the data you’ve collected. Statistics is also important as you begin to think through the types of algorithms that are appropriate for your analysis and then as you test each algorithm’s related assumptions. Chapter 6, Hypothesis Testing and Spatial Randomness, will introduce you to hypothesis testing and the concept of spatial randomness, which is a critical hypothesis to test within the context of a geospatial data science workflow. In later stages of the data science process, calculus and linear algebra become more important, as they are the foundation of most algorithms.
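
As a small preview of what testing for spatial randomness can look like, the sketch below runs a simple Monte Carlo test in plain Python. It is not the approach taken in Chapter 6; it is one possible illustration, and the test statistic (mean nearest-neighbor distance), the unit-square study area, the sample data, and the function names are all assumptions chosen for this example.

```python
import math
import random


def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest other point."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def clustering_p_value(points, n_simulations=199, seed=42):
    """One-sided Monte Carlo p-value for clustering inside the unit square.

    Null hypothesis: the points were placed independently and uniformly
    at random (often called complete spatial randomness).
    """
    rng = random.Random(seed)
    observed = mean_nearest_neighbor_distance(points)
    at_least_as_clustered = 0
    for _ in range(n_simulations):
        simulated = [(rng.random(), rng.random()) for _ in points]
        if mean_nearest_neighbor_distance(simulated) <= observed:
            at_least_as_clustered += 1
    # Count the observed pattern itself so the p-value is never zero.
    return (at_least_as_clustered + 1) / (n_simulations + 1)


# Hypothetical data: 20 points packed into one small corner of the unit square.
data_rng = random.Random(0)
clustered = [(0.05 * data_rng.random(), 0.05 * data_rng.random()) for _ in range(20)]

p_value = clustering_p_value(clustered)
print(p_value)
```

A small p-value here suggests the points sit closer together than you would expect from random placement, which is evidence against the null hypothesis of spatial randomness.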

Having knowledge of these subjects will allow you to understand the model you’re developing, further refine its accuracy, and explain your model to end users in a comprehensible way. In Part 3, Geospatial Modeling Case Studies, of this book, we will focus on geospatial data science case studies that enact the full scope of the data science process and utilize a variety of algorithms in their solutions. Each case study will provide you with a greater perspective on how taking a geospatial data science approach to your analysis will provide you, and your stakeholders, with richer insights.

Computer science

Computer science is the next domain you’ll need to understand to become successful in data science, as most jobs in this field will require some knowledge of data storage as well as programming skills in Python, R, SAS, Structured Query Language (SQL), or another scripting language.

At the very beginning of the data science pipeline, you’ll need to know where your data is stored or will be stored once it is collected. Relying on traditional file-based storage systems that store data as individual files is no longer suitable for day-to-day work in large enterprise settings. Often, a data scientist will need to use SQL to query data from a relational database such as Teradata, Oracle, or PostgreSQL, to name a few. SQL allows data scientists to query data from different tables and join the related data together based on common identifiers. Data scientists are often required to understand how to connect to a database, query the individual tables that make up the database, and export and transfer the data to the platform that will be used for further analysis, modeling, and visualization.
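
The connect, query, and join steps can be sketched with Python’s built-in `sqlite3` module. SQLite is used here only as a lightweight stand-in for an enterprise database, and the `stores` and `sales` tables, their columns, and their rows are made up for illustration.

```python
import sqlite3

# Connect to a throwaway in-memory database (a stand-in for a real server).
conn = sqlite3.connect(":memory:")

# Create and populate two small, hypothetical tables.
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, store_id INTEGER, amount REAL);
    INSERT INTO stores VALUES (1, 'Austin'), (2, 'Denver');
    INSERT INTO sales VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);
""")

# Join the tables on their common identifier and total the sales per city.
query = """
    SELECT s.city, SUM(t.amount) AS total_sales
    FROM stores AS s
    JOIN sales AS t ON t.store_id = s.store_id
    GROUP BY s.city
    ORDER BY s.city
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)  # [('Austin', 65.0), ('Denver', 15.5)]
```

From here, the rows could be handed to whichever platform you use for further analysis, modeling, and visualization.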

For geospatial data scientists, SQL also enables you to begin working with geospatial data through the use of spatial SQL. Spatial SQL enables you to perform many spatial operations with ease, including point-in-polygon intersections, spatial unions or joins, buffers, and Euclidean, or crow-flies, distance calculations. Users of more traditional, desktop-based GIS applications are often amazed at the spatial operations that can be performed, and repeated, with a few simple lines of code.
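
To show what a few of these operations compute, here is a minimal plain-Python sketch of a point-in-polygon test, a circular buffer check, and a Euclidean distance. In a spatial database these would be single function calls (PostGIS, for example, offers `ST_Contains`, `ST_DWithin`, and `ST_Distance`); the hand-written helpers and the sample coordinates below are illustrative assumptions, not a replacement for those tools.

```python
import math


def point_in_polygon(point, polygon):
    """Ray-casting test: True if the point lies inside the polygon.

    `polygon` is a list of (x, y) vertices; the last vertex connects
    back to the first. Points exactly on the boundary are not handled specially.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def within_buffer(point, center, radius):
    """True if the point falls inside a circular buffer around the center."""
    return math.dist(point, center) <= radius


# Hypothetical features in arbitrary planar map units.
parcel = [(0, 0), (4, 0), (4, 3), (0, 3)]
hydrant = (1, 1)
station = (4, 5)

print(point_in_polygon(hydrant, parcel))   # True
print(point_in_polygon(station, parcel))   # False
print(math.dist(hydrant, station))         # 5.0
print(within_buffer(hydrant, station, 5))  # True
```

Note that these calculations assume planar coordinates; distances on latitude and longitude values need a projected coordinate system or a geodesic formula.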

In the later stages of the pipeline, you’ll need to know Python, R, or SAS in order to develop a machine learning or AI model on the data you’ve collected and explored in earlier phases. Toward the end of the data science pipeline, you’ll then want to use these languages to visualize and interpret your results. For the purpose of this book, we will focus on the Python scripting language, as it is one of the more robust and extensible languages for data science in general, and in particular for geospatial data science. In Chapter 4, Exploring Geospatial Data Science Packages, we will focus on setting up your Python-based geospatial data science environment and provide you with an overview of the packages needed to perform various types of analysis and modeling.

For geospatial data science, you will often run into problems that require the use of large datasets or computationally intensive solutions. In each of these cases, knowledge of computer science skills becomes more important, as these problems can be solved better by leveraging the advancements in distributed (or parallel) computing and big data storage arrays. These environments allow you to break the process and data down into smaller chunks that can be distributed to multiple worker nodes in the parallel compute ecosystem. Breaking down the problem in this way can take a process that would have taken days, weeks, or even months on a single desktop and reduce the time to minutes or seconds.
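
The chunk-and-distribute pattern can be sketched on a single machine with Python’s standard library. This is only an illustration of the idea: a thread pool stands in for the worker nodes of a real cluster, and the grid of points, the chunk size, and the helper functions are made up for the example.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_near(points, site=(0.0, 0.0), radius=5.0):
    """Count the points that fall within `radius` of `site`."""
    return sum(1 for p in points if math.dist(p, site) <= radius)


# Hypothetical dataset: a 10 x 10 grid of points.
points = [(float(i % 10), float(i // 10)) for i in range(100)]

# Break the data into smaller chunks and hand each chunk to a worker.
chunks = chunked(points, 25)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_near, chunks))

# Combine the partial results into the final answer.
total = sum(partial_counts)
print(total == count_near(points))  # True
```

The same split, process, and combine structure carries over to process pools and to distributed frameworks, where each chunk is sent to a separate worker node instead of a thread.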

Industry and domain knowledge

Having the technical skills to pull and analyze data and then develop a model is meaningless without knowledge relevant to the specific industry or domain in which the data science problem is rooted. In data science, there is an adage that states garbage in, garbage out, which often refers to bad data being used to generate a bad model.

Someone who doesn’t have domain-related context will often pull data that isn’t relevant or useful to solve the problem at hand. This bad data, when used to develop a model, will often not yield the insights that the stakeholder was expecting. To prevent pulling bad data, you’ll often need to work hand in hand with stakeholders to understand the full context of the problem or issue you are trying to solve. Once this context is obtained, you’ll be able to pull relevant data, understand the data in relation to the problem, and develop a perspective on the algorithm best suited for the individual situation. Industry- and domain-based knowledge are also necessary as you are developing and validating the results of your model in the later stages of the pipeline.

Soft skills

Often, a data scientist will not be working with other technical individuals or even conducting technical processes for 100% of their day. While a data scientist is required to have a strong technical background and understand the intricate nuances of programming and mathematics, it is rare that the end users or those being supported by a data scientist will have this same knowledge base. As mentioned in the previous section, a data scientist will need to rely on their stakeholders to understand and frame the problem at hand. This requires strong communication and collaboration skills to develop the working relationship. This relationship then becomes even more critical when a data scientist has completed their process and is working to interpret meaningful results. Often, a data scientist will also need to have strong consulting and influencing skills, especially in the business world, as they will need to influence stakeholders to implement and rely on the results of their technical processes.

To be a data scientist, you’ll need to have a working knowledge across a wide array of topics as we’ve discussed in this section. However, data science is not a solo activity and you’ll often be able to rely on others on your team or in the data science community to support and help you develop in the areas in which you’re learning. Data science is a practice and we’re all learning and growing every day.

Having developed an understanding of GIS and data science, you should now start to have an inkling about how they combine to form geospatial data science. We’ll talk more about this powerful combination in the next section.

 

What is geospatial data science?

Geospatial data science lies at the intersection of data science and GIS as depicted in Figure 1.4:

Figure 1.4 – Geospatial data science Venn diagram


Geospatial data science is a subclass of general data science that concentrates on geospatial data, its unique properties, and specialized techniques and computation methods necessary for deriving insights from this data. Instead of treating spatial data as another feature in a tabular dataset, spatial data science goes deeper into understanding why things are happening in a particular place and how they are related, or unrelated, to the things going on around it. Spatial data science focuses on identifying spatial relationships based on location, distance, and intersections between objects. We’ll talk more about spatial relationships in Part 2, Exploratory Spatial Data Analysis, of this book, as we conduct exploratory spatial data analysis.
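
As a small taste of what using spatial relationships looks like in practice, the sketch below finds each location’s neighbors within a distance threshold and averages their values, a quantity often called a spatial lag. The site names, coordinates, values, threshold, and helper functions are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical sites: name -> ((x, y) location, observed value).
sites = {
    "A": ((0.0, 0.0), 10.0),
    "B": ((1.0, 0.0), 12.0),
    "C": ((0.0, 1.0), 14.0),
    "D": ((5.0, 5.0), 40.0),
}


def neighbors_within(sites, name, max_distance):
    """Names of all other sites no farther than `max_distance` from the named site."""
    origin = sites[name][0]
    return [
        other
        for other, (location, _) in sites.items()
        if other != name and math.dist(origin, location) <= max_distance
    ]


def spatial_lag(sites, name, max_distance):
    """Mean value of a site's neighbors, or None if it has no neighbors."""
    nearby = neighbors_within(sites, name, max_distance)
    if not nearby:
        return None
    return sum(sites[n][1] for n in nearby) / len(nearby)


print(neighbors_within(sites, "A", 1.5))  # ['B', 'C']
print(spatial_lag(sites, "A", 1.5))       # 13.0
print(spatial_lag(sites, "D", 1.5))       # None
```

Comparing a site’s own value with the average of its neighbors is one simple way to ask whether nearby places behave alike, which a plain tabular analysis of the same values would not reveal.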

While this book will focus primarily on geospatial data science, that is, data science focused on data pertaining to the Earth, it is worth noting that the concepts can be expanded and translated to general spatial data science. Spatial data science focuses on where the datum is located and how it is related across space and therefore can be applied to problems at a microscopic scale, such as the location of atoms in your body, or problems much larger, such as the distance between asteroids in the main belt between Mars and Jupiter.

By looking at data while taking into account its spatial relationships, you will often find substantial improvements to existing or developing models. As mentioned at the start of this chapter, the domain of spatial data science is rapidly growing and new avenues for exploration are uncovered every day.

 

Summary

In this chapter, we defined the differences and commonalities between GIS, data science, and geospatial data science. As we discussed data science, we took a deep dive into the four pillars of data science, which include mathematics, computer science, domain and industry knowledge, and soft skills.

We also briefly discussed the stages involved in the data science process. Parts 2 and 3 of this book will provide you with more hands-on experience in implementing the data science process through exploratory data analysis, hypothesis testing, and in-depth data science use cases, covering a variety of topics and algorithms.

We also discussed how the principles of geospatial data science can be applied more broadly within the domain of spatial data science to solve problems at a smaller, microscopic level, as well as larger, astronomical scales. The power of geospatial data science is only starting to be realized as industries, data storage, and computing methodologies evolve. We’re excited that you’ve decided to embark on this learning journey with us and are even more excited to see what you achieve in your journey to become a geospatial data scientist.

In the next chapter, we’ll dive deeper into the world of geospatial data, which we briefly described in this chapter as being a representation of the real world in vector or raster format. We’ll also spend time in the next chapter discussing the rich sources of open geospatial data.

About the Author
  • David S. Jordan

    David S. Jordan has made a career out of applying spatial thinking to tough problem spaces in the domains of real estate planning, disaster response, social equity, and climate change. He currently leads distribution and geospatial data science at JPMorgan Chase & Co. In addition to leading and building out geospatial data science teams, David is a patented inventor of new geospatial analytics processes, a winner of a Special Achievement in GIS (SAG) Award from Esri, and a conference speaker on topics including banking deserts and how great businesses leverage GIS.
