Introduction to Geospatial Data in the Cloud
This book is divided into four parts that will walk you through key concepts, tools, and techniques for dealing with geospatial data. Part 1 sets the foundation for the entire book, establishing key ideas that the subsequent parts build on. Each chapter is further subdivided into topics that dive deep into a specific subject. This introductory chapter of Part 1 will cover the following topics:
- Introduction to cloud computing and AWS
- Storing geospatial data in the cloud
- Building your geospatial data strategy
- Geospatial data management best practices
- Cost management in the cloud
Introduction to cloud computing and AWS
You are most likely familiar with the benefits that geospatial analysis can provide. Governmental entities, corporations, and other organizations routinely solve complex, location-based problems with the help of geospatial computing. While paper maps are still around, most use cases for geospatial data have evolved to live in the digital world. We can now create maps faster and draw more geographical insights from data than at any point in history. This phenomenon has been made possible by blending the expertise of geospatial practitioners with the power of Geographical Information Systems (GIS). Critical thinking and higher-order analysis can be done by humans while computers handle the monotonous data processing and rendering tasks. As the geospatial community continues to refine the balance of which jobs require manual effort and which can be handled by computers, we are collectively improving our ability to understand our world.
Geospatial computing has been around for decades, but the last 10 years have seen a dramatic shift in the capabilities and computing power available to practitioners. The emergence of the cloud as a fundamental building block of technical systems has offered needle-moving opportunities in compute, storage, and analytical capabilities. In addition to a revolution in the infrastructure behind GIS, the cloud has expanded the options available in every layer of the technical stack. Common problems such as running out of disk space, long-running geospatial processing jobs, limited data availability, and difficult collaboration across teams can be things of the past. AWS provides solutions to these problems and more, and in this book, we will describe, dissect, and provide examples of how you can apply them in your organization.
Cloud computing provides the ability to rapidly experiment with new tools and processing techniques that would never be possible using a fixed set of compute resources. Not only are new capabilities available and continually improving, but your team will also have more time to learn and use these new technologies with the time saved in creating, configuring, and maintaining the environment. The undifferentiated heavy lifting of managing geospatial storage devices, application servers, geodatabases, and data flows can be replaced with time spent analyzing, understanding, and visualizing the data. Traditional this-or-that technical trade-off decisions are no longer binary propositions. Your organization can use the right tool for each job, and blend as many tools and features into your environment as is appropriate for your requirements. By paying for the precise amount of resources you use in AWS, it is possible to break free from restrictive, punitive, and time-limiting licensing situations. In some cases, the amount of an AWS compute resource you use is measured and charged down to the millisecond, so you don’t pay for a second of unused time. If a team infrequently needs to leverage a capability, such as a monthly data processing job, this can result in substantial cost savings by eliminating idle virtual machines and supporting technical resources. If cost savings are not your top concern, the same proportion of your budget can be dedicated to more capable hardware that delivers dramatically shorter processing times compared to limited compute environments.
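To make the pay-for-what-you-use argument concrete, the following minimal sketch compares an always-on server with a job that is billed only while it runs. The hourly rate, the two-hour job duration, and the function names are hypothetical placeholders chosen for illustration; they are not AWS prices or AWS APIs.

```python
# All rates and durations here are hypothetical placeholders, not AWS prices.
HOURS_PER_MONTH = 730          # average hours in a month (8,760 / 12)
MS_PER_HOUR = 3_600_000        # milliseconds in one hour


def always_on_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of a server that runs all month, whether busy or idle."""
    return hourly_rate * hours


def metered_cost(rate_per_ms: float, job_duration_ms: int, runs_per_month: int) -> float:
    """Cost when usage is metered per millisecond of actual run time."""
    return rate_per_ms * job_duration_ms * runs_per_month


hourly_rate = 0.50                          # hypothetical price per hour
rate_per_ms = hourly_rate / MS_PER_HOUR     # the same rate expressed per millisecond
job_ms = 2 * MS_PER_HOUR                    # one two-hour monthly processing job

idle = always_on_cost(hourly_rate)
metered = metered_cost(rate_per_ms, job_ms, runs_per_month=1)
print(f"always on: {idle:.2f}  metered: {metered:.2f}")
```

Under these assumed numbers the gap comes entirely from idle hours, which is why infrequent workloads such as a monthly job tend to benefit the most from consumption-based billing.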
The global infrastructure of AWS allows you to position data in the best location to minimize latency, providing the best possible performance. Powerful replication and caching technologies can be used to minimize wait time and allow robust cataloging and characterization of your geospatial assets. The global flexibility of your GIS environment is further enabled with the use of innovative end user compute options. Virtual desktop services in AWS allow organizations to keep the geospatial processing close to the data for maximum performance, even if the user is geographically distanced from both. AWS and the cloud have continued to evolve and provide never-before-seen capabilities in geospatial power and flexibility. Over the course of this book, we will examine what these concepts are, how they work, and how you can put them to work in your environment.
Now that we have learned the story of cloud computing on AWS, let’s look at how geospatial data can be stored there.
Storing geospatial data in the cloud
As you learn about the possibilities for storing geospatial data in the cloud, it may seem daunting due to the number of options available. Many AWS customers experiment with Amazon Simple Storage Service (S3) for geospatial data storage as their first project. Relational databases, NoSQL databases, and caching options commonly follow in the evolution of geospatial technical architectures. General GIS data storage best practices still apply to the cloud, so much of the knowledge that practitioners have gained over the years directly applies to geospatial data management on AWS. Familiar GIS file formats that work well in S3 include the following:
- Shapefiles (.shp, .shx, .dbf, .prj, and others)
- File geodatabases (.gdb)
- Keyhole Markup Language (.kml)
- Comma-Separated Values (.csv)
- GeoJSON, the JavaScript Object Notation format for geographic features (.geojson)
- GeoTIFF, a georeferenced Tagged Image File Format (.tiff)
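Because a shapefile is really a set of sibling files, one practical concern when moving it to object storage is keeping all of its components together under one key prefix. The sketch below plans the object keys for a single shapefile and checks that the three mandatory components (.shp, .shx, .dbf) are present. The function name, the file names, and the prefix layout are assumptions made for illustration; the sketch does not call S3 and only builds the key names.

```python
from pathlib import PurePosixPath

# Components every shapefile must include; .prj and others are optional.
REQUIRED_SUFFIXES = {".shp", ".shx", ".dbf"}


def plan_shapefile_upload(filenames, prefix):
    """Return {local filename: object key} for one shapefile.

    Raises ValueError if the files do not share one base name or if a
    mandatory component is missing. (Hypothetical helper, not an AWS API.)
    """
    paths = [PurePosixPath(name) for name in filenames]
    stems = {p.stem for p in paths}
    if len(stems) != 1:
        raise ValueError(f"expected one base name, found {sorted(stems)}")
    present = {p.suffix.lower() for p in paths}
    missing = REQUIRED_SUFFIXES - present
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    base = prefix.rstrip("/")
    return {p.name: f"{base}/{p.name}" for p in paths}


# Illustrative file names and prefix.
keys = plan_shapefile_upload(
    ["parcels.shp", "parcels.shx", "parcels.dbf", "parcels.prj"],
    prefix="vector/parcels/",
)
for local_name, key in keys.items():
    print(local_name, "->", key)
```

The resulting mapping could then be handed to whatever upload tool your team already uses; validating before uploading avoids storing a shapefile that desktop GIS software cannot open.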
The physical location of data is still important for latency-sensitive workloads. Formats and organization of data can usually remain unchanged when moving to S3 to limit the impact of migrations. Spatial indexes and use-based access patterns will dramatically improve the performance and ability of your system to deliver the desired capabilities to your users.
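As a rough illustration of why a spatial index helps, the following sketch buckets point features into fixed-size grid cells so that a bounding-box query only inspects the cells it overlaps instead of every feature. This is a deliberately simple grid index written for this example, not the indexing scheme of any particular AWS service or GIS product; the one-degree cell size and the sample coordinates are assumptions.

```python
import math
from collections import defaultdict


def cell_id(lon, lat, cell_deg=1.0):
    """Map a coordinate to the (column, row) of a fixed-size grid cell."""
    return (math.floor(lon / cell_deg), math.floor(lat / cell_deg))


def build_index(features, cell_deg=1.0):
    """Group (name, lon, lat) features by the grid cell that contains them."""
    index = defaultdict(list)
    for name, lon, lat in features:
        index[cell_id(lon, lat, cell_deg)].append(name)
    return index


def query_bbox(index, min_lon, min_lat, max_lon, max_lat, cell_deg=1.0):
    """Return candidate names from every cell that overlaps the bounding box.

    Candidates still need an exact coordinate check, because a cell can
    extend beyond the edges of the requested box.
    """
    col0, row0 = cell_id(min_lon, min_lat, cell_deg)
    col1, row1 = cell_id(max_lon, max_lat, cell_deg)
    hits = []
    for col in range(col0, col1 + 1):
        for row in range(row0, row1 + 1):
            hits.extend(index.get((col, row), []))
    return hits


# Approximate, illustrative coordinates (lon, lat).
features = [
    ("seattle", -122.33, 47.61),
    ("portland", -122.68, 45.52),
    ("denver", -104.99, 39.74),
]
index = build_index(features)
candidates = query_bbox(index, -123.0, 47.0, -122.0, 48.0)
print(candidates)
```

The same idea carries over to storage layout: if object key prefixes or table partitions follow the cells that users query most often, reads can skip data that cannot match.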
Relational databases have long been the cornerstone of most enterprise GIS environments. This is especially true for vector datasets. AWS offers the most comprehensive set of relational database options with flexible sizing and architecture to meet your specific requirements. For customers looking to migrate geodatabases to the cloud with the least amount of environmental change, Amazon Elastic Compute Cloud (EC2) virtual machine instances provide a similar capability to what is commonly used in on-premises data centers. Each database server can be instantiated on the specific operating system that is used by the source server. Using EC2 with Amazon Elastic Block Store (EBS) network-attached storage provides the highest level of control and flexibility. Each server is created by specifying the amount of CPU, memory, and network throughput desired. Relational database management system (RDBMS) software can be manually installed on the EC2 instance, or an Amazon Machine Image (AMI) for the particular use case can be selected from the AWS catalog to remove manual steps from the process. While this option provides the highest degree of flexibility, it also requires the most database configuration and administration knowledge.
Many customers find it useful to leverage Amazon Relational Database Service (RDS) to establish database clusters and instances for their GIS environments. With RDS, you can create full-featured Microsoft SQL Server, Oracle, PostgreSQL, MySQL, or MariaDB databases. AWS allows the selection of specific instance types to focus on memory or compute optimization in a variety of configurations. Databases can be deployed across multiple Availability Zones (Multi-AZ) to establish fault tolerance or improve performance. Using RDS dramatically simplifies database administration and decreases the time required to select, provision, and configure your geospatial database using the specific technical parameters that meet the business requirements.
Amazon Aurora provides an open source-compatible path to highly capable and performant relational databases. PostgreSQL-compatible or MySQL-compatible environments can be created with specific settings for the desired capabilities. Although this may mean converting data from a source format, such as Microsoft SQL Server or Oracle, the overall cost savings and simplified management make this an attractive option to modernize and right-size any geospatial database.
In addition to standard relational database options, AWS provides other services to manage and use geospatial data. Amazon Redshift is the fastest and most widely used cloud data warehouse and supports geospatial data through the geometry data type. Users can query spatial data with Redshift’s built-in SQL functions to find the distance between two points, interrogate polygon relationships, and provide other location insights into their data. Amazon DynamoDB is a fully managed, key-value NoSQL database with an SLA of up to 99.999% availability. For organizations leveraging MongoDB, Amazon DocumentDB provides a fully managed, MongoDB-compatible option for simplified instantiation and management. Finally, AWS offers the Amazon OpenSearch Service for petabyte-scale data storage, search, and visualization.
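To give a sense of what a point-to-point distance query computes, here is a minimal Python sketch of the haversine great-circle formula. It is offered as a language-neutral illustration under stated assumptions, not as Redshift's implementation or SQL syntax: the function name is hypothetical, the Earth is treated as a sphere with a mean radius of about 6,371 km, and the sample coordinates are approximate.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (spherical model)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Approximate, illustrative coordinates (lon, lat).
seattle = (-122.33, 47.61)
portland = (-122.68, 45.52)
distance_km = haversine_m(*seattle, *portland) / 1000
print(f"{distance_km:.1f} km")
```

A data warehouse evaluates this kind of function across many rows at once, which is what makes in-database spatial SQL attractive compared with exporting the data for processing elsewhere.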
The best part is that you don’t have to choose a single option for your geospatial environment. Often, companies find that different workloads benefit from having the ability to choose the most appropriate data landscape. Combining Infrastructure as a Service (IaaS) workloads with fully managed databases and modern databases is not only possible but a signature of a well-architected geospatial environment. Transactional systems may benefit from relational geodatabases, while mobile applications may be more aligned with NoSQL data stores. When you operate in a world of consumption-based resources, there is no downside to using the most appropriate data store for each workload. Having familiarity with the cloud options for storing geospatial data is crucial in strategic planning, which we will cover in the next topic.