Geospatial Data Analytics on AWS: Discover how to manage and analyze geospatial data in the cloud

By Scott Bateman , Janahan Gnanachandran , Jeff DeMuth

Product Details


Publication date: Jun 30, 2023
Length: 276 pages
Edition: 1st
Language: English
ISBN-13: 9781804613825


Introduction to Geospatial Data in the Cloud

This book is divided into four parts that will walk you through key concepts, tools, and techniques for dealing with geospatial data. Part 1 sets the foundation for the entire book, establishing key ideas that provide synergy with subsequent parts. Each chapter is further subdivided into topics that dive deep into a specific subject. This introductory chapter of Part 1 will cover the following topics:

  • Introduction to cloud computing and AWS
  • Storing geospatial data in the cloud
  • Building your geospatial data strategy
  • Geospatial data management best practices
  • Cost management in the cloud

Introduction to cloud computing and AWS

You are most likely familiar with the benefits that geospatial analysis can provide. Governmental entities, corporations, and other organizations routinely solve complex, location-based problems with the help of geospatial computing. While paper maps are still around, most use cases for geospatial data have evolved to live in the digital world. We can now create maps faster and draw more geographical insights from data than at any point in history. This phenomenon has been made possible by blending the expertise of geospatial practitioners with the power of Geographical Information Systems (GIS). Critical thinking and higher-order analysis can be done by humans while computers handle the monotonous data processing and rendering tasks. As the geospatial community continues to refine the balance of which jobs require manual effort and which can be handled by computers, we are collectively improving our ability to understand our world.

Geospatial computing has been around for decades, but the last 10 years have seen a dramatic shift in the capabilities and computing power available to practitioners. The emergence of the cloud as a fundamental building block of technical systems has offered needle-moving opportunities in compute, storage, and analytical capabilities. In addition to a revolution in the infrastructure behind GIS systems, the cloud has expanded the optionality in every layer of the technical stack. Common problems such as running out of disk space, long durations of geospatial processing jobs, limited data availability, and difficult collaboration across teams can be things of the past. AWS provides solutions to these problems and more, and in this book, we will describe, dissect, and provide examples of how you can do this for your organization.

Cloud computing provides the ability to rapidly experiment with new tools and processing techniques that would never be possible using a fixed set of compute resources. Not only are new capabilities available and continually improving, but your team will also have more time to learn and use these new technologies with the time saved in creating, configuring, and maintaining the environment. The undifferentiated heavy lifting of managing geospatial storage devices, application servers, geodatabases, and data flows can be replaced with time spent analyzing, understanding, and visualizing the data. Traditional this-or-that technical trade-off decisions are no longer binary proposals. Your organization can use the right tool for each job, and blend as many tools and features into your environment as is appropriate for your requirements. By paying for the precise amount of resources you use in AWS, it is possible to break free from restrictive, punitive, and time-limiting licensing situations. In some cases, the amount of an AWS compute resource you use is measured and charged down to the millisecond, so you literally don’t pay for a second of unused time. If a team infrequently needs to leverage a capability, such as a monthly data processing job, this can result in substantial cost savings by eliminating idle virtual machines and supporting technical resources. If cost savings are not your top concern, the same proportion of your budget can be dedicated to more capable hardware that delivers dramatically reduced timeframes compared to limited compute environments.

The global infrastructure of AWS allows you to position data in the best location to minimize latency, providing the best possible performance. Powerful replication and caching technologies can be used to minimize wait time and allow robust cataloging and characterization of your geospatial assets. The global flexibility of your GIS environment is further enabled with the use of innovative end user compute options. Virtual desktop services in AWS allow organizations to keep the geospatial processing close to the data for maximum performance, even if the user is geographically distanced from both. AWS and the cloud have continued to evolve and provide never-before-seen capabilities in geospatial power and flexibility. Over the course of this book, we will examine what these concepts are, how they work, and how you can put them to work in your environment.

Now that we have learned the story of cloud computing on AWS, let’s check out how we can implement geospatial data there.

Storing geospatial data in the cloud

As you learn about the possibilities for storing geospatial data in the cloud, it may seem daunting due to the number of options available. Many AWS customers experiment with Amazon Simple Storage Service (S3) for geospatial data storage as their first project. Relational databases, NoSQL databases, and caching options commonly follow in the evolution of geospatial technical architectures. General GIS data storage best practices still apply to the cloud, so much of the knowledge that practitioners have gained over the years directly applies to geospatial data management on AWS. Familiar GIS file formats that work well in S3 include the following:

  • Shapefiles (.shp, .shx, .dbf, .prj, and others)
  • File geodatabases (.gdb)
  • Keyhole Markup Language (.kml)
  • Comma-Separated Values (.csv)
  • Geospatial JavaScript Object Notation (.geojson)
  • Georeferenced Tagged Image File Format, or GeoTIFF (.tif, .tiff)

The physical location of data is still important for latency-sensitive workloads. Formats and organization of data can usually remain unchanged when moving to S3 to limit the impact of migrations. Spatial indexes and use-based access patterns will dramatically improve the performance and ability of your system to deliver the desired capabilities to your users.
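Because a shapefile is really a bundle of sidecar files that must travel together, moving one into S3 means moving all of its parts. The following is a minimal boto3 sketch of that idea; the bucket name, prefix, and file paths are hypothetical, and AWS credentials are assumed to be configured:

```python
from pathlib import Path

# The sidecar extensions that commonly make up a shapefile bundle.
SHAPEFILE_PARTS = (".shp", ".shx", ".dbf", ".prj", ".cpg", ".sbn")

def shapefile_keys(local_path: str, prefix: str) -> dict:
    """Map each sidecar file of a shapefile to its destination S3 key."""
    base = Path(local_path)
    keys = {}
    for ext in SHAPEFILE_PARTS:
        part = base.with_suffix(ext)
        keys[str(part)] = f"{prefix}/{part.name}"
    return keys

def upload_shapefile(local_path: str, bucket: str, prefix: str) -> None:
    """Upload every sidecar file that exists on disk."""
    import boto3  # assumes credentials are configured
    s3 = boto3.client("s3")
    for src, key in shapefile_keys(local_path, prefix).items():
        if Path(src).exists():
            s3.upload_file(src, bucket, key)

# e.g., upload_shapefile("data/parcels.shp", "my-gis-bucket", "vector/parcels")
```

Keeping the whole bundle under one prefix preserves the dataset's integrity and mirrors the on-premises folder layout, limiting migration impact as described above.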

Relational databases have long been the cornerstone of most enterprise GIS environments. This is especially true for vector datasets. AWS offers the most comprehensive set of relational database options with flexible sizing and architecture to meet your specific requirements. For customers looking to migrate geodatabases to the cloud with the least amount of environmental change, Amazon Elastic Compute Cloud (EC2) virtual machine instances provide a similar capability to what is commonly used in on-premises data centers. Each database server can be instantiated on the specific operating system that is used by the source server. Using EC2 with Amazon Elastic Block Store (EBS) network-attached storage provides the highest level of control and flexibility. Each server is created by specifying the amount of CPU, memory, and network throughput desired. Relational database management system (RDBMS) software can be manually installed on the EC2 instance, or an Amazon Machine Image (AMI) for the particular use case can be selected from the AWS catalog to remove manual steps from the process. While this option provides the highest degree of flexibility, it also requires the most database configuration and administration knowledge.

Many customers find it useful to leverage Amazon Relational Database Service (RDS) to establish database clusters and instances for their GIS environments. RDS can be leveraged by creating full-featured Microsoft SQL Server, Oracle, PostgreSQL, MySQL, or MariaDB clusters. AWS allows the selection of specific instance types to focus on memory or compute optimization in a variety of configurations. Multiple Availability Zone (AZ)-enabled databases can be created to establish fault tolerance or improve performance. Using RDS dramatically simplifies database administration, and decreases the time required to select, provision, and configure your geospatial database using the specific technical parameters to meet the business requirements.
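As a sketch of how little configuration RDS requires, the snippet below builds the parameters for a PostgreSQL instance (the engine that hosts PostGIS) and passes them to the real `create_db_instance` API via boto3. The identifier, instance class, and storage size are illustrative assumptions, not recommendations:

```python
def geodatabase_params(identifier: str, multi_az: bool = True) -> dict:
    """Illustrative RDS parameters for a PostgreSQL geodatabase."""
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": "postgres",               # PostGIS runs on RDS PostgreSQL
        "DBInstanceClass": "db.m6g.large",  # balanced memory/compute (example)
        "AllocatedStorage": 200,            # GiB, sized for the workload
        "MultiAZ": multi_az,                # standby replica in a second AZ
        "StorageEncrypted": True,
    }

def create_geodatabase(identifier: str):
    """Provision the instance (requires AWS credentials and permissions)."""
    import boto3
    rds = boto3.client("rds")
    return rds.create_db_instance(**geodatabase_params(identifier))
```

Toggling `multi_az` is all it takes to add the fault tolerance discussed above, which is the kind of selection-and-provisioning time RDS saves.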

Amazon Aurora provides an open source-compatible path to highly capable and performant relational databases. PostgreSQL or MySQL environments can be created with specific settings for the desired capabilities. Although this may mean converting data from a source format, such as Microsoft SQL Server or Oracle, the overall cost savings and simplified management make this an attractive option to modernize and right-size any geospatial database.

In addition to standard relational database options, AWS provides other services to manage and use geospatial data. Amazon Redshift is the fastest and most widely used cloud data warehouse and supports geospatial data through the geometry data type. Users can query spatial data using Redshift’s built-in SQL functions to find the distance between two points, interrogate polygon relationships, and provide other location insights into their data. Amazon DynamoDB is a fully managed, key-value NoSQL database with an SLA of up to 99.999% availability. For organizations leveraging MongoDB, Amazon DocumentDB provides a fully managed option for simplified instantiation and management. Finally, AWS offers the Amazon OpenSearch Service for petabyte-scale data storage, search, and visualization.
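DynamoDB itself has no geometry type, so a pattern commonly used by the geo-library community is to encode each point as a geohash and use it in the table keys, so that nearby points sort together and can be found with prefix queries. Below is a minimal, self-contained geohash encoder (the standard base32 bisection algorithm) plus a sketch of a point item; the table layout and attribute names are hypothetical:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 9) -> str:
    """Interleave longitude/latitude bisection bits into base32 characters."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, ch, even, out = 0, 0, True, []
    while len(out) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch = ch << 1
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:                 # 5 bits per base32 character
            out.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)

def point_item(name: str, lat: float, lon: float) -> dict:
    """Hypothetical DynamoDB item: coarse cell as partition, full hash as sort."""
    gh = geohash(lat, lon, 11)
    return {
        "cell": {"S": gh[:4]},   # partition key: coarse grid cell
        "gh": {"S": gh},         # sort key: prefix queries find neighbors
        "name": {"S": name},
        "lat": {"N": str(lat)},
        "lon": {"N": str(lon)},
    }

# boto3.client("dynamodb").put_item(TableName="places", Item=point_item(...))
```

For example, `geohash(57.64911, 10.40744, 11)` yields `"u4pruydqqvj"`, and any point sharing the `"u4pr"` partition key lies in the same coarse cell.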

The best part is that you don’t have to choose a single option for your geospatial environment. Often, companies find that different workloads benefit from having the ability to choose the most appropriate data landscape. Combining Infrastructure as a Service (IaaS) workloads with fully managed and purpose-built databases is not only possible but also a signature of a well-architected geospatial environment. Transactional systems may benefit from relational geodatabases, while mobile applications may be more aligned with NoSQL data stores. When you operate in a world of consumption-based resources, there is no downside to using the most appropriate data store for each workload. Having familiarity with the cloud options for storing geospatial data is crucial in strategic planning, which we will cover in the next topic.

Building your geospatial data strategy

One of the most important concepts to consider in your geospatial data strategy is the amount of change you are willing to accept in your technical infrastructure. This does not apply to new systems, but most organizations will have a treasure trove of geospatial data already. While lifting and shifting on-premises workloads to the cloud is advantageous, adapting your architecture to the cloud will amplify benefits in agility, resiliency, and cost optimization. For example, 95% of AWS customers elect to use open source geospatial databases as part of their cloud migration. This data conversion process, from vendor relational databases such as Oracle and Microsoft SQL Server to open source options such as PostgreSQL, enjoys a high degree of compatibility. This is an example of a simple change that can be made to eliminate significant license usage costs when migrating to the cloud. Simple changes such as these provide immediate and tangible benefits to geospatial practitioners in cloud architectures. Often, the same capabilities can be provided in AWS for a significantly reduced cost profile when comparing the cloud to on-premises GIS architectures.

All the same concepts and technologies you and your team are used to when operating an on-premises environment exist on AWS. Stemming from the consumption-based pricing model and broad set of EC2 instances available, AWS can offer a much more flexible model for the configuration and consumption of compute resources. Application servers used in geospatial environments can be migrated directly by selecting the platform, operating system, version, and dependencies appropriate for the given workload. Additional consideration should be given in this space to containerization where feasible. Leveraging containers in your server architecture can speed up environment migrations and provide additional scaling options.

Preventing unauthorized access

A key part of building your geospatial data strategy is determining the structure and security of your data. AWS Identity and Access Management (IAM) serves as the foundation for defining authorization and authentication mechanisms in your environment. Single Sign-On (SSO) is commonly used to integrate with existing directories to leverage pre-existing hierarchies and permission methodologies. The flexibility of AWS allows you to bring the existing security constructs while expanding the ability to monitor, audit, and rectify security concerns in your GIS environment. It is highly recommended to encrypt most data; however, the value of encrypting unaltered public data can be debated. Keys should be regularly rotated and securely handled in accordance with any existing policies or guidelines from your organization.
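As a minimal sketch of the least-privilege idea behind IAM, the snippet below builds a read-only policy document scoped to a single geospatial bucket. The action names and policy grammar are standard IAM; the bucket name and the decision to grant only read access are assumptions you would adapt:

```python
import json

def readonly_s3_policy(bucket: str) -> str:
    """IAM policy JSON granting read-only access to one geospatial bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # the bucket (for listing)
                    f"arn:aws:s3:::{bucket}/*",    # the objects within it
                ],
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

A policy like this could be attached to an analyst role so that GIS users can read published datasets while write access stays with the ingestion pipeline.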

As changes take place within your architecture, alerts and notifications provide critical insight to stewards of the environment. Amazon Simple Notification Service (SNS) can be integrated with any AWS service to send emails or text messages to the appropriate teams or individuals for optimized performance and security. Budgets and cost management alerts are native to AWS, making it easy to manage multiple accounts and environments based on your organization’s key performance indicators. Part of developing a cloud geospatial data strategy should be to internally ask where data issues are going unnoticed or not being addressed. By creating business rules, thresholds, and alerts, these data anomalies can notify administrators when specific areas within your data environment need attention.
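To make the business-rule idea concrete, here is a hedged sketch of a data quality alert delivered through SNS. The `publish` call is the real boto3 API; the subject format, message wording, and topic are hypothetical choices for illustration:

```python
def quality_alert(dataset: str, score: float, threshold: float) -> dict:
    """Build an SNS publish payload for a data quality threshold breach."""
    return {
        "Subject": f"[GIS data alert] {dataset} below quality threshold",
        "Message": (
            f"Dataset {dataset} scored {score:.1%}, "
            f"below the {threshold:.1%} threshold."
        ),
    }

def send_alert(topic_arn: str, dataset: str, score: float, threshold: float):
    """Publish the alert (assumes credentials and an existing SNS topic)."""
    import boto3
    sns = boto3.client("sns")
    return sns.publish(TopicArn=topic_arn, **quality_alert(dataset, score, threshold))
```

Wiring a check like this into a scheduled job is one way data anomalies stop going unnoticed: the moment a score crosses its threshold, the stewards of the environment get an email or text.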

The last mile in data consumption

Some commonly overlooked aspects of a geospatial data management strategy are the desktop end user tools that are necessary to manage and use the environment. Many GIS environments are dependent on high-powered desktop machines used by specialists. The graphics requirements for rendering spatial data into a consumable image can be high, and the data throughput must support fluid panning and zooming through the data. Complications can arise when the user has a high-latency connection to the data. Many companies learned this the hard way when remote workers during COVID tried to continue business as usual from home. Traditional geospatial landscapes were designed for the power users to be in the office. Gigabit connectivity was a baseline requirement, and network outages meant that highly paid specialists were unable to do their work.

Virtual desktops have evolved, and continue to evolve, to provide best-in-class experiences for power users who are not co-located with their data. Part of a well-architected geospatial data management strategy is to store once, use many times. This principle takes a backseat when access performance is unacceptable. A short-term fix is to cache the data locally, but that brings a host of other cost and concurrency problems. Virtual desktops or Desktop-as-a-Service (DaaS) address this problem by keeping the compute close to the data. The user can be thousands of miles away and still enjoy a fluid graphical experience. Amazon WorkSpaces and Amazon AppStream provide this capability in the cloud. WorkSpaces provides a complete desktop environment for Windows or Linux that can be configured exactly as your specialists have today. AppStream adds shortcuts to a specialist’s local desktop and streams the application’s visuals as if it were running natively. Having access to the native geospatial data management tools as part of a cloud-based architecture results in a more robust and cohesive overall strategy.

Leveraging your AWS account team

AWS provides corporate and organizational customers with a dedicated account team to help navigate the details of using cloud services. When it comes to migrating existing geospatial data, numerous incentive programs exist. Your AWS account team can help you identify areas where credits and other strategic incentives may apply to your situation. In addition to financial assistance, AWS has developed a robust methodology and processes for migrating data and workloads to the cloud. The AWS Migration Acceleration Program (MAP) draws on experience gained from thousands of enterprise customer migrations. MAP educates customers on the methodology, tools, partners, professional services, and investments that are available to customers. Whether AWS or a systems integrator (SI) partner provides the guidance, it is highly recommended to leverage this experience in your cloud data management strategy.

Now that we’ve covered the strategic side of things, let’s look at some best practices you can incorporate into your tactics for establishing a geospatial cloud landscape.

Geospatial data management best practices

The single most important consideration in a data management strategy is a deep understanding of the use cases the data is intended to support. Data ingestion workflows need to eliminate bottlenecks in write performance. Geospatial transformation jobs need access to powerful computational resources, and the ability to cache large amounts of data temporarily in memory. Analytics and visualization concerns require quick searching and the retrieval of geospatial data. These core disciplines of geospatial data management have benefitted from decades of fantastic work done by the community, which has driven AWS to create pathways to implement these best practices in the cloud.

Data – it’s about both quantity and quality

A long-standing anti-pattern of data management is to rely primarily on folder structures or table names to infer meaning about datasets. Having naming standards is a good thing, but it is not a substitute for a well-formed data management strategy. Naming conventions invariably change over time and are never fully able to account for the future evolution of data and the resulting taxonomy. In addition to the physical structure of the data, instrumenting your resources with predefined tags and metadata becomes crucial in cloud architectures. This is because AWS inherently provides capabilities to specify more information about your geospatial data, and many of the convenient tools and services are built to consume and understand these designations. Enriching your geospatial data with the appropriate metadata is a best practice in the cloud as it is for any GIS.

Another best practice is to quantify your data quality. Simply having a hunch that your data is good or bad is not sufficient. Mature organizations not only quantitatively describe the quality of their data with continually assessed scores but also track the scores to ensure that the quality of critical data improves over time. For example, if you have a dataset of addresses, it is important to know what percentage of the addresses are invalid. Hopefully, that percentage is 0, but very rarely is that the case. More important than having 100% accurate data is having confidence in what the quality of a given dataset is… today. Neighborhoods are being built every day. Separate buildings are torn down to create apartment complexes. Perfect data today may not be perfect data tomorrow, so the most important aspect of data quality is real-time transparency. A threshold should be set to determine the acceptable data quality based on the criticality of the dataset. High-priority geospatial data should require a high bar for quality, while infrequently used low-impact datasets don’t require the same focus. Categorizing your data based on importance allows you to establish guidelines by category. This approach will allow finite resources to be directed toward the most pressing concerns to maximize value.
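The scoring idea above can be sketched in a few lines. The validation rule (street and postal code must be present) and the per-criticality thresholds are illustrative assumptions; a real environment would plug in its own rules and continuously recompute the score:

```python
def quality_score(records, is_valid) -> float:
    """Fraction of records passing validation, from 0.0 to 1.0."""
    if not records:
        return 0.0
    return sum(1 for r in records if is_valid(r)) / len(records)

# Illustrative acceptance thresholds by dataset criticality.
THRESHOLDS = {"high": 0.99, "medium": 0.95, "low": 0.80}

def needs_attention(score: float, criticality: str) -> bool:
    """Flag a dataset whose score falls below its category's bar."""
    return score < THRESHOLDS[criticality]

# Example: an address is "valid" here if street and postal code are present.
addresses = [
    {"street": "100 Main St", "zip": "77002"},
    {"street": "", "zip": "77005"},           # missing street
    {"street": "12 Elm Ave", "zip": "77019"},
    {"street": "9 Oak Ct", "zip": ""},        # missing postal code
]
score = quality_score(addresses, lambda a: bool(a["street"] and a["zip"]))
# score == 0.5, so a "high" criticality dataset would be flagged
```

Running a check like this on a schedule gives the real-time transparency discussed above: the score is a number you can track over time, not a hunch.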

People, processes, and technology are equally important

Managing geospatial data successfully in the cloud relies on more than just the technology tools offered by AWS. Designating appropriate roles and responsibilities in your organization ensures that your cloud ecosystem will be sustainable. Avoid single points of failure with respect to skills or tribal knowledge of your environment. Having at least a primary and secondary person to cover each area will add resiliency to your people operations. Not only will this allow you to have more flexibility in coverage and task assignment but it also creates training opportunities within your team and allows team members to continually learn and improve their skills.

Next, let’s move on to talk about how to stretch your geospatial dollars to do more with less.

Cost management in the cloud

The easiest benefit to realize in a cloud-based geospatial environment is cost savings. While increased agility, hardened resiliency, improved performance, and reduced maintenance effort are also key benefits, they generally take some time to fully realize. Cost reductions and increased flexibility are immediately apparent when you leverage AWS for geospatial workloads. Charges for AWS resources are continuously visible in the AWS console, and the amount of expenditure can be adjusted in real time based on your business needs. In this section of the chapter, we will examine cost management tactics for the following areas:

  • Hardware provisioning
  • Geodatabase servers
  • File-based data
  • Geospatial application servers
  • End user compute services

Right-sizing, simplified

I recall many times in my career when I agonized for days over which server to buy. Buying a server is a big decision, and buying the wrong one can have real consequences. What if we’re more successful than projected and our user count doubles? What if we have more data than estimated? What if my processes consume more resources than expected? These are just a few of the questions that compel organizations to buy bigger servers than necessary. Of course, it makes sense to plan for the future, but what doesn’t make sense is paying for things you don’t use. This problem is only amplified when you bring resiliency and disaster recovery (DR) into the picture. I’ve designed enterprise GIS systems that have completely dormant standby instances for very expensive servers. In an on-premises data center, your "just-in-case" hardware has to be paid for even though it is not used. AWS provides a full range of DR capabilities without additional license or hardware overhead costs.

The elephant in the server room

One of the largest costs in a geospatial environment for many organizations is the relational geodatabase infrastructure. Hefty enterprise license costs, expensive hardware, and dedicated time from a specialist database administrator (DBA) and supporting resources add up quickly. Remember that having a cloned standby environment for critical systems may be required for production workloads. Power, cooling, and networking charges apply for on-premises environments.

A typical concern surrounding RDBMS migration to the cloud is performance, specifically as it relates to scale. The same week that I began working at AWS, a multi-year effort across all Amazon companies was wrapping up. All of Amazon’s internal databases were modernized, many of which were converted from Oracle. Alexa, Amazon Prime, Amazon Prime Video, Amazon Fresh, Kindle, Amazon Music, Audible, Shopbop, Twitch, and Zappos were the customer-facing brands that were part of the migration, resulting in Amazon turning off its final Oracle database in October 2019. The scale of internal databases at Amazon is mind-boggling, but the migration of 75 petabytes was realized with little or no downtime. Resulting reductions in database costs were over 60%, coupled with latency performance improvements of 40%. This project was enormous in scale, and the cost savings have been enormous as well.

Bird’s-eye view on savings

Raster data is being collected with increasing frequency and resolution. Sentinel-2, for example, captures satellite imagery of the entire globe every 5 days. The quality of the images continues to improve, as does the file size. Storing long histories of file-based data is commonplace as historical images and data may someday be needed. Corporations may have legal obligations to retain the data. Whatever the reason, storing data generates costs. Those costs increase as the size and volume of data increase. Raster geospatial data is notoriously large and commonly stored in enterprise filesystems. When organizations have multiple copies of data for different purposes or departments, the sustained long-term expenses can be exorbitant.

The costs associated with storing large volumes of file-based data in AWS are completely under the customer’s control. Amazon S3 is simple to use and compatible with any geospatial data format. In fact, some formats that we’ll talk more about later in this book perform best in the cloud. Consolidating geospatial data to a platform with fine-grained life cycle options can be a cost game-changer. Data can be optimized for both performance and cost at the same time using an S3 Lifecycle configuration. These storage classification rules will price data differently based on the usage pattern. A great example of geospatial data is Extract, Transform, and Load (ETL) staging datasets. Processing jobs may leave behind transient datasets as large as the source data, and possibly multiple copies of them for each process run. Historical data may be accessed frequently for dates within the most recent month, but rarely for older data. Another great use case for an S3 Lifecycle configuration is data that is meant to be archived initially for the lowest long-term storage cost.

Amazon S3 provides automated rules that move files between various pricing models. The rules are customer-defined and can be changed at any time in the AWS console. Using just a few simple clicks, it is possible to realize massive storage cost savings. Most geospatial data brought into AWS starts in the S3 Standard storage class. This feature-rich, general-purpose option provides 99.999999999% (11 9s) of durability for a few cents per GB per month. While this is affordable, the S3 Glacier Deep Archive storage class is designed for long-term archives accessed infrequently for just $0.00099 per GB per month. IT backups of geospatial databases and filesystems are prime candidates for S3 Glacier Deep Archive. Details of each storage class in between, associated use cases, and pricing are available on the S3 pricing page. There is also an "easy button" to optimize your S3 Lifecycle using Intelligent-Tiering. The key takeaway is that file-based storage, mainly raster geospatial data, can be stored in the cloud for a fraction of on-premises costs. When it comes to geospatial cost management strategy, file-based data storage classification can yield tremendous cost savings.
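A lifecycle configuration for the two use cases above (aging ETL staging data out, archiving backups immediately) might be sketched like this. `put_bucket_lifecycle_configuration` and the storage class names are the real S3 API; the bucket layout, prefixes, and day counts are illustrative assumptions:

```python
def lifecycle_rules(prefix: str) -> dict:
    """Illustrative S3 Lifecycle rules for a geospatial bucket layout."""
    return {
        "Rules": [
            {   # ETL staging data: tier down over time, then expire it.
                "ID": "tier-down-staging",
                "Filter": {"Prefix": f"{prefix}/staging/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            },
            {   # Backups: archive almost immediately at the lowest cost.
                "ID": "archive-backups",
                "Filter": {"Prefix": f"{prefix}/backups/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 1, "StorageClass": "DEEP_ARCHIVE"}],
            },
        ]
    }

def apply_lifecycle(bucket: str, prefix: str) -> None:
    """Attach the rules to the bucket (requires AWS credentials)."""
    import boto3
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=lifecycle_rules(prefix)
    )
```

Once attached, the rules run automatically; no job scheduler or manual cleanup is needed, which is what makes classification such an easy cost win.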

Can’t we just add another server?

Application servers are the workhorse of a robust GIS architecture. Specialized servers deliver web map services, imagery, geoprocessing, and many other compute-intensive capabilities. While the number of application servers in an architecture generally outnumbers database servers, the storage and networking performance requirements tend to be lower. These are cheaper machines that perform specific tasks. Horizontal scaling is commonly used by providing multiple servers that can each execute independent, parallel tasks. Resource demands and traffic patterns tend to be erratic and spiky, resulting in underutilized CPU and GPU cores.

Launching geospatial application server capabilities on AWS can be done in a number of ways, but the most common is EC2 migration. If you have a geoprocessing server that supports horizontal scaling, it may be possible to add processing nodes in AWS to your existing server pool. Over time, you can adjust the utilization of servers based on the requirements and cost profile. Cloud servers can be deactivated when not in use to stop compute charges, and the EC2 Spot pricing option provides a flexible way to get instances at a discount of up to 90% compared to on-demand prices. AWS Auto Scaling provides multiple options to control how and when servers start up and shut down based on demand requirements. If you have an always-on server dedicated to a monthly process that takes 8 hours, roughly 99% of your server capacity sits unutilized. Understanding the steady-state processing profile of your geospatial environment allows you to identify where cost-saving compute opportunities exist. By leveraging AWS applied to these compute profiles, you’ll be able to get more processing done in less time, and at a lower cost.
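The utilization arithmetic is easy to check, assuming an always-on server and an average of about 730 hours in a month:

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)

def idle_fraction(busy_hours_per_month: float) -> float:
    """Fraction of an always-on server's capacity that goes unused."""
    return 1 - busy_hours_per_month / HOURS_PER_MONTH

# An 8-hour monthly job leaves the machine idle almost 99% of the time,
# which is exactly the capacity that on-demand or Spot pricing eliminates.
monthly_job_idle = idle_fraction(8)
```

The same function applied to your own job durations shows where scheduled start/stop or Auto Scaling would pay off first.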

Additional savings at every desk

I have worked in the energy technology industry since before Y2K was in the headlines. One interesting cultural phenomenon I’ve seen in the workplace occurs among geoscientists, engineers, technical SMEs, and others who do specialized compute-intensive work. The power of your work machine is a badge of honor, where the highest regarded professionals are awarded more CPUs or an additional monitor. While this philosophy attracted the best petrophysicists, geologists, and other specialists, it generated significant costs. Infrequently utilized workstations sat idle, not returning any value for their purchase. This scenario is highly likely in volatile industries where layoffs are frequent and contractor usage is high. Imagine the waste if the machine is turned on 24/7, even if it is only used for a few hours a week.

DaaS provides a flexible cost-saving solution for underutilized workstations. By provisioning your workstation in the cloud, you can take advantage of larger amounts of processing power and only pay for the hours consumed. Windows license portability applies in some cases, and you can select configurations such as the GraphicsPro bundle, which packs a whopping 16 vCPUs with an additional GPU, 8 GiB of video memory, and 122 GiB of general memory. At the time of writing, that machine would cost around $100 per month if only used for a few hours of monthly geospatial analysis (including the Windows license). Additional savings can be realized through reduced IT administration. AWS handles the hardware and service management, leaving the customer in charge of machine images, applications, and security administration.

As described in the preceding paragraph, the AWS cloud provides powerful and flexible services that help costs align with your geospatial activities. It all starts by building the right cloud strategy and establishing an empowered team to discover and operationalize digital innovation. Endorsement or sponsorship from forward-looking executives has proven to be correlated with success in cloud technology projects. There are new ways to get things done in the cloud that can be a fraction of the cost of traditional methods. All of these concepts factor into your evergreen geospatial computing strategy and result in better geospatial data and insights delivered to your end users.

Summary

Throughout this overview chapter, you have learned how the cloud can be a beneficial approach or addition to your geospatial environment. The intent of this chapter is to outline key concepts that will help you be successful in creating a technical landscape in AWS for geospatial analytics. We’ve covered high-level technologies that will be further explained with examples later in this book. In Chapter 2, Quality and Temporal Geospatial Data Concepts, we will finish up Part 1 with a deeper look into the importance of high-quality geospatial data over time. More specifically, we will explain how the AWS cloud removes barriers to working with the richest, most complete geospatial datasets.


Key benefits

  • Explore the architecture and different use cases to build and manage geospatial data lakes in AWS
  • Discover how to leverage AWS purpose-built databases to store and analyze geospatial data
  • Learn how to recognize which anti-patterns to avoid when managing geospatial data in the cloud

Description

Managing geospatial data and building location-based applications in the cloud can be a daunting task. This comprehensive guide helps you overcome this challenge by presenting the concept of working with geospatial data in the cloud in an easy-to-understand way, along with teaching you how to design and build data lake architecture in AWS for geospatial data. You’ll begin by exploring the use of AWS databases like Redshift and Aurora PostgreSQL for storing and analyzing geospatial data. Next, you’ll leverage services such as DynamoDB and Athena, which offer powerful built-in geospatial functions for indexing and querying geospatial data. The book is filled with practical examples to illustrate the benefits of managing geospatial data in the cloud. As you advance, you’ll discover how to analyze and visualize data using Python and R, and utilize QuickSight to share derived insights. The concluding chapters explore the integration of commonly used platforms like Open Data on AWS, OpenStreetMap, and ArcGIS with AWS to enable you to optimize efficiency and provide a supportive community for continuous learning. By the end of this book, you’ll have the necessary tools and expertise to build and manage your own geospatial data lake on AWS, along with the knowledge needed to tackle geospatial data management challenges and make the most of AWS services.

What you will learn

  • Discover how to optimize the cloud to store your geospatial data
  • Explore management strategies for your data repository using AWS Single Sign-On and IAM
  • Create effective SQL queries against your geospatial data using Athena
  • Validate postal addresses using Amazon Location Service
  • Process structured and unstructured geospatial data efficiently using R
  • Use Amazon SageMaker to enable machine learning features in your application
  • Explore the free and subscription satellite imagery data available for use in your GIS


Table of Contents

23 Chapters
Preface
Part 1: Introduction to the Geospatial Data Ecosystem
Chapter 1: Introduction to Geospatial Data in the Cloud
Chapter 2: Quality and Temporal Geospatial Data Concepts
Part 2: Geospatial Data Lakes using Modern Data Architecture
Chapter 3: Geospatial Data Lake Architecture
Chapter 4: Using Geospatial Data with Amazon Redshift
Chapter 5: Using Geospatial Data with Amazon Aurora PostgreSQL
Chapter 6: Serverless Options for Geospatial
Chapter 7: Querying Geospatial Data with Amazon Athena
Part 3: Analyzing and Visualizing Geospatial Data in AWS
Chapter 8: Geospatial Containers on AWS
Chapter 9: Using Geospatial Data with Amazon EMR
Chapter 10: Geospatial Data Analysis Using R on AWS
Chapter 11: Geospatial Machine Learning with SageMaker
Chapter 12: Using Amazon QuickSight to Visualize Geospatial Data
Part 4: Accessing Open Source and Commercial Platforms and Services
Chapter 13: Open Data on AWS
Chapter 14: Leveraging OpenStreetMap on AWS
Chapter 15: Feature Servers and Map Servers on AWS
Chapter 16: Satellite and Aerial Imagery on AWS
Index
Other Books You May Enjoy

