This chapter is an overview of geospatial analysis and will cover the following topics:
How geospatial analysis is impacting our world
A history of geospatial analysis including Geographic Information Systems (GIS) and remote sensing
Reasons for using a programming language for geospatial analysis
Importance of more people learning geospatial analysis
Remote sensing concepts
Creating the simplest possible GIS using Python
This book assumes some basic knowledge of Python, some IT literacy, and at least an awareness of geospatial analysis. This chapter provides a foundation in geospatial analysis, needed to attack any subject in the areas of remote sensing and GIS including the material in all the other chapters of the book.
The morning of November 7, 2012, saw political experts in the United States scrambling to explain how incumbent Democratic President, Barack Obama, had pulled off such a decisive election victory. They scrambled because none of them had seen the win comingâat least not the 332 electoral college votes for Obama, to Republican candidate Mitt Romney's anemic 206. The major political polling organizations had also unanimously declared the race would be a photo finish in the weeks leading up to the election.
Political experts offered broad explanations including "a better ground campaign" by Obama, "demographic shifts" that favored the Democrats, and even accusations of a weakened Republican Party brand. But these generalized theories fell far short of explaining the results in any satisfying detail. The following map shows the electoral votes received by each candidate:
The explanation for the political upset came instead from a 34 year old blogger from Michigan, named Nate Silver. Armed with only a laptop, he had predicted the exact outcome long before the election day, and he had done so with startling precision.
Both election campaigns calculated multiple winning scenarios which followed a path of winning certain key battleground states. The battleground states are also known as swing states, because neither candidate had overwhelming support from that state going into the election. These states included Colorado, Florida, Iowa, Nevada, New Hampshire, North Carolina, Ohio, Virginia, and Wisconsin. But Silver had called these states accurately as if they had been known all along.
Silver's method for predicting the future can be summed up as geostatistical profiling. He used geographic analysis to fill in gaps in polling data that caused other analysts to have inaccurate predictions. Large polling organizations poll states on a rolling but irregular basis leading up to elections. Furthermore, different organizations use different polling approaches. Silver first weighted these pollsters based on their historical accuracy and calculated an error rate.
He could then average polls together and account for potential error. His second innovation was to profile states based on historical voting trends and demographics. He could then classify similar states and even voting districts. Anywhere he was missing polling data from a particular state, he could find surrogate data from a similar state and extrapolate to complete his data set. The combination of careful weighting and extrapolation allowed Silver to run a more robust national voting model which paid off. Interestingly, Silver's political models use many of the same elements of probability theory used in his PECOTA software he had developed earlier for baseball but with a geospatial twist. The following plot shows an accuracy comparison of researchers and political experts. The analysts using geospatial techniques led the pack by a wide margin.
It would be one thing if Nate Silver had been the only one to come up with such an accurate prediction. But he was just the most visible due to his high-profile blog on the New York Times, and his articulate and detailed posts about his methods. He recognized many other analysts including Sam Wang of the Princeton Election Consortium and David Linzer of Emory University, who used similar geostatistical methods and achieved highly accurate results. Silver was on the crest of a wave of geospatial analysts who were bringing the field to the forefront of national attention through detailed, objective, and corrective spatial and statistical modeling.
An economist and statistician named Skipper Seabold attempted to reverse engineer the FiveThirtyEight model using Python. His efforts can be found at the following URL:
The application of geospatial modeling to politics is one of the most recent and visible case studies. However, the use of geospatial analysis has been increasing steadily over the last 15 years. In 2004, the US Department of Labor declared the geospatial industry one of 13 high-growth industries in the United States expected to create millions of jobs in the coming decades.
Geospatial analysis can be found in almost every industry including real estate, oil and gas, agriculture, defense, disaster management, health, transportation, and oceanography to name a few. For a good overview of how geospatial analysis is used in dozens of different industries visit: http://www.esri.com/what-is-gis/who-uses-gis.
Geospatial analysis can be traced as far back as 15,000 years ago, to the Lascaux Cave in southwestern France. In that cave, paleolithic artists painted commonly hunted animals and what many experts believe are astronomical star maps for either religious ceremonies or potentially even migration patterns of prey. Though crude, these paintings demonstrate an ancient example of humans creating abstract models of the world around them and correlating spatial-temporal features to find relationships. The following image shows one of the paintings with an overlay illustrating the star maps:
Over the centuries the art of cartography and the science of land surveying developed, but it wasn't until the 1800s that significant advances in geographic analysis emerged. Deadly cholera outbreaks in Europe between 1830 and 1860 led geographers in Paris and London to use geographic analysis for epidemiological studies.
In 1832, Charles Picquet used different half-toned shades of gray to represent deaths per thousand citizens in the 48 districts of Paris, as part of a report on the cholera outbreak. In 1854, John Snow expanded on this method by tracking a cholera outbreak in London as it occurred. By placing a point on a map of the city each time a case was diagnosed, he was able to analyze the clustering of cholera cases. Snow traced the disease to a single water pump and prevented further cases. The map has three layers with streets, an X for each pump, and dots for each cholera outbreak:
A retired French engineer named Charles Minard produced some of the most sophisticated infographics ever drawn between 1850 and 1870. The term infographics is too generic to describe these drawings because they have strong geographic components. The quality and detail of these maps make them fantastic examples of geographic information analysis even by today's standards. Minard released his masterpiece Carte figurative des pertes successives en hommes de l'ArmÃ©e FranÃ§aise dans la campagne de Russie 1812-1813, in 1869, depicting the decimation of Napoleon's army in the Russian campaign of 1812. The map shows the size and location of the army over time, along with prevailing weather conditions. The following graphic contains four different series of information on a single theme. It is a fantastic example of geographic analysis using pen and paper. The size of the army is represented by the widths of the brown and black swaths at a ratio of one millimeter for every 10,000 men. The numbers are also written along the swaths. The brown-colored path shows soldiers who entered Russia, while the black represents the ones who made it out. The map scale is shown on the center right as one "French league" (2.75 miles or 4.4 kilometers). The chart on the bottom runs from right to left and depicts the brutal freezing temperatures experienced by the soldiers on the return march home from Russia.
While far more mundane than a war campaign, Minard released another compelling map cataloguing the number of cattle sent to Paris from around France. Minard used pie charts of varying sizes in the regions of France to show each area's variety and volume of cattle shipped.
In the early 1900s, mass printing drove the development of the concept of map layersâa key feature of geospatial analysis. Cartographers drew different map elements (vegetation, roads, elevation contours) on plates of glass which could then be stacked and photographed for printing as a single image. If the cartographer made a mistake, only one plate of glass had to be changed instead of the entire map. Later the development of plastic sheets made it even easier to create, edit, and store maps in this manner. However, the layering concept for maps as a benefit to analysis would not come into play until the modern computer age.
Computer mapping evolved with the computer itself in the 1960s. But the origin of the term Geographic Information System (GIS) began with the Canadian Department of Forestry and Rural Development. Dr. Roger Tomlinson headed a team of 40 developers in an agreement with IBM to build the Canadian Geographic Information System (CGIS). The CGIS tracked the natural resources of Canada and allowed profiling of these features for further analysis. The CGIS stored each type of land cover as a different layer. The CGIS also stored data in a Canadian-specific coordinate system suitable for the entire country devised for optimal area calculations. While the technology used is primitive by today's standards, the system had phenomenal capability at that time. The CGIS included software features which seem quite modern: map projection switching, rubber sheeting of scanned images, map scale change, line smoothing and generalization to reduce the number of points in a feature, automatic gap closing for polygons, area measurement, dissolving and merging of polygons, geometric buffering, creation of new polygons, scanning, and digitizing of new features from reference data.
The National Film Board of Canada produced a 1967 documentary on the CGIS which can be seen at the following URL:
Tomlinson is often called "The Father of GIS". After launching the CGIS, he earned his doctorate from the University of London with his 1974 dissertation, entitled The application of electronic computing methods and techniques to the storage, compilation, and assessment of mapped data, which describes GIS and geospatial analysis. Tomlinson now runs his own global consulting firm, Tomlinson Associates Ltd., and remains an active participant in the industry. He is often found delivering the keynote address at geospatial conferences.
CGIS is the starting point of geospatial analysis as defined by this book. But this book would not have been written if not for the work of Howard Fisher and the Harvard Laboratory for Computer Graphics and Spatial Analysis, at the Harvard Graduate School of Design. His work on the SYMAP GIS software, which outputs maps to a line printer, started an era of development at the lab, which produced two other important packages and as a whole permanently defined the geospatial industry. GRID was a raster-based GIS system which used cells to represent geographic features instead of geometry. GRID was written by Carl Steinitz and David Sinton. The system later became IMGRID. Next came ODYSSEY. ODYSSEY was a team effort led by Nick Chrisman and David White. It was a system of programs which included many advanced geospatial data management features typical of modern geodatabase systems. Harvard attempted to commercialize these packages with limited success. However, their impact is still seen today. Virtually every existing commercial and open source package owes something to these code bases.
Howard Fisher produced a 1967 film using output from SYMAP to show the urban expansion of Lansing, Michigan from 1850 to 1965 by hand-coding decades of property information into the system. The analysis took months but would take only a few minutes to create now using modern tools and data. You can see the film at the following URL:
There are now dozens of graphical user interface geospatial desktop applications available today from companies including Esri, ERDAS, Intergraph, and ENVI to name a few. Esri is the oldest continuously operating GIS software company, which started in the late 1960s. In the open source realm, packages including Quantum GIS (QGIS) and GRASS are widely used. Beyond comprehensive desktop software packages, software libraries for building new software exist in the thousands.
Remote sensing is the collection of information about an object without making physical contact with that object. In the context of geospatial analysis, the object is usually the Earth. Remote sensing also includes the processing of the collected information. The potential of geographic information systems is limited only by the available geographic data. The cost of land surveying, even using a modern GPS, to populate a GIS has always been resource intensive. The advent of remote sensing not only dramatically reduced that cost of geospatial analysis, but it took the field in entirely new directions. In addition to powerful reference data for GIS systems, remote sensing has made possible the automated and semi-automated generation of GIS data by extracting features from images and geographic data.
The eccentric French photographer Gaspard-FÃ©lix Tournachon, also known as Nadar, took the first aerial photograph in 1858 from a hot air balloon over Paris. The value of a true bird's eye view of the world was immediately apparent. As early as 1920, the books on aerial photo interpretation began to appear.
When America entered the cold war with the Soviet Union after World War II, aerial photography for monitoring military capability became prolific with the invention of the American U2 spy plane. The U2 spy plane could fly at 75,000 feet, putting it out of range of existing anti-aircraft weapons designed to reach only 50,000 feet. The American U2 flights over Russia ended when the Soviets finally shot down a U2 and captured the pilot.
But aerial photography had little impact on modern geospatial analysis. Planes could only capture small footprints of an area. Photographs were tacked to walls or examined on light tables but not in the context of other information. Though extremely useful, aerial photo interpretation was simply another visual perspective.
The game changer came on October 4, 1957, when the Soviet Union launched the Sputnik 1 satellite. The Soviets had scrapped a much more complex and sophisticated satellite prototype because of manufacturing difficulties. Once corrected, this prototype would later become Sputnik 3. They opted instead for a simple metal sphere with 4 antennae and a simple radio transmitter. Other countries including the United States were also working on satellites. The satellite initiatives were not entirely a secret. They were driven by scientific motives as part of the International Geophysical Year. Advancement in rocket technology made artificial satellites a natural evolution for earth science. However, in nearly every case each country's defense agency was also heavily involved. Like the Soviets, other countries were struggling with complex satellite designs packed with scientific instruments. The Soviets' decision to switch to the simplest possible device for the sole reason of launching a satellite before the Americans was effective. Sputnik was visible in the sky as it passed over and its radio pulse could be heard by amateur radio operators. Despite Sputnik's simplicity, it provided valuable scientific information which could be derived from its orbital mechanics and radio frequency physics.
The Sputnik program's biggest impact was on the American space program. America's chief adversary had gained a tremendous advantage in the race to space. The United States ultimately responded with the Apollo moon landings. But, before that, the US launched a program that would remain a national secret until 1995. The classified CORONA program resulted in the first pictures from space. The US and Soviet Union had signed an agreement to end spy plane flights but satellites were conspicuously absent from the negotiations. The following map shows the CORONA process. Dashed lines are satellite flight paths, longer white tubes are the satellite, the smaller white cones are the film canisters, and the black blobs are the control stations that triggered the ejection of the film so a plane could catch it in the sky.
The first CORONA satellite was a four year effort with many setbacks. But the program ultimately succeeded. The difficulty of satellite imaging even today is retrieving the images from space. The CORONA satellites used canisters of black and white film which were ejected from the vehicle once exposed. As the film canister parachuted to earth, a US military plane would catch the package in midair. If the plane missed the canister it would float for a brief duration in the water before sinking into the ocean to protect the sensitive information. The US continued to develop the CORONA satellites until they matched the resolution and photographic quality of the U2 spy plane photos. The primary disadvantages of the CORONA instruments were reusability and timeliness. Once out of film a satellite could no longer be of service. Also, the film recovery was on a set schedule making the system unsuitable to monitor real-time situations. The overall success of the CORONA program, however, paved the way for the next wave of satellites, which ushered in the modern era of remote sensing.
Because of the CORONA program's secret status, its impact on remote sensing was indirect. Photographs of the earth taken on manned US space missions inspired the idea of a civilian-operated remote sensing satellite. The benefits of such a satellite were clear but the idea was still controversial. Government officials questioned whether a satellite was as cost efficient as aerial photography. The military were worried the public satellite could endanger the secrecy of the CORONA program. And yet other officials worried about the political consequences of imaging other countries without permission. But the Department of the Interior finally won permission for NASA to create a satellite to monitor earth's surface resources.
On July 23, 1972, NASA launched the Earth Resources Technology Satellite (ERTS). The ERTS was quickly renamed to Landsat-1. The platform contained two sensors. The first was the Return Beam Vidicon (RBV) sensor, which was essentially a video camera. It was even built by the radio and television giant RCA. The RBV immediately had problems including disabling the satellite's altitude guidance system. The second attempt at a satellite was the highly experimental Multi-Spectral Scanner or MSS. The MSS performed flawlessly and produced superior results to the RBV. The MSS captured four separate images at four different wavelengths of the light reflected from the earth's surface.
This sensor had several revolutionary capabilities. The first and most important capability was the first global imaging of the planet scanning every spot on the earth every 16 days. The following image from the US National Aeronautics and Space Administration (NASA) illustrates this flight and collection pattern:
It also recorded light beyond the visible spectrum. While it did capture green and red light visible to the human eye, it also scanned near-infrared light at two different wavelengths not visible to the human eye. The images were stored and transmitted digitally to three different ground stations in Maryland, California, and Alaska. The multispectral capability and digital format meant the aerial view provided by Landsat wasn't just another photograph from the sky. It was beaming down data. This data could be processed by computers to output derivative information about the earth in the same way a GIS provided derivative information about the earth by analyzing one geographic feature in the context of another. NASA promoted the use of Landsat worldwide and made the data available at very affordable prices to anyone who asked.
This global imaging capability led to many scientific breakthroughs including the discovery of previously unknown geography as late as 1976. Using Landsat imagery the government of Canada located a tiny uncharted island inhabited by polar bears. They named the new landmass Landsat Island.
Landsat-1 was followed by six other missions and turned over to the National Oceanic and Atmospheric Administration (NOAA) as the responsible agency. Landsat-6 failed to achieve orbit due to a ruptured manifold, which disabled its maneuvering engines. During some of those missions the satellites were managed by the company EOSAT, now called Space Imaging, but returned to government management by the Landsat-7 mission. The following image from NASA is a sample of a Landsat 7 product:
The Landsat Data Continuity Mission (LDCM) launched February 13, 2013 and began collecting images on April 27, 2013 as part of its calibration cycle to become Landsat 8. The LDCM is a joint mission between NASA and the United States Geological Survey (USGS).
A Digital Elevation Model (DEM) is a three-dimensional representation of a planet's terrain. Within the context of this book that planet is Earth. The history of digital elevation models is far less complicated than remotely-sensed imagery but no less significant. Before computers, representations of elevation data were limited to topographic maps created through traditional land surveys. Technology existed to create 3D models from stereoscopic images or physical models from materials such as clay or wood, but these approaches were not widely used for geography.
The concept of digital elevation models began in 1986 when the French space agency, CNES, launched its SPOT-1 satellite which included a stereoscopic radar. This system created the first usable DEM. Several other US and European satellites followed this model with similar missions. In February 2000 the Space Shuttle Endeavour conducted the Shuttle Radar Topography Mission (SRTM), which collected elevation data over 80 percent of the earth's surface using a special radar antenna configuration that allowed a single pass. This model was surpassed in 2009 by the joint US and Japanese mission using the ASTER sensor aboard NASA's TERRA satellite. This system captured 99 percent of the earth's surface but has proven to have minor data issues. SRTM remains the gold standard. The following image from the US Geological Survey (USGS) shows a colorized DEM known as a hillshade. Greener areas are lower elevations while yellow and brown areas are mid-range to high elevations:
Recently more ambitious attempts at a worldwide elevation data set are underway in the form of TerraSAR-X and TanDEM-X satellites launched by Germany in 2007 and 2010, respectively. These two radar elevation satellites are working together to produce a global DEM, called WorldDEM, planned for release in 2014. This data set will have a relative accuracy of 2 meters and an absolute accuracy of 10 meters.
Computer-aided drafting (CAD) is worth mentioning, though it does not directly relate to geospatial analysis. The history of CAD system development parallels and intertwines with the history of geospatial analysis. CAD is an engineering tool used to model two- and three-dimensional objects usually for engineering and manufacturing. The primary difference between a geospatial model and a CAD model is a geospatial model is referenced to the earth, whereas a CAD model can possibly exist in abstract space. For example, a 3D blueprint of a building in a CAD system would not have a latitude or longitude. But in a GIS, the same building model would have a location on the earth. However, over the years CAD systems have taken on many features of GIS systems and are commonly used for smaller GIS projects. And likewise, many GIS programs can import CAD data which have been georeferenced. Traditionally, CAD tools were designed primarily for engineering data that were not geospatial.
However, engineers who became involved with geospatial engineering projects, such as designing a city utility electric system, would use the CAD tools they were familiar with to create maps. Over time both GIS software evolved to import the geospatial-oriented CAD data produced by engineers, and CAD tools evolved to better support geospatial data creation and better compatibility with GIS software. AutoCAD by AutoDesk and ArcGIS by Esri were the leading commercial packages to develop this capability and the GDAL OGR library developers added CAD support as well.
Modern geospatial analysis can be conducted with the click of a button in any of the easy-to-use commercial or open source geospatial packages. So then why would you want to use a programming language to learn this field? The most important reasons are:
You want complete control of the underlying algorithms, data, and execution.
You want to automate a specific, repetitive analysis task with minimal overhead
You want to create a program that's easy to share
You want to learn geospatial analysis beyond pushing buttons in software
The geospatial industry is gradually moving away from the traditional workflow in which teams of analysts use expensive desktop software to produce geospatial products. Geospatial analysis is being pushed towards automated processes which reside in the cloud. End user software is moving towards task-specific tools, many of which are accessed from mobile devices. Knowledge of geospatial concepts and data as well as the ability to build custom geospatial processes are where the geospatial work in the near future lies.
Object-oriented programming is a software development paradigm in which concepts are modeled as objects which have properties and behaviors represented as attributes and methods, respectively. The goals of this paradigm are more modular software in which one object can inherit from one or more other objects to encourage software reuse.
The Python programming language is known for its ability to serve multiple roles as a well-designed, object-oriented language, a procedural scripting language, or even a functional programming language. However, you never completely abandon object-oriented programming in Python because even its native data types are objects and all Python libraries, known as modules, adhere to a basic object structure and behavior.
Geospatial analysis is the perfect activity for object-oriented programming. The concepts modeled in geospatial analysis are, well, objects! The domain of geospatial analysis is the Earth and everything on it. Trees, buildings, rivers, and people are all examples of objects within a geospatial system.
A common example in literature for newcomers to object-oriented programming is the concrete analogy of a cat. Books on object-oriented programming frequently use some form of the following example:
Imagine you are looking at a cat. We know some information about the cat, such as its name, age, color, and size. These features are properties of the cat. The cat also exhibits behaviors such as eating, sleeping, jumping, and purring. In object-oriented programming, objects have properties and behaviors too. You can model a real-world object like the cat in our example, or something more abstract such as a bank account.
Most concepts in object-oriented programming are far more abstract than the simple cat paradigm or even the bank account in this common example. However, in geospatial analysis the objects modeled remain concrete, like the simple cat analogy, and in many cases are cats. Geospatial analysis allows you to continue with the simple cat analogy and even visualize it. The following map represents the feral cat population of Australia using data provided by the Atlas of Living Australia:
Geospatial analysis helps people make better decisions. It doesn't make the decision for you, but it can answer critical questions which are at the heart of the choice to be made and often cannot be answered any other way. Until recently geospatial technology and data were tools available only to governments, and well-funded researchers. But in the last decade data have become much more widely available and software much more accessible to anyone.
In addition to freely available government satellite imagery, many local governments now conduct aerial photo surveys and make the data available online. The ubiquitous Google Earth provides a cross-platform spinning globe view of the Earth with satellite and aerial data, streets, points of interest, photographs, and much more. Google Earth users can create custom KML files, which are XML files to load and style data onto the globe. This program and similar tools are often called geographic exploration tools, because they are excellent data viewers but provide very limited data analysis capability.
The ambitious OpenStreetMap project (http://openstreetmap.org) is a crowd-sourced, worldwide, geographic basemap containing most layers commonly found in a GIS. Nearly every mobile phone contains a GPS now, along with mobile apps to collect GPS tracks as points, lines, or polygons. Most phones will also tag photos taken with the phone's camera with a GPS coordinate. In short, anyone can be a geospatial analyst.
The global population has reached seven billion people. And the world is changing faster than ever before. The planet is undergoing environmental changes never seen before in recorded history. Faster communication and faster transportation increase the interaction between us and the environment in which we live. Managing people and resources safely and responsibly is more challenging than ever. Geospatial analysis is the best approach to understanding our world more efficiently and deeply. The more politicians, activists, relief workers, parents, teachers, first responders, medical professionals, and small businesses harness the power of geospatial analysis the more our potential for a better, healthier, safer, fairer world will be realized.
In order to begin geospatial analysis, it is important to understand some key underlying concepts unique to the field. The list isn't long but nearly every aspect of analysis traces back to one of these ideas.
A thematic map portrays a specific theme as its name suggests. A general reference map visually represents features as they relate geographically for navigation or planning. A thematic map goes beyond location to provide the geographic context for information around a central idea. Usually a thematic map is designed for a targeted audience to answer specific questions. The value of thematic maps lies in what they do not show. A thematic map will use minimal geographic features to avoid distracting the reader from the theme. Most thematic maps include political boundaries such as country or state borders but omit navigational features, such as street names or points of interest beyond major landmarks which orient the reader. The cholera map earlier in this chapter is a perfect example of a thematic map. Common uses for thematic maps are visualizing health issues, such as disease, election results, and environmental phenomena such as rainfall. These maps are also the most common output of geospatial analysis. The following map from the US Census Bureau shows cancer mortality rates by state:
Thematic maps tell a story and are very useful. However, it is important to remember that while thematic maps are models of reality like any other map, they are also generalizations of information. Two different analysts using the same source information will often come up with very different thematic maps depending on how they analyze and summarize the data. The technical nature of thematic maps often leads people to treat them as if they are scientific evidence. But geospatial analysis is never conclusive. While the analysis may be based on scientific data the analyst does not follow the rigor of the scientific method. In his classic book How to Lie with Maps, Mark Monmonier demonstrates in great detail how maps are easily manipulated models of reality, which are commonly abused. This fact doesn't degrade the value of these tools. The legendary statistician George Box wrote in his 1987 book Empirical Model-Building and Response Surfaces, "Essentially, all models are wrong, but some are useful." Thematic maps have been used as guides to start (and end) wars, stop deadly disease in its tracks, win elections, feed nations, fight poverty, protect endangered species, and rescue those impacted by disaster. Thematic maps may be the most useful models ever created.
In its purest form, a database is simply an organized collection of information. A database management system (DBMS) is an interactive suite of software that can interact with a database. People often use the word "database" as a catch-all term referring to both the DBMS and the underlying data structure. Databases typically contain alpha-numeric data and in some cases binary large objects, or blobs, which can store binary data, such as images. Most databases also allow a relational database structure in which entries in normalized tables can be referenced to each other to create many-to-one and one-to-many relationships among data.
Spatial databases use specialized software to extend a traditional relational DBMS or RDMS to store and query data defined in two-dimensional or three-dimensional space. Some systems also account for a series of data over time. In a spatial database, attributes about geographic features are stored and queried as traditional relational database structures. The spatial extensions allow you to query geometries using Structured Query Language (SQL) in a similar way to traditional database queries. Spatial queries and attribute queries can also be combined to select results based on both location and attributes.
Spatial indexing is a process that organizes geospatial vector data for faster retrieval. It is a way of prefiltering the data for common queries or rendering. Indexing is commonly used in large databases to speed up returns to queries. Spatial data is no different. Even a moderately-sized geodatabase can contain millions of points or objects. If you perform a spatial query, every point in the database must be considered by the system in order to include it or eliminate it in the results. Spatial indexing groups data in ways that allow large portions of the data set to be eliminated from consideration by doing computationally simpler checks before going into detailed and slower analysis of the remaining items.
Metadata is defined as data about data. Accordingly, geospatial metadata is data about geospatial data sets that provides traceability for the source and history of a data set as well as summary technical details. Metadata also provides long-term preservation of information holdings. Geospatial metadata can be represented by several possible standards. One of the most prominent standards is international standard ISO 19115-1, which includes hundreds of potential fields to describe a single geospatial data set. Example fields include spatial representation, temporal extent, and lineage. The primary use of metadata is cataloging data sets. Modern metadata can be ingested by geographic search engines making it potentially automatically discoverable by other systems. It also lists points of contact for a data set if you have questions. Metadata is an important support tool for geospatial analysts and adds credibility and accessibility to your work.
Map projections can be a challenge for new analysts. If you take any three-dimensional object and flatten it onto a plane, such as your screen or a sheet of paper, the object is distorted. Many grade school geography classes demonstrated this concept by having students peel an orange and then attempt to lay the peel flat on their desk to understand the resulting distortion. The same effect occurs when you take the round shape of the earth and project it onto a computer screen.
In geospatial analysis, you can manipulate this distortion to preserve common properties, such as area, scale, bearing, distance, or shape. There is no one-size-fits-all solution to map projections. The choice of projection is always a compromise of gaining accuracy in one dimension in exchange for error in another. Projections are typically represented as a set of over 40 parameters as either XML or a text format called Well-Known Text or WKT, used to define the transformation algorithm.
The International Association of Oil and Gas Producers maintains a registry of most known projections. The organization was formerly known as the EPSG. The entries in the registry are still known as EPSG codes. The EPSG maintained the registry as a common benefit for the oil and gas industry, which is a prolific user of geospatial analysis for energy exploration. At last count that registry contained over 5,000 entries.
As recently as 10 years ago, map projections were a primary concern for a geospatial analyst. Data storage was expensive, high-speed Internet was rare, and cloud computing didn't really exist. Geospatial data was typically exchanged among small groups working in separate areas of interest. The technology constraints at the time meant geospatial analysis was highly localized. Analysts would use the best projection for their area of interest. Data in different projections cannot be displayed on the same map because they represent two different models of the earth. Any time an analyst received data from a third party it had to be reprojected before using it with existing data. This process was tedious and time consuming. Most geospatial data formats do not provide a way to store the projection information. That information is stored in an ancillary file usually as text or XML. Because analysts didn't exchange data often, many people wouldn't bother defining projection information. Every analyst's nightmare was to come across an extremely valuable data set missing the projection information. It rendered the data useless. The coordinates in the file are just numbers and offer no clue to the projection. With over 5,000 choices it was nearly impossible to guess.
But now, thanks to modern software and the Internet making data exchange easier and more common, nearly every data format has added on a metadata format that defines the projection or places it in the file header if supported. Advances in technology have also allowed for global basemaps, which allow for more common uses of projections like the common Google Mercator projection used for Google Maps. Geospatial portal projects like OpenStreetMap.org and NationalAtlas.gov have consolidated data sets for much of the world in common projections. Modern geospatial software can also reproject data on the fly saving the analyst the trouble of pre-processing the data before using it.
Geographic data including points, lines, and polygons are stored numerically as one or more points, which come in (x,y) pairs or (x,y,z) tuples. The x represents the horizontal axis on a graph. The y represents the vertical axis. The z represents terrain elevation. In computer graphics, a computer screen is represented by an x and y axis. A z axis in not used because the computer screen is treated as a two-dimensional plane by most graphics software APIs.
Another important factor is screen coordinates versus world coordinates. Geographic data is stored in a coordinate system representing a grid overlaid on the earth, which is three-dimensional and round. Screen coordinates, also known as pixel coordinates, represent a grid of pixels on a flat, two-dimensional computer screen. Mapping x and y world coordinates to pixel coordinates is fairly straightforward and involves a simple scaling algorithm. However, if a z coordinate exists then a more complicated transform must be performed to map coordinates from 3D space to a 2D plane. These transformations can be computationally costly and therefore slow if not handled correctly.
In the case of remote sensing data, the challenge is typically file size. Even a moderately sized satellite image, compressed, can be tens, if not hundreds of megabytes. Images can be compressed using lossless or lossy methods. Lossless methods use tricks to reduce file size without discarding any data. Lossy compression algorithms reduce file size by reducing the amount of data in the image while avoiding a significant change in appearance of the image. Rendering an image on the screen can be computationally intensive. Most remote sensing file formats allow for storing multiple lower-resolution versions of the image, called overviews or pyramids, for the sole purpose of faster rendering at different scales. When zoomed out from the image to a scale where you couldn't see the detail of the full resolution image, a pre-processed, lower-resolution version of the image is displayed quickly and seamlessly.
Most of the GIS concepts described also apply to raster data. However, raster data has some unique properties as well. Earlier in this chapter in the history of remote sensing, the focus was on earth imaging from aerial platforms. It is important to note that raster data can come in many forms including ground-based radar, laser range finders, and other specialized devices for detecting gases, radiation, and other forms of energy within a geographic context. For the purpose of this book, we will focus on remote sensing platforms that capture large amounts of earth data. These sources included earth imaging systems but also certain types of elevation data, and some weather systems where applicable.
Raster data is captured digitally as square tiles. This means the data is stored on a computer as a numerical array of rows and columns. If the data is multispectral, the data set will usually contain multiple arrays of the same size, which are geospatially referenced together to represent a single area on the earth. These different arrays are called bands. Any numerical array can be represented on a computer as an image. In fact, all computer data is ultimately numbers. It is important in geospatial analysis to think of images as a numeric array because mathematical formulas are used to process them.
In remotely sensed images, each pixel represents both space (location on the earth of a certain size), and the reflectance captured as light reflected from the earth at that location into space. So each pixel has a ground size and contains a number representing the intensity. Because each pixel is a number, we can perform math equations on this data to combine data from different bands and highlight specific classes of objects in the image. And if the wavelength value is beyond the visible spectrum we can highlight features not visible to the human eye. Substances such as chlorophyll in plants can be greatly contrasted using a specific formula called the normalized vegetation differential index or NDVI.
By processing remotely sensed images, we can turn these data into visual information. Using the NDVI formula we can answer the question, What is the relative health of the plants in this image? But you can also create new types of digital information, which can be used as input for computer programs to output other types of information.
Computer screens display images as combinations of red, green, and blue (RGB) to match the capability of the human eye. Satellites and other remote sensing imaging devices can capture light beyond that visible spectrum. On a computer, wavelengths beyond the visible spectrum are represented in the visible spectrum so we can see them. In remote sensing, infrared light makes moisture highly visible. This phenomenon has a variety of uses such as monitoring ground saturation during a flood or finding hidden leaks in a roof or a levee.
This section will discuss different types of GIS processes commonly used in geospatial analysis. This list is not exhaustive; however, it provides the essential operations on which all other operations are based. If you understand these operations you can quickly understand much more complex processes as they are either derivatives or combinations of these processes.
GIS vector data uses coordinates consisting of, at a minimum, an x horizontal value and a y vertical value to represent a location on the earth. In many cases a point may also contain a z value. Other ancillary values are possible including measurements or timestamps.
These coordinates are used to form points, lines, and polygons to model real-world objects. Points can be a geometric feature in and of themselves, or they can connect line segments. Closed areas created by line segments are considered polygons. Polygons model objects such as buildings, terrain, or political boundaries.
A GIS feature can consist of a single point, line, or polygon or it can consist of more than one shape. For example, in a GIS polygon data set containing world country boundaries, the Philippines, which is made up of 7,107 islands, would be represented as a single country made up of thousands of polygons.
Vector data typically represents topographic features better than raster data. Vector data has better accuracy potential and is more precise. But vector data is also traditionally more costly to collect on a large scale than raster data.
Two other important terms related to vector data structures are bounding box and convex hull. The bounding box or minimum bounding box is the smallest possible square which contains all of the points in a data set. The following image demonstrates a bounding box for a collection of points:
The convex hull of a data set is similar to the bounding box but instead of a square it is the smallest possible polygon which can contain a data set. The bounding box of a data set always contains its convex hull. The following image shows the same point data as the previous example with the convex hull polygon shown in red:
A buffer operation can be applied to spatial objects including points, lines, or polygons. This operation creates a polygon around the object at a specified distance. Buffer operations are used for proximity analysis, for example, establishing a safety zone around a dangerous area. In the following image, the black shapes represent the original geometry while the red outlines represent the larger buffer polygons generated from the original shape:
A dissolve operation creates a single polygon out of adjacent polygons. A common use for a dissolve operation is to merge two adjacent properties in a tax database that have been purchased by a single owner. Dissolves are also used to simplify data extracted from remote sensing.
Objects which have more points than necessary for the geospatial model can be generalized to reduce the number of points used to represent the shape. This operation usually requires a few attempts to get the optimal number of points without compromising the overall shape. It is a data optimization technique to simplify data for computing efficiency or better visualization. This technique is useful in web-mapping applications. Computer screens have a resolution of 72 Dots Per Inch (dpi). Highly detailed point data, which would not be visible, can be reduced so less bandwidth is used to send a visually-equivalent map to the user.
An intersection operation is used to see if one part of a feature intersects with one or more features. This operation is for spatial queries in proximity analysis and is often a follow-on operation to a buffer analysis.
A merge operation combines two or more non-overlapping shapes into a single multishape object. Multishape objects mean the shapes maintain separate geometries but are treated as a single feature with a single set of attributes by the GIS.
A fundamental geospatial operation is checking to see if a point is inside a polygon. This one operation is the atomic building block of many different types of spatial queries. If the point is on the boundary of the polygon it is considered inside. Very few spatial queries exist that do not rely on this calculation in some way. It can be very slow on a large number of points, however.
The most common and efficient algorithm to detect if a point is inside a polygon is called the Ray Casting algorithm. First a test is performed to see if the point is on the polygon boundary. Next the algorithm draws a line from the point in question in a single direction. The program counts the number of times the line crosses the polygon boundary until it reaches the bounding box of the polygon. The bounding box is the smallest box which can be drawn around the entire polygon. If the number is odd, the point is inside. If the number of boundary intersections is even, the point is outside.
The union operation is less common but very useful for combining two or more overlapping polygons into a single shape. It is similar to the dissolve but in this case the polygons are overlapping as opposed to being adjacent. Usually this operation is used to clean up automatically generated feature data sets from remote sensing operations.
A join or SQL join is a database operation used to combine two or more tables of information. Relational databases are designed to avoid storing redundant information for one-to-many relationships. For example, a US state may contain many cities. Rather than creating a table for each state containing all of its cities, a table of states with numeric IDs is created, while a table for all cities in every state is created with a state numeric ID. In a spatial join operation the state and cities tables could be linked by state ID. In a GIS, you can also have spatial joins which are part of the spatial extension software for a database. In spatial joins, combine the attributes to two features the same way you do in an SQL join, but the relation is based on the spatial proximity of the two features. To follow the previous cities example, we could add the county name each city resides in using a spatial join. The cities layer could be loaded over a county polygon layer whose attributes contain the county name. The spatial join would determine which city is in which county and perform an SQL join to add the county name to each city's attribute row.
Polygons must have at least four points (no triangles)
A polygon boundary should not overlap itself
Polygons within a layer shouldn't overlap
A polygon within a layer inside another polygon is considered a hole in the underlying polygon
Different geospatial software packages and libraries handle exceptions to these rules differently and can lead to confusing errors or software behavior. The safest route is to make sure your polygons obey these rules. There is one more important piece of information about polygons. A polygon is by definition a closed shape, meaning the first and last vertices of the polygon are identical. Some geospatial software will throw an error if you haven't explicitly duplicated the first point as the last point in the polygon data set. Other software will automatically close the polygon without complaining. The data format you use to store your geospatial data may also dictate how polygons are defined. This issue is a gray area so it didn't make the polygon rules but knowing this quirk will come in handy someday when you run into an error you can't explain easily.
Remote sensing contains thousands of operations which can be performed on data. And this field changes on an almost daily basis as new satellites are put into space and computer power increases. Despite its decades long history, we haven't even scratched the surface of the knowledge this field can provide the human race. Once again, similar to the common GIS processes, this minimal list of operations gives you the basis to evaluate any technique used in remote sensing.
Band math is multidimensional array mathematics. In array math, arrays are treated as single units, which are added, subtracted, multiplied, and divided. But in an array the corresponding numbers in each row and column across multiple arrays are computed simultaneously.
Change detection is the process of taking two images of the same location at different times and highlighting the changes. A change can do the addition of something on the ground, such as a new building or the loss of a feature, such as coastal erosion. There are many algorithms for detecting changes among images and also determining qualitative factors, such as how long ago the change took place. The following image from a research project by the US Oak Ridge National Laboratory shows rainforest deforestation between 1984 and 2000 in the state of Rondonia, Brazil. Colors are used to show how recently the forest was cut. Green represents virgin rain forest, white is forest cut within 2 years of the end of the date range, red within 22 years, and the other colors fall in between as described in the legend:
A histogram is the statistical distribution of values in a data set. The horizontal axis represents a unique value in a data set while the vertical axis represents the frequency of that unique value within the raster. A histogram is a key operation in most raster processing. It can be used for everything from enhancing contrast in an image to serving as a basis for object classification and image comparison. The following example from NASA shows a histogram for a satellite image which has been classified into different categories representing the underlying surface feature:
Feature extraction is the manual or automatic digitization of features in an image to points, lines, or polygons. This process serves as the basis for the vectorization of images in which a raster is converted to a vector data set. An example of feature extraction is extracting a coastline from a satellite image and saving it as a vector data set. If this extraction is performed over several years you could monitor the erosion or other changes along that coastline.
Objects on the earth reflect different wavelengths of light depending on the material they are made of. In remote sensing, analysts collect wavelength signatures for specific types of land cover (for example, concrete) and build a library for a specific area. A computer can then use that library to automatically locate classes in that library in a new image of that same area.
Now that we have a better understanding of geospatial analysis, the next step is to build a simple GIS using Python called
SimpleGIS! This small program will be a technically complete GIS with a geographic data model and the ability to render the data as a visual thematic map showing population of different cities.
The data model will also be structured so that you can perform basic queries. Our
SimpleGIS will contain the state of Colorado, three cities, and population counts for each city.
Most importantly we will demonstrate the power and simplicity of Python programming by building this tiny system in pure Python. We will only use modules available within the standard Python distribution without downloading any third-party libraries.
The only module used in the following example is the Turtle module which provides a very simple graphics engine based on the Tkinter library included with Python. If you used the installers for Windows or Mac OS X the Tkinter library should be included already. If you compiled Python yourself or are using a distribution from somewhere besides Python.org then check to make sure you can import the Turtle module by typing the following at a command prompt to run the turtle demo script:
python âm turtle
If your Python distribution does not have Tkinter, you can find information on installing it from the following page. The information is for Python 2.3 but the process is still the same:
The official Python wiki page for Tkinter can be found here:
If you are new to Python, Dive into Python is a free online book, which covers all the basics of Python and will bring you up to speed:
The code is divided into two different sections. The first is the data model section and the second is the map renderer that draws that data. For the data model we will use simple Python lists. A Python list is a native data type, which serves as a container for other Python objects in a specified order. Python lists can contain other lists and are great for simple data structures. They also map well to more complex structures or even databases if you decide you want to develop your script further.
The second portion of the code will render the map using the Python Turtle graphics engine. We will have only one function in the GIS that converts the world coordinates, in this case longitude and latitude, into pixel coordinates. All graphics engines have an origin point of (0,0), which is usually in the top-left or lower-left corner of the canvas. Turtle graphics are designed to teach programming visually. The Turtle graphics canvas uses an origin of (0,0) in the center, similar to a graphing calculator. The following image illustrates the type of Cartesian graph the Turtle module uses. In the following graph, some points are plotted in both positive and negative space:
This also means the Turtle graphics engine can have negative pixel coordinates, which is uncommon for graphics canvases. But for this example the Turtle module is the quickest and simplest way to render our map.
You can run this program interactively in the Python interpreter or you can save the complete program as a script and run it. The Python interpreter is an incredibly powerful way to learn new concepts because it gives you real-time feedback on errors or unexpected program behavior. You can easily recover from these issues and try something else until you get the results you want.
In Python, you usually import modules at the beginning of the script so we'll import the Turtle module first. We'll use Python's
import as feature to assign the module the name
t to save space and time when typing Turtle commands:
import turtle as t
Next we'll set up the data model starting with some simple variables that allow us to access list indexes by name instead of numbers to make the code easier to follow. Python lists index the contained objects starting with the number 0. So if we want to access the first item in a list called
myList we would reference it like this:
But to make our code easier to read we can also use a variable name assigned to commonly used indexes:
firstItem = 0 myList[firstItem]
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
In computer science, assigning commonly used numbers to an easy-to-remember variable is a common practice. These variables are called constants.
So, for our example, we'll assign some constants for some common elements used for all the cities. All cities will have a name, one or more points, and a population count:
NAME = 0 POINTS = 1 POP = 2
state = ["COLORADO", [[-109, 37],[-109, 41],[-102, 41],[-102, 37]], 5187582]
The cities will be stored as nested lists. Each city's location consists of a single point as a longitude and latitude pair. These entries will complete our GIS data model. We'll start with an empty list called
cities and then append the data to that list for each city:
cities =  cities.append(["DENVER",[-104.98, 39.74], 634265]) cities.append(["BOULDER",[-105.27, 40.02], 98889]) cities.append(["DURANGO",[-107.88,37.28], 17069])
We will now render our GIS data as a map by first defining a map size. The width and height can be anything you want up to your screen resolution:
map_width = 400 map_height = 300
In order to scale the map to the graphics canvas, we must first determine the bounding box of the largest layer which is the state. We'll set the map bounding box to a global scale and reduce it to the size of the state. To do so we'll loop through the longitude and latitude of each point and compare it to the current minimum and maximum x and y values. If it is larger than the current maximum or smaller than the current minimum we'll make that value the new maximum or minimum respectively:
minx = 180 maxx = -180 miny = 90 maxy = -90 for x,y in state[POINTS]: if x < minx: minx = x elif x > maxx: maxx = x if y < miny: miny = y elif y > maxy: maxy = y
The second step to scaling is to calculate a ratio between the actual state and the tiny canvas we will render it upon. This ratio is used for coordinate to pixel conversion. We get the size along the x and y axis of the state and then we divide the map width and height by those numbers to get our scaling ratio:
dist_x = maxx - minx dist_y = maxy - miny x_ratio = map_width / dist_x y_ratio = map_height / dist_y
The following function called
convert() is our only function in
SimpleGIS. It transforms a point in map coordinates from one of our data layers to pixel coordinates using the previous calculations. You'll notice at the end, we divide the map width and height in half and subtract it from the final conversion to account for the unusual center origin of the Turtle graphics canvas. Every geospatial program has some form of this function:
def convert(point): lon = point lat = point x = map_width - ((maxx - lon) * x_ratio) y = map_height - ((maxy - lat) * y_ratio) # Python turtle graphics start in the middle of the screen # so we must offset the points so they are centered x = x - (map_width/2) y = y - (map_height/2) return [x,y]
Now for the exciting part! We're ready to render our GIS as a thematic map. The Turtle module uses the concept of a cursor called a pen. And moving the cursor around the canvas is exactly like moving a pen around a piece of paper. The cursor will draw a line when you move it. So you'll notice throughout the code we use the commands
t.down() to pick the pen up when we want to move to a new location, and put it down when we're ready to draw. We have some important steps in this section. Because the border of Colorado is a polygon, we must draw a line between the last point and the first point to close the polygon. We could also have left out the closing step and just added a duplicate point to the Colorado data set. Once we draw the state, we'll use the
write() method to label the polygon:
t.up() first_pixel = None for point in state[POINTS]: pixel = convert(point) if not first_pixel: first_pixel = pixel t.goto(pixel) t.down() t.goto(first_pixel) t.up() t.goto([0,0]) t.write(state[NAME], align="center", font=("Arial",16,"bold"))
If you do try to run the code you'll need to temporarily add the following line at the end, or the Tkinter window will close as soon as it finishes drawing.
Now we'll render the cities as point locations and label them with their name and population. Since the cities are a group of features in a list, we'll loop through them to render them. Instead of drawing lines by moving the pen around, we'll use the Turtle
dot() method to plot a small circle at the pixel coordinate returned by our
convert() function. We'll then label the dot with the city name and add the population. You'll notice we must convert the population number to a string in order to use it in the Turtle
write() method. To do so we use Python's built-in function
for city in cities: pixel = convert(city[POINTS]) t.up() t.goto(pixel) # Place a point for the city t.dot(10) # Label the city t.write(city[NAME] + ", Pop.: " + str(city[POP]), align="left") t.up()
Now we will perform one last operation to prove that we have created a real GIS. We will perform an attribute query on our data to determine which city has the largest population. Then we'll perform a spatial query to see which city lies the furthest west. Finally we'll print the answers to our questions on our thematic map page safely out of range of the map.
As our query engine we'll use Python's built-in
max() functions. These functions take a list as an argument and return the minimum and maximum values of that list. Because we are dealing with nested lists in our data model we'll take advantage of the
key argument in those functions. The key argument accepts a function that temporarily alters the list for evaluation before a final value is returned. In this case, we want to isolate the population values for comparison and then the points. We could write a whole new function to return the specified value but instead we can use Python's
lambda keyword. The lambda keyword defines an anonymous function that is used inline. Other Python functions can be used inline, for example, the string function:
str(), but they are not anonymous. This temporary function will isolate our value of interest.
So our first question is, which city has the largest population?
biggest_city = max(cities, key=lambda city:city[POP]) t.goto(0,-200) t.write("The biggest city is: " + biggest_city[NAME])
Next question: which city lies the furthest west?
western_city = min(cities, key=lambda city:city[POINTS]) t.goto(0,-220) t.write("The western-most city is: " + western_city[NAME])
In the preceding query, we use Python's built in
min() function to select the smallest longitude value which works because we represented our city locations as longitude, latitude pairs. It is possible to use different representations for points including possible representations where this code would need modification to work correctly. But for our
SimpleGIS we are using a common point representation to make it as intuitive as possible.
These last two commands are just for clean up purposes. First we hide the cursor. Then we call the Turtle
done() method, which will keep the turtle graphics window with our map open until we choose to close it using the close handle at the top of the window.
Whether you followed along using the Python interpreter or you ran the complete program as a script, you should see the following map rendered in real time:
Congratulations! You have followed in the footsteps of Palaeolithic hunters, the "Father of GIS" Dr. Roger Tomlinson, geospatial pioneer Howard Fisher, and big-data rock star, Nate Silver to create a functional, extensible and technically complete geographic information system. And it took less than 60 lines of pure Python code! You will be hard pressed to find a programming language that can create a complete GIS using only its core libraries in such a finite amount of readable code like Python. And even if you did, it is highly unlikely that language would survive the geospatial Python journey you'll take through the rest of this book.
As you can see there is lots of room for expansion of
SimpleGIS. Here are some other ways you might expand this simple tool using the reference material for Tkinter and Python linked at the beginning of this section:
Create an overview map in the top-right corner with a US border outline and Colorado's location in the US
Add color for visual appeal and further clarity
Create a map key for different features
Make a states and cities list and add more states and cities
Add a title to the map
Create a bar chart to compare population numbers visually
The possibilities are endless.
SimpleGIS can also be used as a way to quickly test and visualize geospatial algorithms you come across. If you wanted to add more data layers you could create more lists but these lists would become difficult to manage. In that case you could use another Python module included in the standard distribution. The
sqlite module provides a SQL-like database in Python that can be saved to disk or run in memory.
Well done! You are now a geospatial analyst. In this chapter you learned:
The state of the art of geospatial analysis
The history of geospatial analysis and related technologies
Core GIS concepts, which will guide understanding unfamiliar concepts
Core remote sensing concepts applied throughout geospatial analysis
Common GIS and remote sensing processes
How to build the simplest possible GIS that works
In the next chapter, we'll tackle the data formats you'll encounter as geospatial analysts. Geospatial analysts spend far more time dealing with data than actually performing analysis. Understanding the data you're working with is essential to working efficiently and having fun.