PostGIS Cookbook — Save 50%
Over 80 task-based recipes to store, organize, manipulate, and analyze spatial data in a PostGIS database with this book and ebook
In this article by Paolo Corti, Thomas J Kraft, Stephen Vincent Mather, and Bborie Park, authors of PostGIS Cookbook, you will learn how to make use of KNN filters to increase the performance of proximity queries.
PostGIS Cookbook uses a problem-solving approach to help you acquire a solid understanding of PostGIS. Hopefully, this book provides answers to some common spatial questions and gives you the inspiration and confidence to use and enhance PostGIS in finding solutions to challenging spatial problems.
(For more resources related to this topic, see here.)
The basic question that we seek to answer in this article is the fundamental distance question, "Which are the closest (name what you are searching for) to me?", for example, "Which are the five coffee shops closest to me?" It turns out that while it is a fundamental question, it's not always easy to answer, though we will make this possible in this article. We will approach this with two approaches. The first way in which we'll approach this is in a simple heuristic, which will allow us to come to a solution quickly. Then, we'll take advantage of the deeper PostGIS functionality to make the solution faster and more general with a K-Nearest Neighbor (KNN) approach.
A concept that we need to understand from the outset is that of a spatial index. A spatial index, like other database indexes, functions like a book index. It is a special construct to make looking for things inside our table easier, much in the way a book index helps us find content in a book faster. In the case of a spatial index, it helps us find faster where things are in space. Therefore, by using a spatial index in our geographic searches, we can speed up our searches by many orders of magnitude.
To learn more about spatial indexes, see http://en.wikipedia.org/wiki/Spatial_index#Spatial_index.
We will start by loading our data. Our data are the address records from Cuyahoga County, Ohio, USA.
shp2pgsql -s 3734 -d -i -I -W LATIN1 -g the_geom CUY_ADDRESS_POINTS chp04.knn_addresses | psql -U me -d postgis_cookbook
As this dataset may take a while to load, you can alternatively load a subset.
shp2pgsql -s 3734 -d -i -I -W LATIN1 -g the_geom CUY_ADDRESS_POINTS_ subset chp04.knn_addresses | psql -U me -d postgis_cookbook
We specified the -I flag in order to request a spatial index be created upon the import of this data.
Let us start by seeing how many records we are dealing with:
SELECT COUNT(*) FROM chp04.knn_addresses; --484958
We have, in this address table, almost half a million address records, which is not an insubstantial number against which to perform a query.
How to do it...
KNN is an approach to searching for an arbitrary number of points closest to a given point. Without the right tools, this can be a very slow process that requires testing the distance between the point of interest and all the possible neighbors. The problem with this approach is that the search becomes exponentially slower as the number of points increases. Let's start with this naïve approach and then improve upon it.
Suppose we were interested in finding 10 records closest to the geographic location, -81.738624, 41.396679. The naïve approach would be to transform this value into our local coordinate system and compare the distance to each point in the database from the search point, order those values by distance, and limit the search to the first 10 closest records (it is not recommended that you run the following query—it could run indefinitely)
SELECT ST_Distance(searchpoint.the_geom, addr.the_geom) AS dist, * FROM chp04.knn_addresses addr, (SELECT ST_Transform(ST_SetSRID(ST_MakePoint(-81.738624, 41.396679), 4326), 3734) AS the_geom) searchpoint ORDER BY ST_Distance(searchpoint.the_geom, addr.the_geom) LIMIT 10;
This is a fine approach for smaller datasets. This is a logical, simple, fast approach for relatively small numbers of records. This approach scales very poorly, however, getting exponentially slower with the addition of records, and with 500,000 points, this would take a very long time.
An alternative is to only compare my point to the ones I know are close by setting a search distance. So, for example, in the following diagram, we have a star that represents my current location, and I want to know the 10 closest addresses. The grid in the diagram is a 100 foot grid, so I can search for the points within 200 feet, then measure the distance to each of these points, and return the closest 10 points to my search location.
So far, our approach to answering this question is to limit the search using the ST_DWithin operator to only search for records within a certain distance. ST_DWithin uses our spatial index, so the initial distance search is fast and the list of returned records should be short enough to do the same pair-wise distance comparison we did earlier in this section. In our case here, we could limit the search to within 200 feet as follows.
SELECT ST_Distance(searchpoint.the_geom, addr.the_geom) AS dist, * FROM chp04.knn_addresses addr, (SELECT ST_Transform(ST_SetSRID(ST_MakePoint(-81.738624, 41.396679), 4326), 3734) AS the_geom) searchpoint WHERE ST_DWithin(searchpoint.the_geom, addr.the_geom, 200) ORDER BY ST_Distance(searchpoint.the_geom, addr.the_geom) LIMIT 10;
This approach performs well so long as our search window, ST_DWithin, is the right size for the data. The problem with this approach is that, in order to optimize it, we need to know how to set a search window that is about the right size. Any larger than the right size and the query will run more slowly than we'd like. Any smaller than the right size and we might not get all the points back that we need. Inherently, we don't know this ahead of time, so we can only hope for the best guess.
In this same dataset, if we apply the same query in another location, the output will return no points because the 10 closest points are further than 200 feet away. We can see this in the following diagram:
Fortunately, for PostGIS 2.0, we can leverage the distance operators (<-> and <#>) to do indexed nearest-neighbor searches. This makes for very fast KNN searches that don't require us to guess ahead of time how far away we need to search. Why are the searches fast? The spatial index helps, of course, but in the case of the distance operator, we are using the structure of the index itself, which is hierarchical, to very quickly sort our neighbors.
When used in an ORDER BY clause, the distance operator uses the index:
SELECT ST_Distance(searchpoint.the_geom, addr.the_geom) AS dist, * FROM chp04.knn_addresses addr, (SELECT ST_Transform(ST_SetSRID(ST_MakePoint(-81.738624, 41.396679), 4326), 3734) AS the_geom) searchpoint ORDER BY addr.the_geom <-> searchpoint.the_geom LIMIT 10;
This approach requires no a priori knowledge of how far the nearest neighbors might be. It also scales very well, returning thousands of records in not more than the time it takes to return a few records. It is sometimes slower than using ST_DWithin, depending on how small our search distance is and how large the dataset we are dealing with. But the tradeoff is that we don't need to make a guess as to our correct search distance and for large queries, it can be much faster than the naïve approach.
How it works...
What makes this magic possible is that PostGIS uses an R-Tree index. This means that the index itself is sorted hierarchically based on spatial information. As demonstrated, we can leverage the structure of the index in sorting distances from a given arbitrary location and, thus, use the index to directly return the sorted records. This means that the structure of the spatial index itself helps us answer such fundamental questions quickly and inexpensively.
More information about KNN and R-tree can be found at http://workshops.boundlessgeo.com/postgis-intro/knn.html and https://en.wikipedia.org/wiki/R-tree.
This article covered the introduction and learning of KNN filters to increase the performance of proximity queries. Now you must be quite familiar with the usage of KNN filters to increase the performance of proximity queries.
Resources for Article:
- PostgreSQL: Tips and Tricks [article]
- Backup in PostgreSQL 9 [article]
- Python Libraries for Geospatial Development [article]
|Over 80 task-based recipes to store, organize, manipulate, and analyze spatial data in a PostGIS database with this book and ebook|
eBook Price: £18.99
Book Price: £30.99
About the Author :
Bborie Park has been breaking (and subsequently fixing) computers for most of his life. His primary interests involve developing end-to-end pipelines for spatial datasets. He is an active contributor to the PostGIS project and is a member of the PostGIS Steering Committee. He happily resides with his wife Nicole in the San Francisco Bay Area.
Paolo Corti is based in Rome, Italy. He is an environmental engineer with more than 15 years of experience in the GIS sector. After working with proprietary solutions for some years, he has proudly switched to open-source technologies and Python for almost a decade.
He has been working as a developer and analyst for organizations such as the EU Joint Research Center, UN World Food Program, and the Italian Government.
Currently, he is working within the GeoNode project, for which he is a core developer, in the context of emergency preparedness and response.
He is an OSGeo Charter member and writes a blog on open-source GIS at http://www.paolocorti.net/.
He is the author of the book's chapters 1, 3, 8, and 9.
Stephen Vincent Mather has worked in the geospatial industry for 15 years, having always had a flair for geospatial analyses in general, especially those at the intersection of Geography and Ecology. His work in open-source geospatial databases started 5 years ago with PostGIS and he immediately began using PostGIS as an analytic tool, attempting a range of innovative and sometimes bleeding-edge techniques (although he admittedly prefers the cutting edge). His geospatial career has spanned a variety of interesting and novel natural-resource projects, everything from the movement of ice sheets in Antarctica to hiking viewsheds and mobile trail applications to help park users find trails, picnic areas, and restrooms.
Stephen is currently the GIS manager for Cleveland Metroparks in Cleveland, Ohio. He manages a small geospatial shop that specializes in high-end cartography, crating and generating data, geospatial web development, and analyses for natural-resource management, largely with open-source software.
Stephen is also a Mennonite technologist, aka a straw-hat hacker, interested in creating fair and open data and infrastructure for better governance and humanitarian purposes. He is heavily involved in the Cleveland Civic Hacking movement as he works with the public to help them get engaged with geospatial data. In his spare time, he builds guitars really, really slowly.
Thomas J Kraft is currently a Planning Technician at Cleveland Metroparks after beginning as a GIS intern in 2011.
He graduated with Honors from Cleveland State University in 2012, majoring in Environmental Science with an emphasis on GIS.
When not in front of a computer, he enjoys his weekends landscaping and the outdoors in general.