Production Concerns

Geohashing & Quadtrees

● Advanced ⏱ 13 min read production

Location-based services — find nearby drivers, restaurants within 2 km, points of interest on a map — require efficient proximity queries over geographic data. A naive approach (calculate distance from every location in the database) doesn’t scale. Geohashing and quadtrees are two spatial indexing techniques that solve this by transforming 2D geographic coordinates into structures that standard databases and distributed systems can query efficiently.

The Proximity Search Problem

Given a user at latitude 37.7749, longitude −122.4194 (San Francisco), find all restaurants within 5 km. The database contains 10 million restaurant locations worldwide.

A brute-force approach — compute the haversine distance from the user to every restaurant and filter those under 5 km — requires 10 million distance calculations per query. At scale, with thousands of concurrent users, this is unusable.

A naive SQL bounding box (WHERE lat BETWEEN x1 AND x2 AND lng BETWEEN y1 AND y2) helps, but a 2D range query on two separate indexed columns requires the database to intersect two range scans — much less efficient than a single-column index lookup. The ideal solution converts the 2D problem into a 1D problem that standard B-tree indexes can handle efficiently.

Geohashing

Geohashing encodes a latitude/longitude pair into a short alphanumeric string by recursively dividing the world into a grid of rectangular cells. The encoding works as follows:

  1. Start with the entire world: latitude −90 to 90, longitude −180 to 180.
  2. Divide into two halves along longitude. If the point is in the left half, encode 0; right half, encode 1.
  3. Divide into two halves along latitude. Bottom half → 0; top half → 1.
  4. Repeat, alternating longitude and latitude splits, encoding a bit at each step.
  5. Group the bits in 5-bit chunks and map each chunk to a Base32 character (0–9, b–z minus a, i, l, o).

The result: a string where longer strings represent smaller, more precise cells, and any prefix of the string represents the containing larger cell.

San Francisco (37.7749, -122.4194):
  Geohash: 9q8yy   (precision 5 — roughly 4.9 km × 4.9 km cell)
  Geohash: 9q8yyh  (precision 6 — roughly 1.2 km × 0.6 km cell)
  Geohash: 9q8yyhb (precision 7 — roughly 153 m × 153 m cell)

Points in the same geohash cell share the same prefix. All restaurants with geohash starting with 9q8yy are in the same roughly 5 km × 5 km area. A string prefix query (WHERE geohash LIKE '9q8yy%') on a B-tree index is a single efficient range scan.

Geohashing divides the world into a hierarchical grid; quadtrees subdivide space adaptively based on point density

Precision and Cell Size

Precision (chars)Cell widthCell heightUse case
1~5,000 km~5,000 kmCountry-level
3~156 km~156 kmState/region level
5~4.9 km~4.9 kmNeighborhood level
6~1.2 km~0.6 kmCity block level
7~153 m~153 mStreet-level
9~2.4 m~4.8 mBuilding-level

Choose precision based on your query radius. For "find restaurants within 1 km," precision 6 (1.2 km cells) is appropriate — query the user’s cell and its 8 neighbors to cover the full radius. For "find drivers within 500 m," precision 7 (153 m cells) with neighbor expansion works well.

A critical property of geohashing: two nearby points share a long common prefix — but not always. A point near a cell boundary has a similar geohash to other points in its own cell, but a very different geohash from points just across the boundary in a neighboring cell, even if those points are physically very close.

The solution: always query the target cell and its 8 neighbors. Every geohash cell has 8 adjacent cells, and their geohash strings can be computed directly from the target geohash.

// Python pseudocode
def nearby_restaurants(lat, lng, radius_km):
    precision = geohash_precision_for_radius(radius_km)  # e.g., 6
    center_hash = geohash.encode(lat, lng, precision)
    neighbors = geohash.neighbors(center_hash)  # 8 adjacent cells
    cells_to_query = [center_hash] + neighbors   # 9 cells total

    results = db.query(
        "SELECT * FROM restaurants WHERE geohash6 IN (?)",
        cells_to_query
    )
    # Filter by exact distance to exclude corners of bounding cells
    return [r for r in results if haversine(lat, lng, r.lat, r.lng) <= radius_km]

The database query uses a fast IN-list lookup on a geohash column index, returning a small candidate set. The final haversine distance filter eliminates false positives from cell corners. This two-step approach — coarse index filter then precise distance check — is efficient even at scale.

Quadtrees

A quadtree is a tree data structure where each node represents a rectangular region of space and has exactly four children, each representing one quadrant (NW, NE, SW, SE). The tree is built adaptively: a region is subdivided only when it contains more than a threshold number of points.

Building a quadtree:

  1. Start with one node covering the entire world.
  2. Insert points one by one. When a node exceeds the threshold (e.g., 100 points), split it into four child nodes and distribute its points among them.
  3. Repeat until all nodes are under the threshold or the minimum cell size is reached.

The key difference from geohashing: quadtrees adapt to data density. Dense urban areas are subdivided many times (fine-grained cells); sparse rural areas remain as large coarse cells. Geohash cells are always the same size at a given precision, regardless of how many points they contain.

Proximity search with a quadtree: Traverse the tree from the root, descending into any node whose bounding box intersects the search radius. Collect all points in intersecting leaf nodes, then filter by exact distance. The traversal is O(log N + k) where k is the number of candidate points.

Quadtrees are typically stored in memory (the entire tree for a city fits comfortably in a few hundred MB) or in a distributed cache. They’re especially suited for workloads with highly uneven geographic distribution — which is most real-world location data.

Geohash vs Quadtree

GeohashQuadtree
StorageColumn in database (string)In-memory tree or dedicated store
Cell sizeFixed per precision levelAdaptive to data density
Index typeB-tree index on geohash stringTree traversal in memory
Query complexityIN-list lookup (9 cells) + filterTree traversal + filter
Database integrationEasy — standard SQL columnHarder — requires in-memory structure
Uneven densityHot cells with many pointsNaturally handles dense regions
UpdatesSimple column updateTree rebalancing on insert/delete

Real-World Use

Uber: Uses a geohash-based approach to partition the city into cells. Drivers broadcast their location updates; the dispatch system routes requests to drivers in the same cell or adjacent cells. The geohash cell is also the unit of surge pricing calculation.

Yelp / Google Maps: Geohash or similar spatial indexing for "restaurants near me" queries. PostGIS (the PostgreSQL geospatial extension) uses R-trees internally, a generalization of quadtrees, for spatial queries.

Twitter / Instagram: Geohash-based location tagging allows fetching posts geotagged within a region efficiently — prefix search on geohash covers the bounding box.

PostGIS and spatial databases: PostgreSQL + PostGIS, MySQL spatial, and MongoDB support geospatial indexes natively — under the hood, these use R-trees or similar space-partitioning structures. Use these for general-purpose geospatial queries rather than implementing your own quadtree.

Design Considerations