Further updates to README; add one more data preparation script

mgoddard · mgoddard · commit 1289ba04e08e · 2020-11-12T05:50:36.000-05:00
diff --git a/README.md b/README.md
@@ -38,16 +38,27 @@ various sizes.  A deeper discussion of this topic is available
 <img src="./mobile_view.png" width="360" alt="Running on iPhone">
 (App running in an iPhone, in Safari)
 
-## Setup
-
 The demo can be run locally, in a Docker container, or in K8s.
 
-[Data set](https://storage.googleapis.com/crl-goddard-gis/osm_1m_eu.txt.gz): 1m
-points from OpenStreetMap's Planet Dump, all in Europe
+## Data
+
+The data set (see below) is a sample of an extract of the OpenStreetMap
+[Planet Dump](https://wiki.openstreetmap.org/wiki/Planet.osm).
+The `planet-latest.osm.pbf` file was downloaded on 2020-08-01 and then processed
+using [Osmosis](https://github.com/openstreetmap/osmosis/releases) as
+documented in [this script](./osm/planet_osm_extract.sh).  The bounding box
+specified for the extract was `--bounding-box top=72.253800 left=-12.666450 bottom=33.120960 right=34.225994`,
+corresponding to the area shown in the figure below.  The result of this operation
+was a 36 GB bzip2-compressed XML file (not included here).  This intermediate file was then
+processed using [this Perl script](./osm/extract_points_from_osm_xml.pl), with the
+output piped through gzip to produce a [smaller data
+set](https://storage.googleapis.com/crl-goddard-gis/osm_1m_eu.txt.gz) consisting of 1 million points.
+
+![Boundary of OSM data extract](./osm/OSM_extracted_region.jpg)
+
+[DDL and sample SQL queries](./osm/osm_crdb.sql): The data set is loaded into
+one table which has a primary key and one secondary index.  Here is the DDL:
 
-[DDL and sample SQL queries](./osm_crdb.sql): The above mentioned data set is
-loaded into one table which has a primary key and one secondary index.  Here is
-the DDL:
 ```
 DROP TABLE IF EXISTS osm;
 CREATE TABLE osm
@@ -63,7 +74,6 @@ CREATE TABLE osm
 );
 CREATE INDEX ON osm USING GIN(ref_point);
 ```
-
 **NOTE:** `./load_osm_stdin.py` will create this table and GIN index if they don't already exist.
 
 Load the data (see above) using [this script](./load_osm_stdin.py) as follows,
diff --git a/osm/extract_points_from_osm_xml.pl b/osm/extract_points_from_osm_xml.pl
@@ -0,0 +1,80 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+use Geo::Hash;
+my $num_to_find = 100_000_000;
+
+my $col_sep = "<"; # Illegal in the XML, so safe to use as delimiter
+my @col_names = qw( id timestamp uid lat lon name key_value );
+my %cols = ();
+$/  = "</node>\n";
+
+sub reset_cols {
+  %cols = map { $_ => undef } @col_names;
+}
+
+# - Get a line, a "...</node>", look for leading "<node [^/]+?>", parse it.
+# - Split the remainder on the "\n", looking at each "<tag ..>", to fill %cols.
+# - Only print the result if the place has a name.
+# - Any other "<tag ..." data will be placed in final column, "key_value": k1=v1|k2=v2|...
+
+reset_cols();
+my $num_found = 0;
+my $gh = Geo::Hash->new;
+while (<>)
+{
+  last if $num_found >= $num_to_find;
+  my $geohash = "";
+  if (/<node +id="(\d+)" +version="\d+" +timestamp="([^"]+)" +uid="(\d+)" +user="[^"]+" +changeset="\d+" +lat="([^"]+)" +lon="([^"]+)">/) {
+    $cols{id} = $1;
+    $cols{timestamp} = $2;
+    $cols{uid} = $3;
+    $cols{lat} = $4;
+    $cols{lon} = $5;
+    # Geohash of the node's location; its prefixes are appended to key_value below
+    $geohash = $gh->encode($cols{lat}, $cols{lon});
+
+    my @key_value = ();
+    foreach my $tag (split /\n/)
+    {
+      if ($tag =~ m~<tag +k="([^"]+)" v="([^"]+)"\s*/>~) {
+        if (exists $cols{$1}) {
+          $cols{$1} = $2;
+        } else {
+          (my $k = $1) =~ s/[\|=]/~/g;
+          (my $v = $2) =~ s/[\|=]/~/g;
+          push @key_value, $k . "=" . $v;
+        }
+      }
+    }
+    # Append the geohash prefixes of length 3 through 6
+    for my $i (3 .. 6)
+    {
+      push(@key_value, substr($geohash, 0, $i));
+    }
+    $cols{key_value} = join("|", @key_value);
+  }
+  if ($cols{name}) {
+    $num_found += 1;
+    print join($col_sep, @cols{@col_names}, $geohash) . "\n";
+    print STDERR "N rows: $num_found\n" if $num_found % 1000 == 0;
+  }
+  reset_cols();
+}
+
+__DATA__
+<node id="271251" version="4" timestamp="2009-09-09T22:16:06Z" uid="169366" user="Tunafish" changeset="2430548" lat="50.8052" lon="-1.67253">
+    <tag k="name" v="Station House"/>
+    <tag k="amenity" v="restaurant"/>
+    <tag k="cuisine" v="tea;restaurant"/>
+</node>
+<node id="4082701" version="4" timestamp="2013-10-22T06:40:09Z" uid="453141" user="ppr9" changeset="18480867" lat="52.1602045" lon="-0.4921953">
+    <tag k="name" v="Bellini&apos;s"/>
+    <tag k="amenity" v="restaurant"/>
+    <tag k="wheelchair" v="yes"/>
+    <tag k="addr:street" v="High Street"/>
+    <tag k="addr:postcode" v="MK41 6EG"/>
+    <tag k="addr:housenumber" v="44,46"/>
+</node>
+
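
For readers consuming the output of `extract_points_from_osm_xml.pl`, the row layout it prints can be sketched in Python as follows. This is a minimal illustration, not the repo's `load_osm_stdin.py`: the function name `parse_row`, the `tags` and `geohash_prefixes` keys, and the sample line are all hypothetical, and the geohash in the sample is a placeholder rather than the real value for those coordinates. It assumes each row has the eight `<`-separated fields the Perl script joins (the seven column names plus the full geohash), and that values are still XML-escaped because the script does not unescape them.

```python
import html

# Column order printed by the Perl script: @col_names followed by the geohash.
COL_NAMES = ["id", "timestamp", "uid", "lat", "lon", "name", "key_value", "geohash"]


def parse_row(line):
    """Split one '<'-delimited output row into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("<")
    if len(fields) != len(COL_NAMES):
        raise ValueError(f"expected {len(COL_NAMES)} fields, got {len(fields)}")
    row = dict(zip(COL_NAMES, fields))

    row["id"] = int(row["id"])
    row["uid"] = int(row["uid"])
    row["lat"] = float(row["lat"])
    row["lon"] = float(row["lon"])
    # Attribute values are passed through still XML-escaped (e.g. &apos;).
    row["name"] = html.unescape(row["name"])

    # key_value mixes "k=v" tag pairs with bare geohash prefixes, all joined by "|".
    tags, prefixes = {}, []
    for item in row.pop("key_value").split("|"):
        if "=" in item:
            key, value = item.split("=", 1)
            tags[key] = html.unescape(value)
        elif item:
            prefixes.append(item)
    row["tags"] = tags
    row["geohash_prefixes"] = prefixes
    return row


# Sample row modelled on the first node in the script's __DATA__ section.
# "0123456" is a placeholder geohash, NOT the real one for this location.
SAMPLE = (
    "271251<2009-09-09T22:16:06Z<169366<50.8052<-1.67253<Station House<"
    "amenity=restaurant|cuisine=tea;restaurant|012|0123|01234|012345<0123456"
)

if __name__ == "__main__":
    print(parse_row(SAMPLE))
```

Because the script replaces any `|` or `=` inside tag keys and values with `~`, splitting on `|` and then on the first `=` recovers the pairs, although the original characters cannot be restored.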