
Commit 1289ba0

mgoddard committed
Further updates to README and one more data prep. script
1 parent b8fdd75 commit 1289ba0

2 files changed, +98 -8 lines changed

README.md

+18-8
@@ -38,16 +38,27 @@ various sizes. A deeper discussion of this topic is available
 <img src="./mobile_view.png" width="360" alt="Running on iPhone">
 (App running in an iPhone, in Safari)
 
-## Setup
-
 The demo can be run locally, in a Docker container, or in K8s.
 
-[Data set](https://storage.googleapis.com/crl-goddard-gis/osm_1m_eu.txt.gz): 1m
-points from OpenStreetMap's Planet Dump, all in Europe
+## Data
+
+The data set (see below) is a sample of an extract of the OpenStreetMap
+Planet Dump which is accessible from [here](https://wiki.openstreetmap.org/wiki/Planet.osm).
+The `planet-latest.osm.pbf` file was downloaded (2020-08-01) and then processed
+using [Osmosis](https://github.com/openstreetmap/osmosis/releases) as
+documented in [this script](./osm/planet_osm_extract.sh). The bounding box
+specified for the extract was `--bounding-box top=72.253800 left=-12.666450 bottom=33.120960 right=34.225994`,
+corresponding to the area shown in the figure below. The result of this operation
+was a 36 GB Bzip'd XML file (not included here). This intermediate file was then
+processed using [this Perl script](./osm/extract_points_from_osm_xml.pl), with the
+result being piped through Gzip to produce a [smaller data
+set](https://storage.googleapis.com/crl-goddard-gis/osm_1m_eu.txt.gz) consisting of 1 million points.
+
+![Boundary of OSM data extract](./osm/OSM_extracted_region.jpg)
+
+[DDL and sample SQL queries](./osm/osm_crdb.sql): The data set is loaded into
+one table which has a primary key and one secondary index. Here is the DDL:
 
-[DDL and sample SQL queries](./osm_crdb.sql): The above mentioned data set is
-loaded into one table which has a primary key and one secondary index. Here is
-the DDL:
 ```
 DROP TABLE IF EXISTS osm;
 CREATE TABLE osm
@@ -63,7 +74,6 @@ CREATE TABLE osm
 );
 CREATE INDEX ON osm USING GIN(ref_point);
 ```
-
 **NOTE:** `./load_osm_stdin.py` will create this table and GIN index if they don't already exist.
 
 Load the data (see above) using [this script](./load_osm_stdin.py) as follows,
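
Reviewer note: the bounding box quoted in the README hunk above can be sanity-checked against individual points. The following is a minimal Python sketch, not part of the commit. The `BoundingBox` type, the `contains` method, and the `EUROPE_EXTRACT` name are hypothetical. The inclusive comparison on all four edges is an assumption and may not match how Osmosis treats nodes lying exactly on the boundary.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Latitude/longitude box using the same four edge names as the README."""
    top: float
    left: float
    bottom: float
    right: float

    def contains(self, lat: float, lon: float) -> bool:
        # Assumption: edges are inclusive and the box does not cross
        # the antimeridian (true for the values quoted in the README).
        return (self.bottom <= lat <= self.top
                and self.left <= lon <= self.right)


# Values copied from the --bounding-box arguments in the README hunk.
EUROPE_EXTRACT = BoundingBox(
    top=72.253800, left=-12.666450, bottom=33.120960, right=34.225994
)

if __name__ == "__main__":
    # First sample node from the Perl script's __DATA__ section.
    print(EUROPE_EXTRACT.contains(50.8052, -1.67253))
    # A point well outside Europe (New York City).
    print(EUROPE_EXTRACT.contains(40.7128, -74.0060))
```

A check like this could be used to spot rows in the published data set that fall outside the documented extract region.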

osm/extract_points_from_osm_xml.pl

+80
@@ -0,0 +1,80 @@
+#!/usr/bin/env perl
+
+use strict;
+use Geo::Hash;
+
+my $num_to_find = 100_000_000;
+
+my $col_sep = "<"; # Illegal in the XML, so safe to use as delimiter
+my @col_names = qw( id timestamp uid lat lon name key_value );
+my %cols = ();
+$/ = "</node>\n";
+
+sub reset_cols {
+  %cols = map { $_ => undef } @col_names;
+}
+
+# - Get a line, a "...</node>", look for leading "<node [^/]+?>", parse it.
+# - Split the remainder on the "\n", looking at each "<tag ..>", to fill %col.
+# - Only print the result if the place has a name.
+# - Any other "<tag ..." data will be placed in final column, "key_value": k1=v1|k2=v2|...
+
+reset_cols();
+my $num_found = 0;
+my $gh = Geo::Hash->new;
+while (<>)
+{
+  last if $num_found >= $num_to_find;
+  my $geohash = "";
+  if (/<node +id="(\d+)" +version="\d+" +timestamp="([^"]+)" +uid="(\d+)" +user="[^"]+" +changeset="\d+" +lat="([^"]+)" +lon="([^"]+)">/) {
+    $cols{id} = $1;
+    $cols{timestamp} = $2;
+    $cols{uid} = $3;
+    $cols{lat} = $4;
+    $cols{lon} = $5;
+    my $lat_lon = join(',', $4, $5);
+    $geohash = $gh->encode($cols{lat}, $cols{lon});
+
+    my @key_value = ();
+    foreach my $tag (split /\n/)
+    {
+      if ($tag =~ m~<tag +k="([^"]+)" v="([^"]+)"\s*/>~) {
+        if (exists $cols{$1}) {
+          $cols{$1} = $2;
+        } else {
+          (my $k = $1) =~ s/[\|=]/~/g;
+          (my $v = $2) =~ s/[\|=]/~/g;
+          push @key_value, $k . "=" . $v;
+        }
+      }
+    }
+    # Append a few geohash substrings
+    for (my $i = 3; $i < 7; $i++)
+    {
+      push(@key_value, substr($geohash, 0, $i));
+    }
+    $cols{key_value} = join("|", @key_value);
+  }
+  if ($cols{name}) {
+    $num_found += 1;
+    print join($col_sep, @cols{@col_names}, $geohash) . "\n";
+    print STDERR "N rows: $num_found\n" if $num_found % 1000 == 0;
+  }
+  reset_cols();
+}
+
+__DATA__
+<node id="271251" version="4" timestamp="2009-09-09T22:16:06Z" uid="169366" user="Tunafish" changeset="2430548" lat="50.8052" lon="-1.67253">
+  <tag k="name" v="Station House"/>
+  <tag k="amenity" v="restaurant"/>
+  <tag k="cuisine" v="tea;restaurant"/>
+</node>
+<node id="4082701" version="4" timestamp="2013-10-22T06:40:09Z" uid="453141" user="ppr9" changeset="18480867" lat="52.1602045" lon="-0.4921953">
+  <tag k="name" v="Bellini&apos;s"/>
+  <tag k="amenity" v="restaurant"/>
+  <tag k="wheelchair" v="yes"/>
+  <tag k="addr:street" v="High Street"/>
+  <tag k="addr:postcode" v="MK41 6EG"/>
+  <tag k="addr:housenumber" v="44,46"/>
+</node>
+
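
Reviewer note: the script above prints one `<`-delimited row per named node. Each row holds the seven columns in `@col_names` followed by the full geohash, and the `key_value` column holds `k=v` pairs plus geohash prefixes of length 3 to 6, joined by `|`. The Python sketch below shows one way a consumer might read such a row. It is not part of the commit. All function names (`geohash_encode`, `geohash_prefixes`, `parse_row`, `sample_row`) are hypothetical. The pure-Python encoder implements the standard geohash algorithm with a fixed precision of 12, which is an assumption and may differ from the precision `Geo::Hash` chooses by default.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"
COLUMNS = ("id", "timestamp", "uid", "lat", "lon",
           "name", "key_value", "geohash")


def geohash_encode(lat, lon, precision=12):
    """Standard geohash: interleave longitude/latitude bisection bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, value = [], 0, 0
    use_lon = True  # the first bit of a geohash refines longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | int(bit)
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def geohash_prefixes(gh, shortest=3, longest=6):
    """Mirror the Perl loop `for ($i = 3; $i < 7; $i++)`."""
    return [gh[:n] for n in range(shortest, longest + 1)]


def parse_row(line):
    """Split one output row into columns, tags, and geohash prefixes."""
    fields = line.rstrip("\n").split("<")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    tags, prefixes = {}, []
    for item in row["key_value"].split("|"):
        if "=" in item:
            # The script rewrites '|' and '=' inside keys and values to '~',
            # so the first '=' always separates key from value.
            key, val = item.split("=", 1)
            tags[key] = val
        elif item:
            prefixes.append(item)
    row["tags"] = tags
    row["geohash_prefixes"] = prefixes
    return row


def sample_row():
    """Build a row shaped like the output for the first __DATA__ node."""
    gh = geohash_encode(50.8052, -1.67253)
    key_value = "|".join(["amenity=restaurant", "cuisine=tea;restaurant"]
                         + geohash_prefixes(gh))
    return "<".join(["271251", "2009-09-09T22:16:06Z", "169366", "50.8052",
                     "-1.67253", "Station House", key_value, gh]) + "\n"


if __name__ == "__main__":
    parsed = parse_row(sample_row())
    print(parsed["name"], parsed["tags"], parsed["geohash_prefixes"])
```

Note that the script does not decode XML entities, so a value such as `Bellini&apos;s` would reach a consumer still escaped. Decoding it, if wanted, is left to the loader.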
