[SEDONA-713] add OSM PBF reader (#1823)
* Add OSM PBF reader.

Add documentation.

Add documentation.

Add documentation.

Add documentation.

Add documentation.

SEDONA-713 moving to common.

* SEDONA-713 Add docs.

* SEDONA-713 Add docs.

* SEDONA-713 Add docs.

* SEDONA-714 Add docs.
Imbruced authored Feb 26, 2025
1 parent ebd6f67 commit fba7d75
Showing 39 changed files with 22,193 additions and 110 deletions.
1 change: 1 addition & 0 deletions .github/linters/codespell.txt
Expand Up @@ -12,6 +12,7 @@ limite
nd
ois
parm
pavin
pixelx
refere
ser
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Expand Up @@ -125,7 +125,7 @@ repos:
name: run codespell
description: check spelling with codespell
args: [--ignore-words=.github/linters/codespell.txt]
exclude: ^docs/image|^spark/common/src/test/resources|^docs/usecases|^tools/maven/scalafmt
exclude: ^docs/image|^spark/common/src/test/resources|^docs/usecases|^tools/maven/scalafmt|osmpbf/build
- repo: https://github.com/gitleaks/gitleaks
rev: v8.23.3
hooks:
96 changes: 96 additions & 0 deletions docs/tutorial/sql.md
Expand Up @@ -757,6 +757,102 @@ You can also load data from raster tables in the geopackage file. To load raster
+---+----------+-----------+--------+--------------------+
```

## Load from OSM PBF

Since v1.7.1, Sedona supports loading files in the OSM PBF format as a DataFrame.

=== "Scala"

```scala
val df = sedona.read.format("osmpbf").load("/path/to/osmpbf")
```

=== "Java"

```java
Dataset<Row> df = sedona.read().format("osmpbf").load("/path/to/osmpbf");
```

=== "Python"

```python
df = sedona.read.format("osmpbf").load("/path/to/osmpbf")
```

OSM PBF files can contain nodes, ways, and relations. Currently, Sedona supports
DenseNodes, Ways, and Relations. When you load the data, you get a DataFrame with the following schema:

```
root
|-- id: long (nullable = true)
|-- kind: string (nullable = true)
|-- location: struct (nullable = true)
| |-- longitude: double (nullable = true)
| |-- latitude: double (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- refs: array (nullable = true)
| |-- element: long (containsNull = true)
|-- ref_roles: array (nullable = true)
| |-- element: string (containsNull = true)
|-- ref_types: array (nullable = true)
| |-- element: string (containsNull = true)
```

Where:

- `id` is the unique identifier of the object.
- `kind` is the type of the object; it can be `node`, `way`, or `relation`.
- `location` holds the `longitude` and `latitude` of the object.
- `tags` is a map of key-value pairs that represent the tags of the object.
- `refs` is an array of the IDs of the objects that a way or relation references.
- `ref_roles` is an array of the roles of those references.
- `ref_types` is an array of the types of those references.

The DataFrame might look like this for nodes:

```
+---------+----+--------------------+--------------------+----+---------+---------+
| id|kind| location| tags|refs|ref_roles|ref_types|
+---------+----+--------------------+--------------------+----+---------+---------+
|248675410|node|{21.0884952545166...|{tactile_paving -...|NULL| NULL| NULL|
|260821820|node|{21.0191555023193...|{created_by -> JOSM}|NULL| NULL| NULL|
|349189665|node|{22.1437530517578...|{source -> http:/...|NULL| NULL| NULL|
|353366899|node|{22.9787712097167...|{source -> http:/...|NULL| NULL| NULL|
|359460224|node|{22.4816703796386...|{source -> http:/...|NULL| NULL| NULL|
+---------+----+--------------------+--------------------+----+---------+---------+
only showing top 5 rows
```

and for ways:

```
+-------+----+--------+--------------------+--------------------+---------+---------+
| id|kind|location| tags| refs|ref_roles|ref_types|
+-------+----+--------+--------------------+--------------------+---------+---------+
|4307329| way| NULL|{junction -> roun...|[2448759046, 7093...| NULL| NULL|
|4307330| way| NULL|{surface -> aspha...|[26063923, 260639...| NULL| NULL|
|4308966| way| NULL|{sidewalk -> sepa...|[3387797238, 9252...| NULL| NULL|
|4308968| way| NULL|{surface -> pavin...|[26083890, 744724...| NULL| NULL|
|4308969| way| NULL|{cycleway:both ->...|[9526831176, 1218...| NULL| NULL|
+-------+----+--------+--------------------+--------------------+---------+---------+
```
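
In the OSM data model, a way is an ordered list of node IDs, so the `refs` of a way row can be resolved against the `id` and `location` of node rows. The following is a minimal sketch in plain Python, with dicts standing in for DataFrame rows. The sample rows are invented for illustration, and `way_coordinates` is a hypothetical helper, not a Sedona function.

```python
def way_coordinates(way, nodes_by_id):
    """Return the (longitude, latitude) pairs of a way, in ref order.

    Refs without a matching node are skipped, since an extract may not
    contain every node that a way refers to.
    """
    coords = []
    for ref in way["refs"]:
        node = nodes_by_id.get(ref)
        if node is not None:
            loc = node["location"]
            coords.append((loc["longitude"], loc["latitude"]))
    return coords


# Illustrative rows only: IDs and coordinates are invented for this sketch.
nodes = [
    {"id": 1, "kind": "node", "location": {"longitude": 21.0, "latitude": 52.0}},
    {"id": 2, "kind": "node", "location": {"longitude": 21.5, "latitude": 52.5}},
]
way = {"id": 10, "kind": "way", "location": None, "refs": [2, 1, 3]}

# Index node rows by id so that each ref lookup is a dictionary access.
nodes_by_id = {n["id"]: n for n in nodes}
coords = way_coordinates(way, nodes_by_id)
```

On a real DataFrame, the same lookup would more likely be expressed as a join between exploded way `refs` and the node rows; the dictionary version only shows the relationship between the columns.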

and for relations:

```
+-----+--------+--------+--------------------+--------------------+--------------------+--------------------+
| id| kind|location| tags| refs| ref_roles| ref_types|
+-----+--------+--------+--------------------+--------------------+--------------------+--------------------+
|28124|relation| NULL|{official_name ->...|[26382394, 26259985]| [inner, outer]| [WAY, WAY]|
|28488|relation| NULL| {type -> junction}|[26409253, 303249...|[roundabout, roun...|[WAY, WAY, WAY, WAY]|
|32939|relation| NULL|{ref -> E 67, rou...|[140673970, 14067...| [, , , , , ]|[WAY, WAY, RELATI...|
|34387|relation| NULL|{note -> rząd III...|[209161000, 52154...|[main_stream, mai...|[WAY, WAY, WAY, W...|
|34392|relation| NULL|{distance -> 1047...|[150033976, 25076...|[main_stream, mai...|[WAY, WAY, WAY, W...|
+-----+--------+--------+--------------------+--------------------+--------------------+--------------------+
```
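
For relations, `refs`, `ref_roles`, and `ref_types` appear to be parallel arrays: the entries at the same position together describe one member, as the first row above suggests. The following is a minimal sketch in plain Python, with a dict standing in for a DataFrame row. `relation_members` is a hypothetical helper, not part of Sedona.

```python
def relation_members(row):
    """Pair up the parallel refs / ref_roles / ref_types arrays of one row."""
    refs = row.get("refs") or []
    roles = row.get("ref_roles") or []
    types = row.get("ref_types") or []
    # Assumption of this sketch: the three arrays always have the same length.
    if not (len(refs) == len(roles) == len(types)):
        raise ValueError("refs, ref_roles and ref_types must have equal length")
    return [
        {"ref": ref, "role": role, "type": typ}
        for ref, role, typ in zip(refs, roles, types)
    ]


# Values copied from the first relation row shown above (id 28124).
row = {
    "id": 28124,
    "kind": "relation",
    "refs": [26382394, 26259985],
    "ref_roles": ["inner", "outer"],
    "ref_types": ["WAY", "WAY"],
}
members = relation_members(row)
```

Rows whose arrays are `NULL`, such as nodes, yield an empty member list in this sketch.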

Known limitations (v1.7.0):

- webp rasters are not supported
2 changes: 2 additions & 0 deletions pom.xml
Expand Up @@ -765,6 +765,7 @@
<module>common</module>
<module>spark</module>
<module>spark-shaded</module>
<module>shade-proto</module>
</modules>
</profile>
<profile>
Expand All @@ -788,6 +789,7 @@
<module>flink</module>
<module>flink-shaded</module>
<module>snowflake</module>
<module>shade-proto</module>
</modules>
</profile>
<profile>
89 changes: 89 additions & 0 deletions shade-proto/pom.xml
@@ -0,0 +1,89 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache.sedona</groupId>
<artifactId>sedona-parent</artifactId>
<version>1.7.1-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>

<artifactId>shade-proto</artifactId>
<name>${project.groupId}:${project.artifactId}</name>
<url>http://sedona.apache.org/</url>
<packaging>jar</packaging>

<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

<dependencies>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>4.28.0</version>
<scope>runtime</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/java</sourceDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.5.0</version>
<configuration>
<relocations>
<relocation>
<pattern>com.google.protobuf</pattern>
<shadedPattern>proto4</shadedPattern>
</relocation>
</relocations>
<createDependencyReducedPom>false</createDependencyReducedPom>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>build-helper-maven-plugin</artifactId>
<version>3.3.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>attach-artifact</goal>
</goals>
<configuration>
<artifacts>
<artifact>
<file>${project.build.directory}/${project.build.finalName}.jar</file>
<type>jar</type>
<classifier>shaded</classifier>
</artifact>
</artifacts>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
42 changes: 42 additions & 0 deletions spark/common/pom.xml
Expand Up @@ -48,6 +48,18 @@
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.sedona</groupId>
<artifactId>shade-proto</artifactId>
<version>${project.version}</version>
<classifier>shaded</classifier>
<exclusions>
<exclusion>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.compat.version}</artifactId>
Expand Down Expand Up @@ -94,6 +106,12 @@
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<exclusions>
<exclusion>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.xerial</groupId>
Expand Down Expand Up @@ -220,6 +238,30 @@
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>minio</artifactId>
<version>1.20.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>io.minio</groupId>
<artifactId>minio</artifactId>
<version>8.5.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>${hadoop.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client-api</artifactId>
<version>${hadoop.version}</version>
<scope>test</scope>
</dependency>
<dependency> <!-- Generally this will be provided by the runtime's spark install -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-graphx_${scala.compat.version}</artifactId>
@@ -0,0 +1,51 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.sedona.sql.datasources.osmpbf;

import java.util.Iterator;
import java.util.NoSuchElementException;
import org.apache.sedona.sql.datasources.osmpbf.build.Osmformat;
import org.apache.sedona.sql.datasources.osmpbf.extractors.DenseNodeExtractor;
import org.apache.sedona.sql.datasources.osmpbf.model.OsmNode;

/** Iterates over dense nodes by index, extracting one OsmNode per call to next(). */
public class DenseNodeIterator implements Iterator<OsmNode> {
  Osmformat.StringTable stringTable;
  int idx;
  long nodesSize;
  DenseNodeExtractor extractor;

  public DenseNodeIterator(
      long nodesSize, Osmformat.StringTable stringTable, DenseNodeExtractor extractor) {
    this.stringTable = stringTable;
    this.nodesSize = nodesSize;
    this.idx = 0;
    this.extractor = extractor;
  }

  @Override
  public boolean hasNext() {
    return idx < nodesSize;
  }

  @Override
  public OsmNode next() {
    // Honor the Iterator contract: fail instead of extracting past the last node.
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    OsmNode node = extractor.extract(idx, stringTable);
    idx += 1;
    return node;
  }
}