[DOCS] add shapefiles documentation page (#1837)

apache · Mar 3, 2025 · cf60618
1 parent 5e6a673
commit cf60618

Showing 3 changed files with 217 additions and 62 deletions.
diff --git a/docs/tutorial/files/shapefiles-sedona-spark.md b/docs/tutorial/files/shapefiles-sedona-spark.md
@@ -0,0 +1,215 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Shapefiles with Apache Sedona and Spark
+
+This post explains how to read Shapefiles with Apache Sedona and Spark.
+
+A Shapefile is “an Esri vector data storage format for storing the location, shape, and attributes of geographic features.”  The Shapefile format is proprietary, but [the spec is open](https://www.esri.com/content/dam/esrisites/sitecore-archive/Files/Pdfs/library/whitepapers/pdfs/shapefile.pdf).
+
+Shapefiles have many limitations but are extensively used, so it’s beneficial that they are readable by Sedona.
+
+Let’s look at how to read Shapefiles with Sedona and Spark.
+
+## Read Shapefiles with Sedona and Spark
+
+Let’s start by creating a Shapefile with GeoPandas and Shapely:
+
+```python
+import geopandas as gpd
+from shapely.geometry import Point
+
+point1 = Point(0, 0)
+point2 = Point(1, 1)
+
+data = {
+    'name': ['Point A', 'Point B'],
+    'value': [10, 20],
+    'geometry': [point1, point2]
+}
+
+gdf = gpd.GeoDataFrame(data, geometry='geometry')
+gdf.to_file("/tmp/my_geodata.shp")
+```
+
+Here are the files that are output:
+
+```
+/tmp/
+  my_geodata.cpg
+  my_geodata.dbf
+  my_geodata.shp
+  my_geodata.shx
+```
+
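+A Shapefile is not a single file. It is a set of files that share a base name: `.shp` holds the geometries, `.shx` is the index into the `.shp` file, `.dbf` stores the attribute table, and `.cpg` names the character encoding of the `.dbf` text.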
+
+Here’s how to read a Shapefile into a Sedona DataFrame powered by Spark:
+
+```python
+df = sedona.read.format("shapefile").load("/tmp/my_geodata.shp")
+df.show()
+```
+
+```
++-----------+-------+-----+
+|   geometry|   name|value|
++-----------+-------+-----+
+|POINT (0 0)|Point A|   10|
+|POINT (1 1)|Point B|   20|
++-----------+-------+-----+
+```
+
+You can also see the unique record number for each row in the Shapefile as follows:
+
+```python
+df = (
+    sedona.read.format("shapefile").option("key.name", "FID")
+    .load("/tmp/my_geodata.shp")
+)
+df.show()
+```
+
+```
++-----------+---+-------+-----+
+|   geometry|FID|   name|value|
++-----------+---+-------+-----+
+|POINT (0 0)|  1|Point A|   10|
+|POINT (1 1)|  2|Point B|   20|
++-----------+---+-------+-----+
+```
+
+The name of the geometry column is `geometry` by default. You can change it with the `geometry.name` option. If one of the non-spatial attributes is also named `geometry`, you must set `geometry.name` to avoid a conflict.
+
+```python
+df = sedona.read.format("shapefile").option("geometry.name", "geom").load("/path/to/shapefile")
+```
+
+The character encoding of string attributes is inferred from the `.cpg` file. If you see garbled values in string fields, you can manually specify the correct charset using the `charset` option. For example:
+
+=== "Scala"
+
+    ```scala
+    val df = sedona.read.format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")
+    ```
+
+=== "Java"
+
+    ```java
+    Dataset<Row> df = sedona.read().format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile");
+    ```
+
+=== "Python"
+
+    ```python
+    df = sedona.read.format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")
+    ```
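Before overriding `charset`, it can help to check which encoding the `.cpg` sidecar file actually declares. The sketch below is plain Python and independent of Sedona. The helper name `declared_encoding` is hypothetical, and it assumes the `.cpg` file holds a single encoding name (such as `UTF-8`) as plain text and uses a lowercase `.cpg` extension.

```python
from pathlib import Path
from typing import Optional


def declared_encoding(shp_path: str) -> Optional[str]:
    """Return the encoding name stored in the .cpg sidecar, or None if absent."""
    # Assumption: the sidecar sits next to the .shp file with a lowercase suffix.
    cpg_path = Path(shp_path).with_suffix(".cpg")
    if not cpg_path.is_file():
        return None
    # Assumption: the .cpg file holds one short encoding name as plain text.
    text = cpg_path.read_text(encoding="ascii", errors="replace").strip()
    return text or None
```

If the declared name does not match how the `.dbf` text was really encoded, that mismatch is a likely cause of garbled strings, and passing the correct name through the `charset` option is the fix described above.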
+
+Let’s see how to load many Shapefiles into a Sedona DataFrame.
+
+## Load many Shapefiles with Sedona
+
+Suppose you have a directory with many Shapefiles as follows:
+
+```
+/tmp/shapefiles/
+  file1.cpg
+  file1.dbf
+  file1.shp
+  file1.shx
+  file2.cpg
+  file2.dbf
+  file2.shp
+  file2.shx
+```
+
+The directory contains two `.shp` files and other supporting files.
+
+Here’s how to load many Shapefiles into a Sedona DataFrame:
+
+```python
+df = sedona.read.format("shapefile").load("/tmp/shapefiles")
+df.show()
+```
+
+```
++-----------+-------+-----+
+|   geometry|   name|value|
++-----------+-------+-----+
+|POINT (0 0)|Point A|   10|
+|POINT (1 1)|Point B|   20|
+|POINT (2 2)|Point C|   10|
+|POINT (3 3)|Point D|   20|
++-----------+-------+-----+
+```
+
+You can just pass the directory where the Shapefiles are stored, and the Sedona reader will pick them up.
+
+The input path can be a directory containing one or multiple Shapefiles or a path to a `.shp` file.
+
+* When the input path is a directory, all Shapefiles directly under the directory will be loaded. If you want to load all Shapefiles in subdirectories, please specify `.option("recursiveFileLookup", "true")`.
+* When the input path is a `.shp` file, that Shapefile will be loaded. Sedona will look for sibling files (`.dbf`, `.shx`, etc.) with the same main file name and load them automatically.
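The sibling-file idea in the second bullet can be illustrated with a few lines of plain Python. This is only a sketch of the concept, not Sedona's implementation. The helper name `sibling_files` is hypothetical, and the match on the main file name is case-sensitive here, which may differ from what Sedona does.

```python
from pathlib import Path
from typing import List


def sibling_files(shp_path: str) -> List[Path]:
    """List files next to a .shp file that share its main file name."""
    shp = Path(shp_path)
    return sorted(
        p
        for p in shp.parent.iterdir()
        # Same directory, same name before the extension, but not the .shp itself.
        if p.is_file() and p.stem == shp.stem and p != shp
    )
```

For the directory listing above, calling the helper on `/tmp/shapefiles/file1.shp` would pick up `file1.cpg`, `file1.dbf`, and `file1.shx`, and ignore everything belonging to `file2`.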
+
+## Advantages of Shapefiles
+
+Shapefiles are deeply integrated into the Esri ecosystem and extensively used in many services.
+
+You can output a Shapefile from Esri and then read it with another engine like Sedona.
+
+However, Esri created the Shapefile format in the early 1990s, so it has many limitations.
+
+## Limitations of Shapefiles
+
+Here are some of the disadvantages of Shapefiles:
+
+* They don’t support complex geometries
+* They don’t support NULL values
+* They round numbers
+* They have poor Unicode support
+* They don’t allow long field names (the `.dbf` format limits names to 10 characters)
+* They have a 2 GB file size limit
+* Their spatial indexes are slower than those of alternatives
+* They can’t store datetimes
+
+See this page for more information on [the limitations of Shapefiles](http://switchfromshapefile.org/).
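The field-name limit in the list above comes from the dBASE (`.dbf`) attribute table, where a field name holds at most 10 characters. Here is a minimal sketch, assuming a writer simply truncates longer names (real tools may rename columns differently), that flags columns whose names would collide after truncation. The function name `truncation_collisions` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

DBF_MAX_FIELD_NAME = 10  # dBASE field names hold at most 10 characters


def truncation_collisions(columns: List[str]) -> Dict[str, List[str]]:
    """Map each truncated name to the original columns that collapse onto it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in columns:
        groups[name[:DBF_MAX_FIELD_NAME]].append(name)
    # Keep only truncated names shared by more than one column.
    return {short: names for short, names in groups.items() if len(names) > 1}
```

For example, `population_2010` and `population_2020` both shorten to `population`, so a check like this is worth running before exporting a wide table to a Shapefile.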
+
+Due to these limitations, other options are worth investigating.
+
+## Shapefile alternatives
+
+Several other file formats are good options for geospatial data:
+
+* Iceberg
+* [GeoParquet](../geoparquet-sedona-spark)
+* FlatGeobuf
+* [GeoPackage](../geopackage-sedona-spark)
+* [GeoJSON](../geojson-sedona-spark)
+* [CSV](../csv-geometry-sedona-spark)
+* GeoTIFF
+
+## Why Sedona does not support Shapefile writes
+
+Sedona does not write Shapefiles for two main reasons:
+
+1. Each Shapefile is a collection of files, which is hard for distributed systems to write.
+2. A Shapefile has a hard 2 GB size limit, which isn’t large enough for some spatial data.
+
+## Conclusion
+
+Shapefiles are a legacy file format still used in many production applications. However, they have many limitations and aren’t the best option in a modern data pipeline unless you need compatibility with legacy systems.
diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md
@@ -502,68 +502,7 @@ Since v`1.7.0`, Sedona supports loading Shapefile as a DataFrame.
 
 The input path can be a directory containing one or multiple shapefiles, or path to a `.shp` file.
 
-- When the input path is a directory, all shapefiles directly under the directory will be loaded. If you want to load all shapefiles in subdirectories, please specify `.option("recursiveFileLookup", "true")`.
-- When the input path is a `.shp` file, that shapefile will be loaded. Sedona will look for sibling files (`.dbf`, `.shx`, etc.) with the same main file name and load them automatically.
-
-The name of the geometry column is `geometry` by default. You can change the name of the geometry column using the `geometry.name` option. If one of the non-spatial attributes is named "geometry", `geometry.name` must be configured to avoid conflict.
-
-=== "Scala/Java"
-
-    ```scala
-    val df = sedona.read.format("shapefile").option("geometry.name", "geom").load("/path/to/shapefile")
-    ```
-
-=== "Java"
-
-    ```java
-    Dataset<Row> df = sedona.read().format("shapefile").option("geometry.name", "geom").load("/path/to/shapefile")
-    ```
-
-=== "Python"
-
-    ```python
-    df = sedona.read.format("shapefile").option("geometry.name", "geom").load("/path/to/shapefile")
-    ```
-
-Each record in shapefile has a unique record number, that record number is not loaded by default. If you want to include record number in the loaded DataFrame, you can set the `key.name` option to the name of the record number column:
-
-=== "Scala/Java"
-
-    ```scala
-    val df = sedona.read.format("shapefile").option("key.name", "FID").load("/path/to/shapefile")
-    ```
-
-=== "Java"
-
-    ```java
-    Dataset<Row> df = sedona.read().format("shapefile").option("key.name", "FID").load("/path/to/shapefile")
-    ```
-
-=== "Python"
-
-    ```python
-    df = sedona.read.format("shapefile").option("key.name", "FID").load("/path/to/shapefile")
-    ```
-
-The character encoding of string attributes are inferred from the `.cpg` file. If you see garbled values in string fields, you can manually specify the correct charset using the `charset` option. For example:
-
-=== "Scala/Java"
-
-    ```scala
-    val df = sedona.read.format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")
-    ```
-
-=== "Java"
-
-    ```java
-    Dataset<Row> df = sedona.read().format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")
-    ```
-
-=== "Python"
-
-    ```python
-    df = sedona.read.format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")
-    ```
+See [this page](../files/shapefiles-sedona-spark) for more information on loading Shapefiles.
 
 ## Load GeoParquet
 

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -65,6 +65,7 @@ nav:
               - GeoPackage: tutorial/files/geopackage-sedona-spark.md
               - GeoParquet: tutorial/files/geoparquet-sedona-spark.md
               - GeoJSON: tutorial/files/geojson-sedona-spark.md
+              - Shapefiles: tutorial/files/shapefiles-sedona-spark.md
           - Map visualization SQL app:
               - Scala/Java: tutorial/viz.md
               - Use Apache Zeppelin: tutorial/zeppelin.md