
[WIP] Add Geography user-defined type #1811

Closed
paleolimbot wants to merge 10 commits

Conversation

@paleolimbot (Member) commented Feb 13, 2025

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

This PR adds a GeographyUDT to match the existing GeometryUDT. This matches how the two types are defined in Iceberg and Parquet, and corresponds to the concept of edges = "spherical" in GeoParquet and GeoArrow.

I'm happy to discuss the scope of the initial implementation in this PR. I'd propose starting with a thin wrapper around a JTS geometry (after all, coordinates are coordinates, and many structural operations are identical in a geometry/geography context). Things like serializers and WKB/WKT IO are also already well defined for a Geometry, and I don't think there's a strong case for reinventing those in Java land (but feel free to correct me!).
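
For illustration only, here is a minimal sketch of what a thin-wrapper GeographyUDT reusing JTS WKB IO could look like. The class names, the package, and the choice of WKB as the serialized form are assumptions for the sketch, not necessarily what this PR implements:

package org.apache.spark.sql.sedona_sql.UDT

import org.apache.spark.sql.types.{BinaryType, DataType, UserDefinedType}
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.{WKBReader, WKBWriter}

// Stand-in for the Java wrapper in sedona-common: a Geography is just a
// JTS Geometry plus the "edges are spherical" interpretation.
class Geography(val geometry: Geometry) extends Serializable

// Hypothetical UDT mirroring GeometryUDT: store the wrapped geometry as WKB.
class GeographyUDT extends UserDefinedType[Geography] {
  override def sqlType: DataType = BinaryType

  override def serialize(obj: Geography): Array[Byte] =
    new WKBWriter().write(obj.geometry)

  override def deserialize(datum: Any): Geography = datum match {
    case bytes: Array[Byte] => new Geography(new WKBReader().read(bytes))
  }

  override def userClass: Class[Geography] = classOf[Geography]
}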

How was this patch tested?

Work in progress!

Did this PR include necessary documentation updates?

Work in progress!

@paleolimbot changed the title from "[WIP] Add Geography userdefined type" to "[WIP] Add Geography user-defined type" on Feb 13, 2025

import org.locationtech.jts.geom.Geometry;

public class Geography {
Member:

I think extends Geometry might be better? CC @Kontinuation

Member:

I agree with Dewey that we'd better define Geography differently to avoid possible misuse. The internal representation of Geography may change as we integrate with libraries that support spherical geometry, at which point the JTS Geometry representation of Geography would become optional. This design gives us the flexibility of opting out of the JTS Geometry when it is not needed.

@paleolimbot (Member Author) commented Feb 14, 2025

Current CI failure:

import os
import pyspark
from sedona.spark import SedonaContext
from sedona.sql.st_functions import ST_AsEWKB
if "SPARK_HOME" in os.environ:
    del os.environ["SPARK_HOME"]
pyspark_version = pyspark.__version__[:pyspark.__version__.rfind(".")]

config = (
    SedonaContext.builder()
    .config(
        "spark.jars",
        "spark-shaded/target/sedona-spark-shaded-3.5_2.12-1.7.1-SNAPSHOT.jar",
    )
    .config(
        "spark.jars.packages",
        "org.datasyslab:geotools-wrapper:1.7.0-28.5",
    )
    .config(
        "spark.jars.repositories",
        "https://artifacts.unidata.ucar.edu/repository/unidata-all",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)
df = sedona.createDataFrame([("POINT (0 1)", )], ["wkt"]).selectExpr("ST_geomfromwkt(wkt) as geom")
df.select(ST_AsEWKB(df.geom))
#> AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "st_asewkb(geom)" due to data type mismatch: Parameter 1 requires the "BINARY" type, however "geom" has the type "BINARY".;
#> 'Project [unresolvedalias( **org.apache.spark.sql.sedona_sql.expressions.ST_AsEWKB**  , Some(org.apache.spark.sql.Column$$Lambda$3140/390376711@55e5b47e))]
#> +- Project [ **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**   AS geom#21]
#>    +- LogicalRDD [wkt#19], false

import org.apache.spark.sql.types._
import org.json4s.JsonDSL._
import org.json4s.JsonAST.JValue
import org.apache.sedona.common.geometryObjects.Geography;
Member:

Honestly, we probably don't even need a new Java Geography in sedona-common, because the storage model of Geography is identical to Geometry (unless we want to annotate the edge interpolation algorithm?). So I would say we just have a GeographyUDT and it uses JTS Geometry out of the box.

Member Author:

Sorry for taking a while to circle back here!

I am worried that a strict subclass would let functions that accept a Geometry also implicitly accept a Geography, which in general is a footgun for accidentally getting invalid results (a Geometry should only ever be compared with a Geometry; a Geography should only ever be compared with a Geography). A subclass would also tie the coordinate storage to JTS, which would mean a copy if we ever use something like s2geometry to do the actual computations.

Apologies if I'm missing something obvious here!
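
To make the footgun concrete, a tiny hypothetical (not code from this PR): if Geography extended Geometry, every existing Geometry-only function would silently accept it and do planar math on lon/lat coordinates.

import org.locationtech.jts.geom.Geometry

// A Cartesian helper that is only meaningful for planar geometries.
def planarDistance(a: Geometry, b: Geometry): Double = a.distance(b)

// With Geography <: Geometry this would compile and run, but the answer is a
// planar distance in degrees that ignores spherical edges:
//   planarDistance(someGeography, someGeometry)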

@@ -784,6 +785,10 @@ public static byte[] asEWKB(Geometry geometry) {
return GeomUtils.getEWKB(geometry);
}

public static byte[] geogAsEWKB(Geography geography) {
Collaborator:

why not call it asEWKB?

Member Author:

If I do that, it fails to compile with:

[ERROR] /workspaces/sedona/spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala:509: error: ambiguous reference to overloaded definition,
[ERROR] both method asEWKB in class Functions of type (x$1: org.apache.sedona.common.geometryObjects.Geography)Array[Byte]
[ERROR] and  method asEWKB in class Functions of type (x$1: org.locationtech.jts.geom.Geometry)Array[Byte]
[ERROR] match expected type org.apache.spark.sql.sedona_sql.expressions.InferrableFunction
[ERROR]     extends InferredExpression(Functions.asEWKB _) {
[ERROR]                                          ^
[ERROR] one error found

(That said, I can't get any call to ST_AsEWKB() on a geography or a geometry to work after this change, so maybe there is something else wrong!)

Member:

There's a (quite complex) syntax for referring to a specific overload of a function in Scala. Even if we support argument-type-based overloading in InferredExpression, we still have to list all the overloads we want to delegate to. For instance:

InferredExpression(
  (g: Geography) => Functions.asEWKB(g),
  (g: Geometry) => Functions.asEWKB(g))
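
For completeness, the other way to pin a specific overload is to ascribe the expected function type to the eta-expansion. This is a sketch assuming both asEWKB overloads existed, which is exactly the ambiguous situation from the compile error above:

import org.apache.sedona.common.Functions
import org.apache.sedona.common.geometryObjects.Geography
import org.locationtech.jts.geom.Geometry

// The ascribed type tells the compiler which overload to eta-expand.
val geomAsEWKB: Geometry => Array[Byte] = Functions.asEWKB _
val geogAsEWKB: Geography => Array[Byte] = Functions.asEWKB _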

Collaborator:

Having Geometry and Geography come from different packages / repositories looks a bit strange to me, though I understand the tradeoff here. Is this a good opportunity to formally introduce Sedona's own geo types as part of this effort?

import org.locationtech.jts.geom.Geometry;

public class Geography {
private Geometry geometry;
Collaborator:

Is there any extra overhead to using a Geometry internally, e.g., time or space cost in the constructor?

Member Author:

There might be (e.g., if we use some other library to do an overlay operation like intersection and have to convert from that library's representation back to JTS). Keeping the field private should at least provide a route to changing the internal implementation if there are performance issues in the future (but I'm also happy to hear suggestions otherwise!).
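
As a sketch of that escape hatch (names here are assumptions, not the PR's API): the wrapper could later keep raw WKB, or a representation from another library, and only materialize a JTS Geometry when a caller actually asks for one.

import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.WKBReader

// Because the field is private, the internal representation can change
// without touching callers; they only see a single accessor.
class Geography private (private var geometry: Geometry, private var wkb: Array[Byte]) {
  def this(geometry: Geometry) = this(geometry, null)
  def this(wkb: Array[Byte]) = this(null, wkb)

  // Materialize a JTS view on demand (assumed accessor name).
  def asJts: Geometry = {
    if (geometry == null) geometry = new WKBReader().read(wkb)
    geometry
  }
}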

@@ -26,6 +26,7 @@ class SedonaJvmLib(Enum):
KNNQuery = "org.apache.sedona.core.spatialOperator.KNNQuery"
RangeQuery = "org.apache.sedona.core.spatialOperator.RangeQuery"
Envelope = "org.locationtech.jts.geom.Envelope"
Geography = "import org.apache.sedona.common.geometryObjects.Geography"
Collaborator:

should the word import be here?

Member Author:

Good catch!

@@ -29,6 +29,7 @@
import java.util.stream.Collectors;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.sedona.common.geometryObjects.Circle;
import org.apache.sedona.common.geometryObjects.Geography;
Member:

Let's put the Geography-related functions / constructors into separate files instead of mixing them with the Geometry functions.

Can you also put the Geography functions into individual files? The old Functions.java / Constructors.java are too large, so it is probably better to put them into individual files such as "GeogFromWKB".

Member Author:

Totally! I would like to solve the runtime overload problem first. I can try to look harder at how one registers a UDF with more than one signature (maybe it is not possible in Spark?), since that is the part that is currently causing an issue.

Member:

It is possible in Spark; the way to do this is quite flexible but awkward.

The implementation of inputTypes can inspect the types of the actual expressions passed to it and return a suitable function signature.

We can also inspect the types of the inputs and run different code depending on the input types in the eval function:

  • Summary stats function:

    override def eval(input: InternalRow): Any = {
      // Evaluate the input expressions
      val rasterGeom = inputExpressions(0).toRaster(input)
      val band = if (inputExpressions.length >= 2) {
        inputExpressions(1).eval(input).asInstanceOf[Int]
      } else {
        1
      }
      val noData = if (inputExpressions.length >= 3) {
        inputExpressions(2).eval(input).asInstanceOf[Boolean]
      } else {
        true
      }
      // Check if the raster geometry is null
      if (rasterGeom == null) {
        null
      } else {
        val summaryStatsAll = RasterBandAccessors.getSummaryStatsAll(rasterGeom, band, noData)
        if (summaryStatsAll == null) {
          return null
        }
        // Create an InternalRow with the summaryStatsAll
        InternalRow.fromSeq(summaryStatsAll.map(_.asInstanceOf[Any]))
      }
    }

InferredExpression encapsulates the above function-overloading mechanism of Spark and supports delegating the Spark expression to Java functions according to their arity. It is possible to extend it to support more complex function overloading rules.
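
A minimal sketch of that eval-time dispatch, grounded in the two Java functions this PR already has (asEWKB for Geometry, geogAsEWKB for Geography); the real InferredExpression plumbing is more involved:

import org.apache.sedona.common.Functions
import org.apache.sedona.common.geometryObjects.Geography
import org.locationtech.jts.geom.Geometry

// Branch on the runtime type of the already-deserialized argument and
// delegate to the matching Java function.
def asEWKBDispatch(arg: AnyRef): Array[Byte] = arg match {
  case geog: Geography => Functions.geogAsEWKB(geog)
  case geom: Geometry  => Functions.asEWKB(geom)
  case null            => null
}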

@@ -23,7 +23,7 @@ import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.sedona_sql.UDT.GeometryUDT
import org.apache.spark.sql.sedona_sql.UDT.{GeometryUDT, GeographyUDT}
Member:

For functions that support both Geometry and Geography, I believe we should overload the functions with Geometry and Geography versions. This is not currently supported by InferredExpression, right? @Kontinuation @zhangfengcdt

Member Author:

(That is how PostGIS does it, although we can also punt on that and use a different prefix if that's what it takes to get started.)

Collaborator:

I think creating a base class for GeographyUDT and GeometryUDT can promote code reuse and simplify the implementation of common operations. By doing this, we can ensure a clear hierarchy and facilitate maintainability, especially when dealing with operations applicable to both geography and geometry types.

Also, for Geography and Geometry, we can define a common base GeoObject as well, and have a Geometry implementation in Sedona that extends this base class but uses JTS Geometry internally to implement algorithms.

Collaborator:

Our RasterUDT could also potentially be rebased on this as well.

Member Author:

I quite like the idea of having base classes that we own (e.g., it would enable things like SpatialRDD<T extends SedonaAbstractGeo> without making Geography/Raster a subclass of Geometry). I wonder if we might want to defer that level of refactoring slightly and keep our initial efforts focused on the minimum requirement for Iceberg?

@Kontinuation (Member):

If we have Geography defined as public class Geography { private Geometry geometry; }, we can make geometry functions work for geography using the following approach:

We define an annotation named SupportGeography for annotating function parameters. Functions that do not treat geometry and geography differently could add this annotation to their parameters:

public static int nPoints(@SupportGeography Geometry geometry) {
  return geometry.getNumPoints();
}

InferredExpression will extract the annotations of these functions and call the delegated function even when a geography argument is passed in. It will extract the geometry value from the geography object and pass that geometry value into the function.

I have not verified this idea yet, will try it out next week.

This is only for functions that do not treat geometry and geography differently. For functions that have special semantics when operating on geography, we still need to support argument-type-based function overloading.
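
A sketch of the unwrapping step this implies (the accessor name is an assumption, and as noted above the annotation approach itself is unverified):

import org.apache.sedona.common.geometryObjects.Geography

// Before delegating to a parameter marked @SupportGeography, replace a
// Geography argument with the JTS geometry it wraps.
def unwrapIfGeography(arg: AnyRef): AnyRef = arg match {
  case geog: Geography => geog.getGeometry // assumed accessor on the wrapper
  case other           => other
}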

@zhangfengcdt (Collaborator) left a comment:

just some thoughts on the types


@paleolimbot (Member Author):

Thank you all for the reviews! I will:

  • Remove the Python experiments I have here
  • Special case the one function that has an overload (st_asewkb())

...and see if I can get that to build. Obviously we don't want every function special-cased, and Feng's and Kristin's suggestions should be implemented, but we might be able to get the minimum requirement for Iceberg type support before having to invest in that level of refactoring. (@Kontinuation, if you end up getting here before I have time to implement that, feel free to let me know and start with the right thing!)

@Kontinuation (Member):

(@Kontinuation if you end up getting here before I have time to implement that, feel free to let me know and start with the right thing!).

I have taken over this work and will continue working on #1828

@jiayuasu closed this on Feb 24, 2025