[WIP] Add Geography user-defined type #1811
Conversation
import org.locationtech.jts.geom.Geometry;

public class Geography {
I think extends Geometry might be better? CC @Kontinuation
I agree with Dewey that we'd better define Geography differently to avoid possible misuse. The internal representation of Geography may change as we integrate with libraries that support spherical geometry, and then the JTS Geometry representation of Geography will become optional. This design gives us the flexibility of opting out of JTS Geometry when it is not needed.
Current CI failure:

import os
import pyspark
from sedona.spark import SedonaContext
from sedona.sql.st_functions import ST_AsEWKB

if "SPARK_HOME" in os.environ:
    del os.environ["SPARK_HOME"]

pyspark_version = pyspark.__version__[:pyspark.__version__.rfind(".")]
config = (
    SedonaContext.builder()
    .config(
        "spark.jars",
        "spark-shaded/target/sedona-spark-shaded-3.5_2.12-1.7.1-SNAPSHOT.jar",
    )
    .config(
        "spark.jars.packages",
        "org.datasyslab:geotools-wrapper:1.7.0-28.5",
    )
    .config(
        "spark.jars.repositories",
        "https://artifacts.unidata.ucar.edu/repository/unidata-all",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)

df = sedona.createDataFrame([("POINT (0 1)", )], ["wkt"]).selectExpr("ST_geomfromwkt(wkt) as geom")
df.select(ST_AsEWKB(df.geom))
#> AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "st_asewkb(geom)" due to data type mismatch: Parameter 1 requires the "BINARY" type, however "geom" has the type "BINARY".;
#> 'Project [unresolvedalias(org.apache.spark.sql.sedona_sql.expressions.ST_AsEWKB, Some(org.apache.spark.sql.Column$$Lambda$3140/390376711@55e5b47e))]
#> +- Project [org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT AS geom#21]
#> +- LogicalRDD [wkt#19], false
import org.apache.spark.sql.types._
import org.json4s.JsonDSL._
import org.json4s.JsonAST.JValue
import org.apache.sedona.common.geometryObjects.Geography;
Honestly we probably don't even need to have a new Java Geography in sedona-common because the storage model of Geography is identical to Geometry (unless we want to annotate on the edge interpolation algorithm?). So I would say we just have a GeographyUDT and it uses JTS Geometry out of the box.
Sorry for taking a while to circle back here!
I am worried that a strict subclass will lead to functions that accept a Geometry being able to also implicitly accept a Geography, which in general is a footgun for getting accidentally invalid results (Geometry should only ever be compared with Geometry; Geography should only ever be compared with Geography). A subclass would also tie the implementation of the coordinate storage to JTS, which would mean a copy if we ever use something like s2geometry to do actual computations.
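The subclass footgun can be sketched in a language-agnostic way (a minimal Python sketch; the class names are hypothetical stand-ins for the Java types, not Sedona APIs):

```python
class Geometry:
    """Stand-in for org.locationtech.jts.geom.Geometry (planar edges)."""
    def __init__(self, wkt):
        self.wkt = wkt

# Subclass design: a Geography *is a* Geometry, so planar-only functions
# silently accept it.
class SubclassGeography(Geometry):
    pass

# Wrapper design: a Geography *has a* Geometry, so planar-only functions
# reject it up front.
class WrapperGeography:
    def __init__(self, geometry):
        self._geometry = geometry  # private: representation can change later

def planar_intersects(a, b):
    """A planar-only predicate; calling it with spherical data is invalid."""
    if not (isinstance(a, Geometry) and isinstance(b, Geometry)):
        raise TypeError("planar_intersects requires Geometry inputs")
    return a.wkt == b.wkt  # placeholder for a real spatial predicate

geom = Geometry("POINT (0 1)")
sub = SubclassGeography("POINT (0 1)")            # accepted silently: the footgun
wrapped = WrapperGeography(Geometry("POINT (0 1)"))

assert planar_intersects(geom, sub)               # the type check cannot object
try:
    planar_intersects(geom, wrapped)              # the wrapper is rejected
except TypeError:
    print("wrapper rejected")
```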
Apologies if I'm missing something obvious here!
@@ -784,6 +785,10 @@ public static byte[] asEWKB(Geometry geometry) {
    return GeomUtils.getEWKB(geometry);
  }

  public static byte[] geogAsEWKB(Geography geography) {
why not call it asEWKB?
If I add that it fails to compile with:
[ERROR] /workspaces/sedona/spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala:509: error: ambiguous reference to overloaded definition,
[ERROR] both method asEWKB in class Functions of type (x$1: org.apache.sedona.common.geometryObjects.Geography)Array[Byte]
[ERROR] and method asEWKB in class Functions of type (x$1: org.locationtech.jts.geom.Geometry)Array[Byte]
[ERROR] match expected type org.apache.spark.sql.sedona_sql.expressions.InferrableFunction
[ERROR] extends InferredExpression(Functions.asEWKB _) {
[ERROR] ^
[ERROR] one error found
(That said, I can't get any call to ST_AsWKB() on a geography or a geometry to work after this change, so maybe there is something else wrong!)
There's a (quite complex) syntax for referring to a specific overload of a function in Scala. Even if we support argument-type-based overloading in InferredExpression, we still have to list all the overloads we want to delegate to. For instance:
InferredExpression(
(g: Geography) => Functions.asEWKB(g),
(g: Geometry) => Functions.asEWKB(g))
Having Geometry and Geography from different packages / repositories looks a bit strange to me, though I understand the tradeoff here. Is it a good opportunity to formally introduce the sedona geo types for this in this effort?
import org.locationtech.jts.geom.Geometry;

public class Geography {
    private Geometry geometry;
is there any extra overhead to using a Geometry internally? like time or space cost in the constructor.
There might be (e.g., we use some other library to do an overlay operation like intersection and have to convert from that library's representation back to JTS). Keeping the field private I think should at least provide a route to changing the internal implementation if there are performance issues in the future (but also happy to hear suggestions otherwise!)
python/sedona/register/java_libs.py
Outdated
@@ -26,6 +26,7 @@ class SedonaJvmLib(Enum):
    KNNQuery = "org.apache.sedona.core.spatialOperator.KNNQuery"
    RangeQuery = "org.apache.sedona.core.spatialOperator.RangeQuery"
    Envelope = "org.locationtech.jts.geom.Envelope"
    Geography = "import org.apache.sedona.common.geometryObjects.Geography"
should the word import be here?
Good catch!
@@ -29,6 +29,7 @@
import java.util.stream.Collectors;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.sedona.common.geometryObjects.Circle;
import org.apache.sedona.common.geometryObjects.Geography;
Let's put the Geography-related functions / constructors ... into separate files, instead of mixing them with the Geometry functions. Can you also put the Geography functions into individual files? The old Functions.java / Constructors.java are too large, so it is probably better to put them into individual files such as "GeogFromWKB"?
Totally! I would like to solve the runtime overload problem first...I can try to look harder at how one registers a UDF with more than one signature (maybe it is not possible in Spark?), since that is the part that is currently causing an issue.
It is possible in Spark; the way to do this is quite flexible but awkward. The implementation of inputTypes can inspect the types of the actual expressions passed into it and return a suitable function signature. Examples:
- Summary stats function (lines 140 to 150 in a8da3cd):

  override def inputTypes: Seq[AbstractDataType] = {
    if (inputExpressions.length == 1) {
      Seq(RasterUDT)
    } else if (inputExpressions.length == 2) {
      Seq(RasterUDT, IntegerType)
    } else if (inputExpressions.length == 3) {
      Seq(RasterUDT, IntegerType, BooleanType)
    } else {
      Seq(RasterUDT)
    }
  }

- Inferred expression: https://github.com/apache/sedona/blob/sedona-1.7.0/spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/InferredExpression.scala#L61-L79
We can also inspect the types of the inputs and run different code depending on the input types in the eval function:
- Summary stats function (lines 105 to 131 in a8da3cd):

  override def eval(input: InternalRow): Any = {
    // Evaluate the input expressions
    val rasterGeom = inputExpressions(0).toRaster(input)
    val band = if (inputExpressions.length >= 2) {
      inputExpressions(1).eval(input).asInstanceOf[Int]
    } else {
      1
    }
    val noData = if (inputExpressions.length >= 3) {
      inputExpressions(2).eval(input).asInstanceOf[Boolean]
    } else {
      true
    }
    // Check if the raster geometry is null
    if (rasterGeom == null) {
      null
    } else {
      val summaryStatsAll = RasterBandAccessors.getSummaryStatsAll(rasterGeom, band, noData)
      if (summaryStatsAll == null) {
        return null
      }
      // Create an InternalRow with the summaryStatsAll
      InternalRow.fromSeq(summaryStatsAll.map(_.asInstanceOf[Any]))
    }
  }
Inferred expression encapsulates the above function overloading mechanism of Spark and supports delegating the Spark expression to Java functions according to their arity. It is possible to extend it to support more complex function overloading rules.
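The overload-table idea above can be sketched outside of Spark (a minimal Python sketch; the type tags and function names are illustrative, not the actual InferredExpression API):

```python
# Placeholder implementations standing in for the Geometry / Geography
# overloads of a function such as asEWKB.
def as_ewkb_geometry(geom):
    return b"\x01" + geom.encode()   # placeholder serialization

def as_ewkb_geography(geog):
    return b"\x02" + geog.encode()

# Map from input-type signature to implementation, analogous to listing
# each overload we want to delegate to.
OVERLOADS = {
    ("geometry",): as_ewkb_geometry,
    ("geography",): as_ewkb_geography,
}

def dispatch(kind, value):
    """Pick the implementation from the runtime type tag of the input,
    mirroring how inputTypes/eval can inspect the actual expressions."""
    try:
        impl = OVERLOADS[(kind,)]
    except KeyError:
        raise TypeError(f"no overload for input type {kind!r}")
    return impl(value)

print(dispatch("geometry", "POINT (0 1)")[:1])   # b'\x01'
print(dispatch("geography", "POINT (0 1)")[:1])  # b'\x02'
```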
@@ -23,7 +23,7 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes}
 import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
 import org.apache.spark.sql.catalyst.util.ArrayData
-import org.apache.spark.sql.sedona_sql.UDT.GeometryUDT
+import org.apache.spark.sql.sedona_sql.UDT.{GeometryUDT, GeographyUDT}
For functions that support both Geometry and Geography, I believe we should overload the functions with Geometry and Geography variants. This is not currently supported by InferredExpression? @Kontinuation @zhangfengcdt
(That is how PostGIS does it, although we can also punt on that and use a different prefix if we have to do that to get started)
I think creating a base class for GeographyUDT and GeometryUDT can promote code reuse and simplify the implementation of common operations. By doing this, we can ensure a clear hierarchy and facilitate maintainability, especially when dealing with operations applicable to both geography and geometry types.
Also, for the Geography and Geometry, we can define a common base GeoObject as well, and have a Geometry implementation in Sedona that extends this base class but use JTS Geometry internally to implement algorithms.
Our RasterUDT could potentially be rebased on this as well.
I quite like the idea of having base classes that we own (e.g., it would enable things like SpatialRDD<T extends SedonaAbstractGeo> without making Geography/Raster a subclass of Geometry). I wonder if we might want to defer that level of refactoring slightly (and keep our initial efforts focused on the minimum requirement for Iceberg?)
If we define an annotation named @SupportGeography, we could have:

public static int nPoints(@SupportGeography Geometry geometry) {
    return geometry.getNumPoints();
}

I have not verified this idea yet; I will try it out next week. This is only for functions that do not treat geometry and geography differently. For functions that have special semantics when operating on geography, we still need to support argument-type-based function overloading.
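The annotation idea translates roughly to the following sketch (Python decorator standing in for the proposed @SupportGeography Java annotation; the names are hypothetical, and this only applies to functions that treat geometry and geography identically):

```python
def supports_geography(fn):
    """Mark a function as safe to call with either a Geometry or a Geography,
    analogous to the proposed @SupportGeography parameter annotation."""
    fn.supports_geography = True
    return fn

@supports_geography
def n_points(coords):
    # Counting points is purely structural, so the result is the same for
    # planar and spherical data.
    return len(coords)

# A registration layer could consult the marker before deciding whether a
# Geography argument may be unwrapped into its coordinate representation:
assert getattr(n_points, "supports_geography", False)
assert n_points([(0, 1), (2, 3)]) == 2
```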
just some thoughts on the types
Thank you all for the reviews! I will:
...and see if I can get that to build. Obviously we don't want to have every function special-cased, and Feng + Kristin's suggestions should be implemented, but we might be able to get the minimum requirement for Iceberg type support before having to invest in that level of refactoring. (@Kontinuation if you end up getting here before I have time to implement that, feel free to let me know and start with the right thing!)
I have taken over this work and will continue working on #1828
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
[SEDONA-711] my subject
What changes were proposed in this PR?
This PR adds a GeographyUDT to match the GeometryUDT. This matches how the types are defined in Iceberg and Parquet, and maps to the concept of "edges": "spherical" in GeoParquet and GeoArrow.

I'm happy to discuss the scope of the initial implementation in this PR...I'd propose starting with a thin wrapper around a JTS geometry (after all: coordinates are coordinates! And many structural operations are identical in a geometry/geography context). Things like serializers and WKB/WKT IO are also nicely defined for a Geometry already, and I don't think there's a strong case for reinventing those in Java land (but feel free to correct me!).
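For reference, the "edges": "spherical" concept corresponds to GeoParquet-style column metadata along these lines (a hand-written, simplified sketch, not output produced by this PR):

```python
# Minimal GeoParquet-style column metadata: planar edges for a geometry
# column vs. spherical edges for a geography column.
geo_metadata = {
    "columns": {
        "geom": {"encoding": "WKB", "edges": "planar"},
        "geog": {"encoding": "WKB", "edges": "spherical"},
    }
}

print(geo_metadata["columns"]["geog"]["edges"])  # spherical
```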
How was this patch tested?
Work in progress!
Did this PR include necessary documentation updates?
Work in progress!