
[WIP] Add Geography user-defined type #1811

Closed
paleolimbot wants to merge 10 commits

Conversation

@paleolimbot (Member) commented Feb 13, 2025

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

This PR adds a GeographyUDT to match the existing GeometryUDT. This matches how the two types are defined in Iceberg and Parquet, and corresponds to the concept of edges = "spherical" in GeoParquet and GeoArrow.

I'm happy to discuss the scope of the initial implementation in this PR. I'd propose starting with a thin wrapper around a JTS geometry (after all, coordinates are coordinates, and many structural operations are identical in a geometry/geography context). Things like serializers and WKB/WKT IO are also already well defined for a Geometry, and I don't think there's a strong case for reinventing those in Java land (but feel free to correct me!).
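
For illustration only, here is a minimal sketch of what a thin-wrapper GeographyUDT reusing JTS WKB IO could look like. The class names, the package, and the choice of WKB as the serialized form are assumptions for the sketch, not necessarily what this PR implements:

package org.apache.spark.sql.sedona_sql.UDT

import org.apache.spark.sql.types.{BinaryType, DataType, UserDefinedType}
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.{WKBReader, WKBWriter}

// Stand-in for the Java wrapper in sedona-common: a Geography is just a
// JTS Geometry plus the "edges are spherical" interpretation.
class Geography(val geometry: Geometry) extends Serializable

// Hypothetical UDT mirroring GeometryUDT: store the wrapped geometry as WKB.
class GeographyUDT extends UserDefinedType[Geography] {
  override def sqlType: DataType = BinaryType

  override def serialize(obj: Geography): Array[Byte] =
    new WKBWriter().write(obj.geometry)

  override def deserialize(datum: Any): Geography = datum match {
    case bytes: Array[Byte] => new Geography(new WKBReader().read(bytes))
  }

  override def userClass: Class[Geography] = classOf[Geography]
}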

How was this patch tested?

Work in progress!

Did this PR include necessary documentation updates?

Work in progress!

@paleolimbot changed the title from "[WIP] Add Geography userdefined type" to "[WIP] Add Geography user-defined type" on Feb 13, 2025

import org.locationtech.jts.geom.Geometry;

public class Geography {
Member:

I think extends Geometry might be better? CC @Kontinuation

Member:

I agree with Dewey that we'd better define Geography differently to avoid possible misuse. The internal representation of Geography may change as we integrate with libraries that support spherical geometry, at which point the JTS Geometry representation of Geography would become optional. This design gives us the flexibility of opting out of the JTS Geometry when it is not needed.

@paleolimbot (Member Author) commented Feb 14, 2025

Current CI failure:

import os
import pyspark
from sedona.spark import SedonaContext
from sedona.sql.st_functions import ST_AsEWKB
if "SPARK_HOME" in os.environ:
    del os.environ["SPARK_HOME"]
pyspark_version = pyspark.__version__[:pyspark.__version__.rfind(".")]

config = (
    SedonaContext.builder()
    .config(
        "spark.jars",
        "spark-shaded/target/sedona-spark-shaded-3.5_2.12-1.7.1-SNAPSHOT.jar",
    )
    .config(
        "spark.jars.packages",
        "org.datasyslab:geotools-wrapper:1.7.0-28.5",
    )
    .config(
        "spark.jars.repositories",
        "https://artifacts.unidata.ucar.edu/repository/unidata-all",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)
df = sedona.createDataFrame([("POINT (0 1)", )], ["wkt"]).selectExpr("ST_geomfromwkt(wkt) as geom")
df.select(ST_AsEWKB(df.geom))
#> AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "st_asewkb(geom)" due to data type mismatch: Parameter 1 requires the "BINARY" type, however "geom" has the type "BINARY".;
#> 'Project [unresolvedalias( **org.apache.spark.sql.sedona_sql.expressions.ST_AsEWKB**  , Some(org.apache.spark.sql.Column$$Lambda$3140/390376711@55e5b47e))]
#> +- Project [ **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**   AS geom#21]
#>    +- LogicalRDD [wkt#19], false

import org.apache.spark.sql.types._
import org.json4s.JsonDSL._
import org.json4s.JsonAST.JValue
import org.apache.sedona.common.geometryObjects.Geography;
Member:

Honestly, we probably don't even need a new Java Geography in sedona-common, because the storage model of Geography is identical to Geometry (unless we want to annotate the edge interpolation algorithm?). So I would say we just have a GeographyUDT and it uses JTS Geometry out of the box.

Member Author:

Sorry for taking a while to circle back here!

I am worried that a strict subclass would let functions that accept a Geometry also implicitly accept a Geography, which in general is a footgun for accidentally getting invalid results (a Geometry should only ever be compared with a Geometry; a Geography should only ever be compared with a Geography). A subclass would also tie the coordinate storage to JTS, which would mean a copy if we ever use something like s2geometry to do the actual computations.

Apologies if I'm missing something obvious here!
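
To make the footgun concrete, a tiny hypothetical (not code from this PR): if Geography extended Geometry, every existing Geometry-only function would silently accept it and do planar math on lon/lat coordinates.

import org.locationtech.jts.geom.Geometry

// A Cartesian helper that is only meaningful for planar geometries.
def planarDistance(a: Geometry, b: Geometry): Double = a.distance(b)

// With Geography <: Geometry this would compile and run, but the answer is a
// planar distance in degrees that ignores spherical edges:
//   planarDistance(someGeography, someGeometry)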

@@ -784,6 +785,10 @@ public static byte[] asEWKB(Geometry geometry) {
return GeomUtils.getEWKB(geometry);
}

public static byte[] geogAsEWKB(Geography geography) {
Collaborator:

why not call it asEWKB?

Member Author:

If I do that, it fails to compile with:

[ERROR] /workspaces/sedona/spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala:509: error: ambiguous reference to overloaded definition,
[ERROR] both method asEWKB in class Functions of type (x$1: org.apache.sedona.common.geometryObjects.Geography)Array[Byte]
[ERROR] and  method asEWKB in class Functions of type (x$1: org.locationtech.jts.geom.Geometry)Array[Byte]
[ERROR] match expected type org.apache.spark.sql.sedona_sql.expressions.InferrableFunction
[ERROR]     extends InferredExpression(Functions.asEWKB _) {
[ERROR]                                          ^
[ERROR] one error found

(That said, I can't get any call to ST_AsEWKB() on a geography or a geometry to work after this change, so maybe there is something else wrong!)

Member:

There's a (quite complex) syntax for referring to a specific overload of a function in Scala. Even if we support argument-type-based overloading in InferredExpression, we still have to list all the overloads we want to delegate to. For instance:

InferredExpression(
  (g: Geography) => Functions.asEWKB(g),
  (g: Geometry) => Functions.asEWKB(g))
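
For completeness, the other way to pin a specific overload is to ascribe the expected function type to the eta-expansion. This is a sketch assuming both asEWKB overloads existed, which is exactly the ambiguous situation from the compile error above:

import org.apache.sedona.common.Functions
import org.apache.sedona.common.geometryObjects.Geography
import org.locationtech.jts.geom.Geometry

// The ascribed type tells the compiler which overload to eta-expand.
val geomAsEWKB: Geometry => Array[Byte] = Functions.asEWKB _
val geogAsEWKB: Geography => Array[Byte] = Functions.asEWKB _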

Collaborator:

Having Geometry and Geography come from different packages / repositories looks a bit strange to me, though I understand the tradeoff here. Is this a good opportunity to formally introduce Sedona's own geo types as part of this effort?

import org.locationtech.jts.geom.Geometry;

public class Geography {
private Geometry geometry;
Collaborator:

Is there any extra overhead to using a Geometry internally, e.g., time or space cost in the constructor?

Member Author:

There might be (e.g., if we use some other library to do an overlay operation like intersection and have to convert from that library's representation back to JTS). Keeping the field private should at least provide a route to changing the internal implementation if there are performance issues in the future (but I'm also happy to hear suggestions otherwise!).
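
As a sketch of that escape hatch (names here are assumptions, not the PR's API): the wrapper could later keep raw WKB, or a representation from another library, and only materialize a JTS Geometry when a caller actually asks for one.

import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.WKBReader

// Because the field is private, the internal representation can change
// without touching callers; they only see a single accessor.
class Geography private (private var geometry: Geometry, private var wkb: Array[Byte]) {
  def this(geometry: Geometry) = this(geometry, null)
  def this(wkb: Array[Byte]) = this(null, wkb)

  // Materialize a JTS view on demand (assumed accessor name).
  def asJts: Geometry = {
    if (geometry == null) geometry = new WKBReader().read(wkb)
    geometry
  }
}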

@@ -26,6 +26,7 @@ class SedonaJvmLib(Enum):
KNNQuery = "org.apache.sedona.core.spatialOperator.KNNQuery"
RangeQuery = "org.apache.sedona.core.spatialOperator.RangeQuery"
Envelope = "org.locationtech.jts.geom.Envelope"
Geography = "import org.apache.sedona.common.geometryObjects.Geography"
Collaborator:

should the word import be here?

Member Author:

Good catch!

@@ -29,6 +29,7 @@
import java.util.stream.Collectors;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.sedona.common.geometryObjects.Circle;
import org.apache.sedona.common.geometryObjects.Geography;
Member:

Let's put the Geography-related functions / constructors into separate files instead of mixing them with the Geometry functions.

Can you also put the Geography functions into individual files? The old Functions.java / Constructors.java are too large, so it is probably better to put them into individual files such as "GeogFromWKB".

Member Author:

Totally! I would like to solve the runtime overload problem first. I can try to look harder at how one registers a UDF with more than one signature (maybe it is not possible in Spark?), since that is the part that is currently causing an issue.

Member:

It is possible in Spark; the way to do this is quite flexible but awkward.

The implementation of inputTypes can inspect the types of the actual expressions passed to it and return a suitable function signature.

We can also inspect the types of the inputs and run different code depending on the input types in the eval function:

  • Summary stats function:

    override def eval(input: InternalRow): Any = {
      // Evaluate the input expressions
      val rasterGeom = inputExpressions(0).toRaster(input)
      val band = if (inputExpressions.length >= 2) {
        inputExpressions(1).eval(input).asInstanceOf[Int]
      } else {
        1
      }
      val noData = if (inputExpressions.length >= 3) {
        inputExpressions(2).eval(input).asInstanceOf[Boolean]
      } else {
        true
      }
      // Check if the raster geometry is null
      if (rasterGeom == null) {
        null
      } else {
        val summaryStatsAll = RasterBandAccessors.getSummaryStatsAll(rasterGeom, band, noData)
        if (summaryStatsAll == null) {
          return null
        }
        // Create an InternalRow with the summaryStatsAll
        InternalRow.fromSeq(summaryStatsAll.map(_.asInstanceOf[Any]))
      }
    }

InferredExpression encapsulates the above function-overloading mechanism of Spark and supports delegating the Spark expression to Java functions according to their arity. It is possible to extend it to support more complex function overloading rules.
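
A minimal sketch of that eval-time dispatch, grounded in the two Java functions this PR already has (asEWKB for Geometry, geogAsEWKB for Geography); the real InferredExpression plumbing is more involved:

import org.apache.sedona.common.Functions
import org.apache.sedona.common.geometryObjects.Geography
import org.locationtech.jts.geom.Geometry

// Branch on the runtime type of the already-deserialized argument and
// delegate to the matching Java function.
def asEWKBDispatch(arg: AnyRef): Array[Byte] = arg match {
  case geog: Geography => Functions.geogAsEWKB(geog)
  case geom: Geometry  => Functions.asEWKB(geom)
  case null            => null
}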

@@ -23,7 +23,7 @@ import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.sedona_sql.UDT.GeometryUDT
import org.apache.spark.sql.sedona_sql.UDT.{GeometryUDT, GeographyUDT}
Member:

For functions that support both Geometry and Geography, I believe we should overload the functions with Geometry and Geography versions. This is not currently supported by InferredExpression, right? @Kontinuation @zhangfengcdt

Member Author:

(That is how PostGIS does it, although we can also punt on that and use a different prefix if that's what it takes to get started.)

Collaborator:

I think creating a base class for GeographyUDT and GeometryUDT can promote code reuse and simplify the implementation of common operations. By doing this, we can ensure a clear hierarchy and facilitate maintainability, especially when dealing with operations applicable to both geography and geometry types.

Also, for Geography and Geometry, we can define a common base GeoObject as well, and have a Geometry implementation in Sedona that extends this base class but uses JTS Geometry internally to implement algorithms.

Collaborator:

Our RasterUDT could also potentially be rebased on this as well.

Member Author:

I quite like the idea of having base classes that we own (e.g., it would enable things like SpatialRDD<T extends SedonaAbstractGeo> without making Geography/Raster a subclass of Geometry). I wonder if we might want to defer that level of refactoring slightly and keep our initial efforts focused on the minimum requirement for Iceberg?

@Kontinuation (Member):

If we have Geography defined as public class Geography { private Geometry geometry; }, we can make geometry functions work for geography using the following approach:

We define an annotation named SupportGeography for annotating function parameters. Functions that do not treat geometry and geography differently could add this annotation to their parameters:

public static int nPoints(@SupportGeography Geometry geometry) {
  return geometry.getNumPoints();
}

InferredExpression will extract the annotations of these functions and call the delegated function even when a geography argument is passed in. It will extract the geometry value from the geography object and pass that geometry value into the function.

I have not verified this idea yet, will try it out next week.

This is only for functions that do not treat geometry and geography differently. For functions that have special semantics when operating on geography, we still need to support argument-type-based function overloading.
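
A sketch of the unwrapping step this implies (the accessor name is an assumption, and as noted above the annotation approach itself is unverified):

import org.apache.sedona.common.geometryObjects.Geography

// Before delegating to a parameter marked @SupportGeography, replace a
// Geography argument with the JTS geometry it wraps.
def unwrapIfGeography(arg: AnyRef): AnyRef = arg match {
  case geog: Geography => geog.getGeometry // assumed accessor on the wrapper
  case other           => other
}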

@zhangfengcdt (Collaborator) left a comment:

just some thoughts on the types


@paleolimbot (Member Author):

Thank you all for the reviews! I will:

  • Remove the Python experiments I have here
  • Special case the one function that has an overload (st_asewkb())

...and see if I can get that to build. Obviously we don't want every function special-cased, and Feng's and Kristin's suggestions should be implemented, but we might be able to get the minimum requirement for Iceberg type support before having to invest in that level of refactoring. (@Kontinuation, if you end up getting here before I have time to implement that, feel free to let me know and start with the right thing!)

@Kontinuation (Member):

(@Kontinuation if you end up getting here before I have time to implement that, feel free to let me know and start with the right thing!).

I have taken over this work and will continue working on #1828

@jiayuasu closed this on Feb 24, 2025