Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add Geography user-defined type #1811

Closed
wants to merge 10 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
import javax.xml.parsers.ParserConfigurationException;
import org.apache.sedona.common.enums.FileDataSplitter;
import org.apache.sedona.common.enums.GeometryType;
import org.apache.sedona.common.geometryObjects.Geography;
import org.apache.sedona.common.utils.FormatUtils;
import org.apache.sedona.common.utils.GeoHashDecoder;
import org.locationtech.jts.geom.*;
Expand All @@ -44,6 +45,10 @@ public static Geometry geomFromWKT(String wkt, int srid) throws ParseException {
return new WKTReader(geometryFactory).read(wkt);
}

public static Geography geogFromWKT(String wkt, int srid) throws ParseException {
return new Geography(geomFromWKT(wkt, srid));
}

public static Geometry geomFromEWKT(String ewkt) throws ParseException {
if (ewkt == null) {
return null;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@
import java.util.stream.Collectors;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.sedona.common.geometryObjects.Circle;
import org.apache.sedona.common.geometryObjects.Geography;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put the Geography related functions / constructors ... to separate files, instead of mixing with the Geometry function.

Can you also put Geography functions into individual files? The old Functions.java / Constructors.java are too large so it is probably better to put them into individual files such as "GeogFromWKB"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally! I would like to solve the runtime overload problem first...I can try to look harder at how one registers a UDF with more than one signature (maybe it is not possible in Spark?), since that is the part that is currently causing an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible in Spark, the way to do this is quite flexible but awkward.

The implementation of inputTypes can inspect what is the types of the actual expressions passed into it, and return a suitable function signature. Examples are

We can also inspect the types of inputs and run different code depending on input types in eval function:

  • Summary stats function:
    override def eval(input: InternalRow): Any = {
    // Evaluate the input expressions
    val rasterGeom = inputExpressions(0).toRaster(input)
    val band = if (inputExpressions.length >= 2) {
    inputExpressions(1).eval(input).asInstanceOf[Int]
    } else {
    1
    }
    val noData = if (inputExpressions.length >= 3) {
    inputExpressions(2).eval(input).asInstanceOf[Boolean]
    } else {
    true
    }
    // Check if the raster geometry is null
    if (rasterGeom == null) {
    null
    } else {
    val summaryStatsAll = RasterBandAccessors.getSummaryStatsAll(rasterGeom, band, noData)
    if (summaryStatsAll == null) {
    return null
    }
    // Create an InternalRow with the summaryStatsAll
    InternalRow.fromSeq(summaryStatsAll.map(_.asInstanceOf[Any]))
    }
    }

Inferred expression encapsulates the above function overloading mechanism of Spark and supports delegating the Spark expression to Java functions according to their arity. It is possible to extend it to support more complex function overloading rules.

import org.apache.sedona.common.sphere.Spheroid;
import org.apache.sedona.common.subDivide.GeometrySubDivider;
import org.apache.sedona.common.utils.*;
Expand Down Expand Up @@ -784,6 +785,10 @@ public static byte[] asEWKB(Geometry geometry) {
return GeomUtils.getEWKB(geometry);
}

public static byte[] geogAsEWKB(Geography geography) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not call is asEWKB?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I add that it fails to compile with:

[ERROR] /workspaces/sedona/spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala:509: error: ambiguous reference to overloaded definition,
[ERROR] both method asEWKB in class Functions of type (x$1: org.apache.sedona.common.geometryObjects.Geography)Array[Byte]
[ERROR] and  method asEWKB in class Functions of type (x$1: org.locationtech.jts.geom.Geometry)Array[Byte]
[ERROR] match expected type org.apache.spark.sql.sedona_sql.expressions.InferrableFunction
[ERROR]     extends InferredExpression(Functions.asEWKB _) {
[ERROR]                                          ^
[ERROR] one error found

(That said, I can't any call to ST_AsWKB() on a geography or a geometry to work after this change, so maybe there is something else wrong!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a (quite complex syntax) for referring to a specific overload of functions in Scala. Even if we support argument type based overloading in Inferred expression, we still have to list all overloads we want to delegate to. For instance

InferredExpression(
  (g: Geography) => Functions.asEWKB(g),
  (g: Geometry) => Functions.asEWKB(g))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having Geometry and Geography from different packages / repositories looks a bit strange to me, though I understand the tradeoff here. Is it a good opportunity to formally introduce the sedona geo types for this in this effort?

return asEWKB(geography.getGeometry());
}

public static String asHexEWKB(Geometry geom, String endian) {
if (endian.equalsIgnoreCase("NDR")) {
return GeomUtils.getHexEWKB(geom, ByteOrderValues.LITTLE_ENDIAN);
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.sedona.common.geometryObjects;

import org.locationtech.jts.geom.Geometry;

public class Geography {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think extends Geometry might be better? CC @Kontinuation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Dewey that we'd better define Geography differently to avoid possible misuse. The internal representation of Geography may change as we integrate with libraries that supports spherical geometry, then the JTS Geometry representation of Geography will become optional. This design gives us the flexibility of opting out JTS Geometry when it is not needed.

private Geometry geometry;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any extra overhead to using a Geometry internally? like time or space cost in the constructor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be (e.g., we use some other library to do an overlay operation like intersection and have to convert from that library's representation back to JTS). Keeping the field private I think should at least provide a route to changing the internal implementation if there are performance issues in the future (but also happy to hear suggestions otherwise!)


public Geography(Geometry geometry) {
this.geometry = geometry;
}

public Geometry getGeometry() {
return this.geometry;
}

public String toString() {
return this.geometry.toText();
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
import com.esotericsoftware.kryo.io.Output;
import java.io.Serializable;
import org.apache.sedona.common.geometryObjects.Circle;
import org.apache.sedona.common.geometryObjects.Geography;
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryCollection;
Expand All @@ -36,7 +37,7 @@
* Provides methods to efficiently serialize and deserialize geometry types.
*
* <p>Supports Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon,
* GeometryCollection, Circle and Envelope types.
* GeometryCollection, Circle, Envelope, and Geography types.
*
* <p>First byte contains {@link Type#id}. Then go type-specific bytes, followed by user-data
* attached to the geometry.
Expand All @@ -63,6 +64,9 @@ public void write(Kryo kryo, Output out, Object object) {
out.writeDouble(envelope.getMaxX());
out.writeDouble(envelope.getMinY());
out.writeDouble(envelope.getMaxY());
} else if (object instanceof Geography) {
writeType(out, Type.GEOGRAPHY);
writeGeometry(kryo, out, ((Geography) object).getGeometry());
} else {
throw new UnsupportedOperationException(
"Cannot serialize object of type " + object.getClass().getName());
Expand Down Expand Up @@ -118,6 +122,10 @@ public Object read(Kryo kryo, Input input, Class aClass) {
return new Envelope();
}
}
case GEOGRAPHY:
{
return new Geography(readGeometry(kryo, input));
}
default:
throw new UnsupportedOperationException(
"Cannot deserialize object of type " + geometryType);
Expand Down Expand Up @@ -145,7 +153,8 @@ private Geometry readGeometry(Kryo kryo, Input input) {
private enum Type {
SHAPE(0),
CIRCLE(1),
ENVELOPE(2);
ENVELOPE(2),
GEOGRAPHY(3);

private final int id;

Expand Down
42 changes: 42 additions & 0 deletions python/sedona/core/geom/geography.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

import pickle

from sedona.utils.decorators import require


class Geography:

def __init__(self, geometry):
self._geom = geometry
self.userData = None

def getUserData(self):
return self.userData

@classmethod
def from_jvm_instance(cls, java_obj):
return Geography(java_obj.geometry)

@classmethod
def serialize_for_java(cls, geogs):
return pickle.dumps(geogs)

@require(["Geography"])
def create_jvm_instance(self, jvm):
return jvm.Geography(self._geom)
1 change: 1 addition & 0 deletions python/sedona/register/java_libs.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ class SedonaJvmLib(Enum):
KNNQuery = "org.apache.sedona.core.spatialOperator.KNNQuery"
RangeQuery = "org.apache.sedona.core.spatialOperator.RangeQuery"
Envelope = "org.locationtech.jts.geom.Envelope"
Geography = "org.apache.sedona.common.geometryObjects.Geography"
GeoSerializerData = (
"org.apache.sedona.python.wrapper.adapters.GeoSparkPythonConverter"
)
Expand Down
2 changes: 1 addition & 1 deletion python/sedona/spark/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@
from sedona.sql.st_constructors import *
from sedona.sql.st_functions import *
from sedona.sql.st_predicates import *
from sedona.sql.types import GeometryType, RasterType
from sedona.sql.types import GeometryType, GeographyType, RasterType
from sedona.utils import KryoSerializer, SedonaKryoRegistrator
from sedona.utils.adapter import Adapter
from sedona.utils.geoarrow import dataframe_to_arrow
26 changes: 26 additions & 0 deletions python/sedona/sql/types.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@
SedonaRaster = None

from ..utils import geometry_serde
from ..core.geom.geography import Geography


class GeometryType(UserDefinedType):
Expand Down Expand Up @@ -60,6 +61,31 @@ def scalaUDT(cls):
return "org.apache.spark.sql.sedona_sql.UDT.GeometryUDT"


class GeographyType(UserDefinedType):

@classmethod
def sqlType(cls):
return BinaryType()

def serialize(self, obj):
return geometry_serde.serialize(obj._geom)

def deserialize(self, datum):
geom, offset = geometry_serde.deserialize(datum)
return Geography(geom)

@classmethod
def module(cls):
return "sedona.sql.types"

def needConversion(self):
return True

@classmethod
def scalaUDT(cls):
return "org.apache.spark.sql.sedona_sql.UDT.GeographyUDT"


class RasterType(UserDefinedType):

@classmethod
Expand Down
9 changes: 7 additions & 2 deletions python/sedona/utils/geometry_adapter.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,23 +15,28 @@
# specific language governing permissions and limitations
# under the License.

from typing import Union

from shapely.geometry.base import BaseGeometry

from sedona.core.geom.envelope import Envelope
from sedona.core.geom.geography import Geography
from sedona.core.jvm.translate import JvmGeometryAdapter
from sedona.utils.spatial_rdd_parser import GeometryFactory


class GeometryAdapter:

@classmethod
def create_jvm_geometry_from_base_geometry(cls, jvm, geom: BaseGeometry):
def create_jvm_geometry_from_base_geometry(
cls, jvm, geom: Union[BaseGeometry, Geography]
):
"""
:param jvm:
:param geom:
:return:
"""
if isinstance(geom, Envelope):
if isinstance(geom, (Envelope, Geography)):
jvm_geom = geom.create_jvm_instance(jvm)
else:
decoded_geom = GeometryFactory.to_bytes(geom)
Expand Down
9 changes: 9 additions & 0 deletions python/sedona/utils/prep.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,8 @@
)
from shapely.geometry.base import BaseGeometry

from ..core.geom.geography import Geography


def assign_all() -> bool:
geoms = [
Expand All @@ -41,6 +43,7 @@ def assign_all() -> bool:
]
assign_udt_shapely_objects(geoms=geoms)
assign_user_data_to_shapely_objects(geoms=geoms)
assign_udt_geography()
return True


Expand All @@ -55,3 +58,9 @@ def assign_udt_shapely_objects(geoms: List[type(BaseGeometry)]) -> bool:
def assign_user_data_to_shapely_objects(geoms: List[type(BaseGeometry)]) -> bool:
for geom in geoms:
geom.getUserData = lambda geom_instance: geom_instance.userData


def assign_udt_geography():
from sedona.sql.types import GeographyType

Geography.__UDT__ = GeographyType()
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.spark.sql.sedona_sql.UDT

import org.apache.sedona.common.geometrySerde.GeometrySerializer;
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._
import org.json4s.JsonDSL._
import org.json4s.JsonAST.JValue
import org.apache.sedona.common.geometryObjects.Geography;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly we probably don't even need to have a new Java Geography in sedona-common because the storage model of Geography is identical to Geometry (unless we want to annotate on the edge interpolation algorithm?). So I would say we just have a GeographyUDT and it uses JTS Geometry out of the box.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking a while to circle back here!

I am worried that a strict subclass will lead to functions that accept a Geometry being able to also implicitly accept a Geography, which in general is a footgun for getting accidentally invalid results (Geometry should only ever be compared with Geometry; Geography should only ever be compared with Geography). A subclass would also tie the implementation of the coordinate storage to JTS, which would mean a copy if we ever use something like s2geometry to do actual computations.

Apologies if I'm missing something obvious here!


class GeographyUDT extends UserDefinedType[Geography] {
override def sqlType: DataType = BinaryType

override def pyUDT: String = "sedona.sql.types.GeographyType"

override def userClass: Class[Geography] = classOf[Geography]

override def serialize(obj: Geography): Array[Byte] =
GeometrySerializer.serialize(obj.getGeometry())

override def deserialize(datum: Any): Geography = {
datum match {
case value: Array[Byte] => new Geography(GeometrySerializer.deserialize(value))
}
}

override private[sql] def jsonValue: JValue = {
super.jsonValue mapField {
case ("class", _) => "class" -> this.getClass.getName.stripSuffix("$")
case other: Any => other
}
}

override def equals(other: Any): Boolean = other match {
case _: UserDefinedType[_] => other.isInstanceOf[GeographyUDT]
case _ => false
}

override def hashCode(): Int = userClass.hashCode()
}

case object GeographyUDT
extends org.apache.spark.sql.sedona_sql.UDT.GeographyUDT
with scala.Serializable
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,14 @@ package org.apache.spark.sql.sedona_sql.UDT

import org.apache.spark.sql.types.UDTRegistration
import org.locationtech.jts.geom.Geometry
import org.apache.sedona.common.geometryObjects.Geography;
import org.locationtech.jts.index.SpatialIndex

object UdtRegistratorWrapper {

def registerAll(): Unit = {
UDTRegistration.register(classOf[Geometry].getName, classOf[GeometryUDT].getName)
UDTRegistration.register(classOf[Geography].getName, classOf[GeographyUDT].getName)
UDTRegistration.register(classOf[SpatialIndex].getName, classOf[IndexUDT].getName)
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,20 @@ case class ST_GeomFromWKT(inputExpressions: Seq[Expression])
}
}

/**
* Return a Geography from a WKT string
*
* @param inputExpressions
* This function takes a geometry string and a srid. The string format must be WKT.
*/
case class ST_GeogFromWKT(inputExpressions: Seq[Expression])
extends InferredExpression(Constructors.geogFromWKT _) {

protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]) = {
copy(inputExpressions = newChildren)
}
}

/**
* Return a Geometry from a OGC Extended WKT string
*
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -506,7 +506,7 @@ case class ST_AsBinary(inputExpressions: Seq[Expression])
}

case class ST_AsEWKB(inputExpressions: Seq[Expression])
extends InferredExpression(Functions.asEWKB _) {
extends InferredExpression(Functions.geogAsEWKB _) {

protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]) = {
copy(inputExpressions = newChildren)
Expand Down
Loading
Loading