-
Notifications
You must be signed in to change notification settings - Fork 697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SEDONA-468] Add useSpheroid provision in ST_DWithin #1208
Conversation
# Conflicts: # spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala # spark/common/src/test/scala/org/apache/sedona/sql/BroadcastIndexJoinSuite.scala # spark/common/src/test/scala/org/apache/sedona/sql/TestBaseScala.scala # spark/common/src/test/scala/org/apache/sedona/sql/predicateJoinTestScala.scala
@@ -132,6 +132,12 @@ class JoinQueryDetector(sparkSession: SparkSession) extends Strategy { | |||
Some(JoinQueryDetection(left, right, leftShape, ST_Buffer(Seq(rightShape, distance)), SpatialPredicate.INTERSECTS, isGeography = false, condition, Some(extraCondition))) | |||
case Some(And(extraCondition, ST_DWithin(Seq(leftShape, rightShape, distance)))) => | |||
Some(JoinQueryDetection(left, right, leftShape, ST_Buffer(Seq(rightShape, distance)), SpatialPredicate.INTERSECTS, isGeography = false, condition, Some(extraCondition))) | |||
case Some(ST_DWithin(Seq(leftShape, rightShape, distance, useSpheroid))) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When useSpheroid
is true
, this should not use ST_Buffer because buffering a lon/lat geometry with meter distance does not make any sense and will lead to wrong result. When useSpheroid = true
, we should make it equivalent to ST_DistanceSpheroid() < XXX
.
I think we should totally get rid of ST_Buffer to avoid confusion. All ST_DWithin
patterns, once get matched, should be translated to distance join, like what we did for ST_Distance and ST_DistanceSpheroid. Buffering a geometry with a spheroid distance is already taken care by the optimized distance join triggered ST_DistanceSpheroid() < XXX
, which is quite complex.
case Some(LessThanOrEqual(ST_Distance(Seq(leftShape, rightShape)), distance)) =>
Some(JoinQueryDetection(left, right, leftShape, rightShape, SpatialPredicate.INTERSECTS, false, condition, Some(distance)))
// ST_DistanceSpheroid
case Some(LessThanOrEqual(ST_DistanceSpheroid(Seq(leftShape, rightShape)), distance)) =>
Some(JoinQueryDetection(left, right, leftShape, rightShape, SpatialPredicate.INTERSECTS, true, condition, Some(distance)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, done
docs/api/flink/Predicate.md
Outdated
@@ -60,9 +60,11 @@ true | |||
|
|||
## ST_DWithin | |||
|
|||
Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance'. This function essentially checks if the shortest distance between the envelope of the two geometries is <= the provided distance. | |||
Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance' (in metres). This function essentially checks if the shortest distance between the envelope of the two geometries is <= the provided distance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change every Sphere
to Spheroid
in the doc and other code. In the geodesic distance world, Sphere distance
means great circle distance while spheroid distance
means geodesic distance. These are 2 different things.
In our case, we use spheroid
not sphere
.
In addition, it is worth noting that when we use spheroid
distance, there is a current limitation in Sedona: we calculate the distance between the centroids of two geometries, not the shortest distance between two geometries. This tiny error usually does not matter much to end users because spheroid distances are mainly used for getting the distance between two objects across the globe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The explanation should be
Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance'.
If useSpheroid
is passed true
, ST_DWithin uses Sedona's ST_DistanceSpheroid to check the spheroid distance between the centroids of two geometries. The unit of the distance in this case is meter.
If useSpheroid
is passed false
, ST_DWithin uses Euclidean distance and the unit of the distance is the same as the CRS of the geometries. To obtain the correct result, please consider using ST_Transform to put data in an appropriate CRS.
If useSpheroid
is not given, it defaults to false
docs/api/flink/Predicate.md
Outdated
@@ -60,9 +60,11 @@ true | |||
|
|||
## ST_DWithin | |||
|
|||
Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance'. This function essentially checks if the shortest distance between the envelope of the two geometries is <= the provided distance. | |||
Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance' (in metres). This function essentially checks if the shortest distance between the envelope of the two geometries is <= the provided distance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The explanation should be
Introduction: Returns true if 'leftGeometry' and 'rightGeometry' are within a specified 'distance'.
If useSpheroid
is passed true
, ST_DWithin uses Sedona's ST_DistanceSpheroid to check the spheroid distance between the centroids of two geometries. The unit of the distance in this case is meter.
If useSpheroid
is passed false
, ST_DWithin uses Euclidean distance and the unit of the distance is the same as the CRS of the geometries. To obtain the correct result, please consider using ST_Transform to put data in an appropriate CRS.
If useSpheroid
is not given, it defaults to false
docs/api/sql/Predicate.md
Outdated
|
||
Format: `ST_DWithin (leftGeometry: Geometry, rightGeometry: Geometry, distance: Double)` | ||
If `useSpheroid` is passed true, ST_DWithin uses Sedona's ST_DistanceSpheroid. If `useSpheroid` is passed false or not passed at all, DWithin uses Euclidean distance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the same explanation above.
@@ -95,6 +99,28 @@ trait TestBaseScala extends FunSpec with BeforeAndAfterAll { | |||
sparkSession.read.format("binaryFile").load(path) | |||
} | |||
|
|||
|
|||
def generateTestData(): Seq[(Int, Double, Geometry)] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manually generating test data is good. But please do not use randomized data for test because this will create uncertain behaviors in the test and it is really hard to debug. In addition, please make sure it takes a parameter that allows to generate any size of test data as desired.
In addition, when create test data for ST_DWithin join, please make sure you know what the exact number of rows should appear in the result. And you should always use test cases that produce non-zero results.
@Kontinuation Please take a look when you have time. Thanks! |
.../common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala
Outdated
Show resolved
Hide resolved
…rgl" This reverts commit 5352fb9.
@Kontinuation Would you please review again when you have time? Thanks! |
* Add ST_DWithin * Add documentation for ST_DWithin * Remove unwanted code * removed null check test for ST_DWithin * Fix EOF lint error * Add explanation for ST_DWithin * Remove CRS checking logic in ST_DWithin * Add optimized join support for ST_DWithin * Remove test change to resourceFolder * remove unnecessary cast to double * Add broadcast join test * Add example of ST_DWithin in Optimizer.md * Add useSpheroid version to ST_DWithin | Add optimized join support * remove accidental resourceFolder change * Fix mistake in making useSpheroid optional in ST_DWithin * Fix incorrect test data in test_dataframe_api.py * fix failing test in test_predicate.py * Address PR changes | Move ST_DWithin to DistanceJoin * fix failing test * Remove randomness from sphere test case generation * Refactor documentation of ST_DWithin * revert resourceFolder path * Handle complex boolean expressions in ST_DWithin * add a blanket try catch for ST_DWithin to handle complex boolean expressions * add collect to the python test * replace head() with count() * Add null check for geometry column while adding a df to keplergl * Revert "Add null check for geometry column while adding a df to keplergl" This reverts commit 5352fb9.
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
[SEDONA-XXX] my subject
.What changes were proposed in this PR?
How was this patch tested?
Did this PR include necessary documentation updates?
vX.Y.Z
format.