From ed70d967bc3ece10178e6c7276d6794b45dd4778 Mon Sep 17 00:00:00 2001
From: Ram Sriharsha
Date: Mon, 14 Aug 2017 20:29:51 +0200
Subject: [PATCH] Bump up version for release + Add docs

---
 README.md | 128 ++++++++++++++++++++++++++++++++++++++++++++----------
 build.sbt |   2 +-
 2 files changed, 106 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 49ce45e..be61792 100644
--- a/README.md
+++ b/README.md
@@ -4,48 +4,59 @@
 [![codecov.io](http://codecov.io/github/harsha2010/magellan/coverage.svg?branch=master)](http://codecov.io/github/harsha2010/magellan?branch=maste)

-Geospatial data is pervasive, and spatial context is a very rich signal of user intent and relevance
-in search and targeted advertising and an important variable in many predictive analytics applications.
-For example when a user searches for “canyon hotels”, without location awareness the top result
-or sponsored ads might be for hotels in the town “Canyon, TX”.
-However, if they are are near the Grand Canyon, the top results or ads should be for nearby hotels.
-Thus a search term combined with location context allows for much more relevant results and ads.
-Similarly a variety of other predictive analytics problems can leverage location as a context.
+Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries.

-To leverage spatial context in a predictive analytics application requires us to be able
-to parse these datasets at scale, join them with target datasets that contain point in space information,
-and answer geometrical queries efficiently.
+The application developer writes standard SQL or DataFrame queries to evaluate geometric expressions. The execution engine takes care of laying data out efficiently in memory during query processing, picking the right query plan, and optimizing query execution with cheap and efficient spatial indices, all while presenting a declarative abstraction to the developer.

-Magellan is an open source library Geospatial Analytics using Spark as the underlying engine.
-We leverage Catalyst’s pluggable optimizer to efficiently execute spatial joins, SparkSQL’s powerful operators to express geometric queries in a natural DSL, and Pyspark’s Python integration to provide Python bindings.
+Magellan is the first library to extend Spark SQL to provide a relational abstraction for geospatial analytics. I see it as an evolution of geospatial analytics engines into the emerging world of big data: it provides abstractions that are developer friendly and can be leveraged by anyone who understands or uses Apache Spark, while simultaneously showcasing an execution engine that is state of the art for geospatial analytics on big data.
+
+# Version Release Notes
+
+You can find notes on the various released versions [here](https://github.com/harsha2010/magellan/releases).

 # Linking

-You can link against this library using the following coordinates:
+You can link against the latest release using the following coordinates:

     groupId: harsha2010
     artifactId: magellan
-    version: 1.0.4-s_2.11
+    version: 1.0.5-s_2.11

 # Requirements

-This library requires Spark 2.1+ and Scala 2.11
+v1.0.5 requires Spark 2.1+ and Scala 2.11

 # Capabilities

-The library currently supports the [ESRI](https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf) format files as well as [GeoJSON](http://geojson.org).
+The library currently supports reading the following formats:
+
+ * [ESRI](https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf)
+ * [GeoJSON](http://geojson.org)
+ * [OSM-XML](http://wiki.openstreetmap.org/wiki/OSM_XML)
+ * [WKT](https://en.wikipedia.org/wiki/Well-known_text)

 We aim to support the full suite of [OpenGIS Simple Features for SQL ](http://www.opengeospatial.org/standards/sfs) spatial predicate functions and operators together with additional topological functions.

-Capabilities we aim to support include (ones currently available are highlighted):
+The following geometries are currently supported:

-**Geometries**: **Point**, **LineString**, **Polygon**, **MultiPoint**, **MultiPolygon**, MultiLineString, GeometryCollection
-
-**Predicates**: **Intersects**, Touches, Disjoint, Crosses, **Within**, **Contains**, Overlaps, Equals, Covers
-
-**Operations**: Union, Distance, **Intersection**, Symmetric Difference, Convex Hull, Envelope, Buffer, Simplify, Valid, Area, Length
+**Geometries**:
+
+ * Point
+ * LineString
+ * Polygon
+ * MultiPoint
+ * MultiPolygon (treated as a collection of Polygons and read in as a row per polygon by the GeoJSON reader)

-**Scala and Python API**
+The following predicates are currently supported:
+
+ * Intersects
+ * Contains
+ * Within
+
+The following languages are currently supported:
+
+ * Scala
+
@@ -158,6 +169,77 @@ A few common packages you might want to import within Magellan

 A Databricks notebook with similar examples is published [here](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/137058993011870/882779309834027/6891974485343070/latest.html) for convenience.

+# Spatial indexes
+
+Starting with v1.0.5, Magellan supports spatial indexes.
+The supported spatial indexes are the so-called [ZOrderCurves](https://en.wikipedia.org/wiki/Z-order_curve).
+
+
+Given a column of shapes, one can index the shapes to a given precision using a geohash indexer by doing the following:
+
+```scala
+df.withColumn("index", $"polygon" index 30)
+```
+
+This produces a new column called ```index```, which is a list of ZOrder Curves of precision ```30``` that, taken together, cover the polygon.
+
+# Creating Indexes while loading data
+
+The Spatial Relations (GeoJSON, Shapefile, OSM-XML) all have the ability to automatically index the geometries while loading them.
+
+To turn this feature on, pass in the parameter ```magellan.index = true``` and optionally a value for ```magellan.index.precision``` (default = 30) while loading the data as follows:
+
+```scala
+spark.read.format("magellan")
+  .option("magellan.index", "true")
+  .option("magellan.index.precision", "25")
+  .load(s"$path")
+```
+
+This creates an additional column called ```index```, which holds the list of ZOrder Curves of the given precision that cover each geometry in the dataset.
+
+# Spatial Joins
+
+Magellan leverages Spark SQL and has support for joins by default. However, these joins are by default not aware that the columns are geometric, so a join of the form
+
+```scala
+  points.join(polygons).where($"point" within $"polygon")
+```
+
+will be treated as a Cartesian Join followed by a predicate.
+In some cases, especially when the polygon dataset is small (on the order of 100 to 10,000 polygons), this is fast enough.
+However, when the number of polygons is much larger than that, you will need spatial joins to scale this computation.
+
+To enable spatial joins in Magellan, add a spatial join rule to Spark by injecting the following code before the join:
+
+```scala
+  magellan.Utils.injectRules(spark)
+```
+
+
+Furthermore, during the join, you will need to give Magellan a hint of the precision at which to create indices for the join.
+
+You can do this by annotating either of the dataframes involved in the join with a Spatial Join Hint as follows:
+
+```scala
+val indexed = df.index(30) // after load, or
+val indexedOnLoad = spark.read.format(...).load(..).index(30) // during load
+```
+
+Then a join of the form
+
+```scala
+  points.join(polygons).where($"point" within $"polygon") // or
+
+  points.join(polygons index 30).where($"point" within $"polygon")
+```
+
+automatically uses indexes to speed up the join.
+
+
+# Developer Channel
+
+Please visit [Gitter](https://gitter.im/magellan-dev/Lobby?source=orgpage) to discuss Magellan, obtain help from developers or report issues.
 # Magellan Blog

 For more details on Magellan and thoughts around Geospatial Analytics and the optimizations chosen for this project, please visit my [blog](https://magellan.ghost.io)
diff --git a/build.sbt b/build.sbt
index 1a004a6..354c1d2 100644
--- a/build.sbt
+++ b/build.sbt
@@ -1,6 +1,6 @@
 name := "magellan"

-version := "1.0.5-SNAPSHOT"
+version := "1.0.5"

 organization := "harsha2010"
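
Note on the README's spatial index section: a Z-order curve interleaves the bits of a cell's x and y grid coordinates, so cells that are close in space tend to share a code prefix. The sketch below is a minimal, standalone illustration of that idea in plain Scala. `ZOrderSketch`, `interleave`, and `encode` are hypothetical names and are not part of Magellan's API. How Magellan maps its `precision` value to bits is not stated in this patch, so the `bitsPerDim` parameter is an assumption made for illustration only.

```scala
// Hypothetical helper, NOT Magellan's implementation: a plain Morton (Z-order) encoder.
object ZOrderSketch {

  /** Interleaves the low `bits` bits of x and y.
    * Bits of x land on even positions, bits of y on odd positions.
    */
  def interleave(x: Int, y: Int, bits: Int): Long = {
    require(bits >= 0 && bits <= 31, "bits must be in [0, 31]")
    var code = 0L
    var i = 0
    while (i < bits) {
      code |= ((x >> i) & 1).toLong << (2 * i)
      code |= ((y >> i) & 1).toLong << (2 * i + 1)
      i += 1
    }
    code
  }

  /** Maps a lon/lat pair to a grid cell with `bitsPerDim` bits per dimension
    * (an assumed parameterization) and returns the cell's Z-order code.
    */
  def encode(lon: Double, lat: Double, bitsPerDim: Int): Long = {
    val cells = 1L << bitsPerDim
    // Scale a coordinate into [0, cells) and clamp the upper boundary into the last cell.
    def cell(v: Double, min: Double, max: Double): Int = {
      val scaled = ((v - min) / (max - min) * cells).toLong
      math.min(math.max(scaled, 0L), cells - 1).toInt
    }
    interleave(cell(lon, -180.0, 180.0), cell(lat, -90.0, 90.0), bitsPerDim)
  }
}
```

For instance, `ZOrderSketch.interleave(5, 3, 3)` yields `27` (binary `011011`): the bits of `5` (`101`) occupy the even positions and the bits of `3` (`011`) the odd ones. An index-aware join can compare such codes first and evaluate the exact geometric predicate only for candidates whose cells match, which is the general reason an index avoids a full Cartesian comparison.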