[SEDONA-704] Add Stac Python Wrapper for STAC Reader (#1793)
* [SEDONA-704] Add Stac Python Wrapper for STAC Reader

* update schema

* downgrade pystac version so python 3.7 runtime is supported
zhangfengcdt authored Feb 6, 2025
1 parent 1260245 commit 70d2e51
Showing 8 changed files with 1,028 additions and 0 deletions.
150 changes: 150 additions & 0 deletions docs/api/sql/Stac.md
@@ -146,6 +146,156 @@ In this example, the data source will push down the temporal filter to the under

In this example, the data source will push down the spatial filter to the underlying data source.

# Python API

The Python API allows you to interact with a SpatioTemporal Asset Catalog (STAC) API using the Client class. This class provides methods to open a connection to a STAC API, retrieve collections, and search for items with various filters.

## Client Class

## Methods

### `open(url: str) -> Client`

Opens a connection to the specified STAC API URL.

**Parameters:**

- `url` (*str*): The URL of the STAC API to connect to.
**Example:** `"https://planetarycomputer.microsoft.com/api/stac/v1"`

**Returns:**

- `Client`: An instance of the `Client` class connected to the specified URL.

---

### `get_collection(collection_id: str) -> CollectionClient`

Retrieves a collection client for the specified collection ID.

**Parameters:**

- `collection_id` (*str*): The ID of the collection to retrieve.
**Example:** `"aster-l1t"`

**Returns:**

- `CollectionClient`: An instance of the `CollectionClient` class for the specified collection.

---

### `search(*ids: Union[str, list], collection_id: str, bbox: Optional[list] = None, datetime: Optional[Union[str, datetime.datetime, list]] = None, max_items: Optional[int] = None, return_dataframe: bool = True) -> Union[Iterator[PyStacItem], DataFrame]`

Searches for items in the specified collection with optional filters.

**Parameters:**

- `ids` (*Union[str, list]*): A variable number of item IDs to filter the items.
  **Example:** `"item_id1"` or `["item_id1", "item_id2"]`
- `collection_id` (*str*): The ID of the collection to search in.
  **Example:** `"aster-l1t"`
- `bbox` (*Optional[list]*): A list of bounding boxes for filtering the items. Each bounding box is represented as a list of four float values: `[min_lon, min_lat, max_lon, max_lat]`.
  **Example:** `[[-180.0, -90.0, 180.0, 90.0]]`
- `datetime` (*Optional[Union[str, datetime.datetime, list]]*): A single datetime, RFC 3339-compliant timestamp, or a list of date-time ranges for filtering the items.
  **Example:**
    - `"2020-01-01T00:00:00Z"`
    - `datetime.datetime(2020, 1, 1)`
    - `[["2020-01-01T00:00:00Z", "2021-01-01T00:00:00Z"]]`
- `max_items` (*Optional[int]*): The maximum number of items to return from the search, even if there are more matching results.
  **Example:** `100`
- `return_dataframe` (*bool*): If `True` (default), return the result as a Spark DataFrame instead of an iterator of `PyStacItem` objects.
  **Example:** `True`

**Returns:**

- *Union[Iterator[PyStacItem], DataFrame]*: An iterator of `PyStacItem` objects or a Spark DataFrame that matches the specified filters.
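Because `ids` is declared as `*ids`, item IDs are collected from positional arguments, and every parameter after it (`collection_id`, `bbox`, and so on) can only be passed by keyword. The sketch below illustrates that binding behavior with a hypothetical stand-in, `search_stub`, which mirrors the shape of the signature but is not part of Sedona and performs no search.

```python
from typing import Optional, Union


def search_stub(
    *ids: Union[str, list],
    collection_id: str,
    bbox: Optional[list] = None,
    max_items: Optional[int] = None,
) -> dict:
    """Hypothetical stand-in: only echoes how its arguments were bound."""
    return {
        "ids": ids,
        "collection_id": collection_id,
        "bbox": bbox,
        "max_items": max_items,
    }


# Item IDs are positional; everything after *ids must be passed by keyword.
bound = search_stub("item_id1", "item_id2", collection_id="aster-l1t", max_items=5)
print(bound["ids"])  # ('item_id1', 'item_id2')

# A keyword named `ids` is rejected, because *ids only collects positionals.
try:
    search_stub(ids=["item_id1"], collection_id="aster-l1t")
    keyword_error = None
except TypeError as exc:
    keyword_error = str(exc)
print(keyword_error)
```

With a signature of this shape, a call such as `search(ids=[...], collection_id=...)` raises a `TypeError`, so item IDs should lead the argument list.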

## Sample Code

### Initialize the Client

```python
from sedona.stac.client import Client

# Initialize the client
client = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
```

### Search Items on a Collection Within a Year

```python
items = client.search(
    collection_id="aster-l1t",
    datetime="2020",
    return_dataframe=False
)
```
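The `"2020"` shorthand works because partial dates are expanded into intervals; the `search` docstring in `client.py` lists the rules for `"YYYY"`, `"YYYY-mm"`, `"YYYY-mm-dd"`, and full timestamps. The following is a minimal sketch of those rules as documented. The helper name `expand_date` is hypothetical, and the code is an illustration rather than Sedona's actual implementation.

```python
import calendar
from typing import List


def expand_date(value: str) -> List[str]:
    """Hypothetical helper: expand a partial date into a [start, end] interval."""
    if len(value) == 4:  # "YYYY"
        return [f"{value}-01-01T00:00:00Z", f"{value}-12-31T23:59:59Z"]
    if len(value) == 7:  # "YYYY-mm"
        year, month = int(value[:4]), int(value[5:7])
        last_day = calendar.monthrange(year, month)[1]  # days in that month
        return [f"{value}-01T00:00:00Z", f"{value}-{last_day:02d}T23:59:59Z"]
    if len(value) == 10:  # "YYYY-mm-dd"
        return [f"{value}T00:00:00Z", f"{value}T23:59:59Z"]
    # A full timestamp is used as both the start and the end of the interval.
    return [value, value]


print(expand_date("2020"))     # ['2020-01-01T00:00:00Z', '2020-12-31T23:59:59Z']
print(expand_date("2020-02"))  # ['2020-02-01T00:00:00Z', '2020-02-29T23:59:59Z']
```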

### Search Items on a Collection Within a Month and Max Items

```python
items = client.search(
    collection_id="aster-l1t",
    datetime="2020-05",
    return_dataframe=False,
    max_items=5
)
```

### Search Items with Bounding Box and Interval

```python
# Item IDs are positional (*ids); all other arguments are keyword-only.
items = client.search(
    "AST_L1T_00312272006020322_20150518201805",
    collection_id="aster-l1t",
    bbox=[-180.0, -90.0, 180.0, 90.0],
    datetime=["2006-01-01T00:00:00Z", "2007-01-01T00:00:00Z"],
    return_dataframe=False
)
```

### Search Multiple Items with Multiple Bounding Boxes

```python
bbox_list = [
    [-180.0, -90.0, 180.0, 90.0],
    [-100.0, -50.0, 100.0, 50.0]
]
items = client.search(
    collection_id="aster-l1t",
    bbox=bbox_list,
    return_dataframe=False
)
```
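Each bounding box follows the `[min_lon, min_lat, max_lon, max_lat]` order, and a swapped coordinate is easy to miss in a long list. A small pre-flight check can catch such mistakes before a search is issued. The helper below, `validate_bboxes`, is hypothetical and not part of Sedona; it is a sketch that assumes coordinates are in degrees.

```python
from typing import List, Sequence


def validate_bboxes(bboxes: Sequence[Sequence[float]]) -> List[List[float]]:
    """Hypothetical check for [min_lon, min_lat, max_lon, max_lat] boxes."""
    checked = []
    for box in bboxes:
        if len(box) != 4:
            raise ValueError(f"expected 4 values, got {len(box)}: {box!r}")
        min_lon, min_lat, max_lon, max_lat = (float(v) for v in box)
        if not (-180.0 <= min_lon <= 180.0 and -180.0 <= max_lon <= 180.0):
            raise ValueError(f"longitude out of range: {box!r}")
        if not (-90.0 <= min_lat <= max_lat <= 90.0):
            raise ValueError(f"latitude out of range or out of order: {box!r}")
        checked.append([min_lon, min_lat, max_lon, max_lat])
    return checked


checked_boxes = validate_bboxes([[-180, -90, 180, 90], [-100.0, -50.0, 100.0, 50.0]])
print(checked_boxes)  # [[-180.0, -90.0, 180.0, 90.0], [-100.0, -50.0, 100.0, 50.0]]
```

Longitude order is deliberately not checked, since GeoJSON-style boxes that cross the antimeridian list the larger longitude first. Whether the STAC reader accepts such boxes is not stated in this document.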

### Search Items and Get DataFrame as Return with Multiple Intervals

```python
interval_list = [
    ["2020-01-01T00:00:00Z", "2020-06-01T00:00:00Z"],
    ["2020-07-01T00:00:00Z", "2021-01-01T00:00:00Z"]
]
df = client.search(
    collection_id="aster-l1t",
    datetime=interval_list,
    return_dataframe=True
)
df.show()
```

### Save Items in DataFrame to GeoParquet with Both Bounding Boxes and Intervals

```python
# Save items in DataFrame to GeoParquet with both bounding boxes and intervals
client.get_collection("aster-l1t").save_to_geoparquet(
    output_path="/path/to/output",
    bbox=bbox_list,
    datetime="2020-05"
)
```

These examples demonstrate how to use the Client class to search for items in a STAC collection with various filters and return the results as either an iterator of PyStacItem objects or a Spark DataFrame.
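When `return_dataframe=False`, the result is an iterator, so it can be consumed lazily. The sketch below shows one way to collect a few fields from the first items. It relies only on the `id`, `bbox`, and `datetime` attributes that `pystac.Item` exposes; `summarize_items` is a hypothetical helper, and the demo uses stand-in objects with made-up values instead of a live STAC call.

```python
from datetime import datetime, timezone
from itertools import islice
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def summarize_items(items: Iterable[Any], limit: int = 10) -> List[Dict[str, Any]]:
    """Hypothetical helper: collect id, bbox, and datetime for up to `limit` items."""
    rows = []
    # islice stops after `limit` items without draining the rest of the iterator.
    for item in islice(items, limit):
        rows.append(
            {
                "id": item.id,
                "bbox": item.bbox,
                "datetime": item.datetime.isoformat() if item.datetime else None,
            }
        )
    return rows


# Stand-in objects with made-up values; a real call would pass the iterator
# returned by client.search(..., return_dataframe=False).
fake_items = iter(
    [
        SimpleNamespace(
            id="demo-item-1",
            bbox=[-10.0, -5.0, 10.0, 5.0],
            datetime=datetime(2020, 5, 1, tzinfo=timezone.utc),
        ),
        SimpleNamespace(id="demo-item-2", bbox=None, datetime=None),
        SimpleNamespace(id="demo-item-3", bbox=None, datetime=None),
    ]
)
rows = summarize_items(fake_items, limit=2)
for row in rows:
    print(row)
```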

# References

- STAC Specification: https://stacspec.org/
1 change: 1 addition & 0 deletions python/Pipfile
@@ -29,6 +29,7 @@ attrs="*"
pyarrow="*"
keplergl = "==0.3.2"
pydeck = "===0.8.0"
pystac = "===1.5.0"
rasterio = ">=1.2.10"

[requires]
16 changes: 16 additions & 0 deletions python/sedona/stac/__init__.py
@@ -0,0 +1,16 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
112 changes: 112 additions & 0 deletions python/sedona/stac/client.py
@@ -0,0 +1,112 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
from typing import Union, Optional, Iterator

from sedona.stac.collection_client import CollectionClient

import datetime as python_datetime
from pystac import Item as PyStacItem

from pyspark.sql import DataFrame


class Client:
    def __init__(self, url: str):
        self.url = url

    @classmethod
    def open(cls, url: str):
        """
        Opens a connection to the specified STAC API URL.
        This class method creates an instance of the Client class with the given URL.
        Parameters:
        - url (str): The URL of the STAC API to connect to.
          Example: "https://planetarycomputer.microsoft.com/api/stac/v1"
        Returns:
        - Client: An instance of the Client class connected to the specified URL.
        """
        return cls(url)

    def get_collection(self, collection_id: str):
        """
        Retrieves a collection client for the specified collection ID.
        This method creates an instance of the CollectionClient class for the given collection ID,
        allowing interaction with the specified collection in the STAC API.
        Parameters:
        - collection_id (str): The ID of the collection to retrieve.
          Example: "aster-l1t"
        Returns:
        - CollectionClient: An instance of the CollectionClient class for the specified collection.
        """
        return CollectionClient(self.url, collection_id)

    def search(
        self,
        *ids: Union[str, list],
        collection_id: str,
        bbox: Optional[list] = None,
        datetime: Optional[Union[str, python_datetime.datetime, list]] = None,
        max_items: Optional[int] = None,
        return_dataframe: bool = True,
    ) -> Union[Iterator[PyStacItem], DataFrame]:
        """
        Searches for items in the specified collection with optional filters.
        Parameters:
        - ids (Union[str, list]): A variable number of item IDs to filter the items.
          Example: "item_id1" or ["item_id1", "item_id2"]
        - collection_id (str): The ID of the collection to search in.
          Example: "aster-l1t"
        - bbox (Optional[list]): A list of bounding boxes for filtering the items.
          Each bounding box is represented as a list of four float values: [min_lon, min_lat, max_lon, max_lat].
          Example: [[-180.0, -90.0, 180.0, 90.0]]  # This bounding box covers the entire world.
        - datetime (Optional[Union[str, python_datetime.datetime, list]]): A single datetime, RFC 3339-compliant timestamp,
          or a list of date-time ranges for filtering the items. The datetime can be specified in various formats:
          - "YYYY" expands to ["YYYY-01-01T00:00:00Z", "YYYY-12-31T23:59:59Z"]
          - "YYYY-mm" expands to ["YYYY-mm-01T00:00:00Z", "YYYY-mm-<last_day>T23:59:59Z"]
          - "YYYY-mm-dd" expands to ["YYYY-mm-ddT00:00:00Z", "YYYY-mm-ddT23:59:59Z"]
          - "YYYY-mm-ddTHH:MM:SSZ" remains as ["YYYY-mm-ddTHH:MM:SSZ", "YYYY-mm-ddTHH:MM:SSZ"]
          - A list of date-time ranges can be provided for multiple intervals.
          Example: "2020-01-01T00:00:00Z" or python_datetime.datetime(2020, 1, 1) or [["2020-01-01T00:00:00Z", "2021-01-01T00:00:00Z"]]
        - max_items (Optional[int]): The maximum number of items to return from the search, even if there are more matching results.
          Example: 100
        - return_dataframe (bool): If True, return the result as a Spark DataFrame instead of an iterator of PyStacItem objects.
          Example: True
        Returns:
        - Union[Iterator[PyStacItem], DataFrame]: An iterator of PyStacItem objects or a Spark DataFrame that match the specified filters.
        """
        client = self.get_collection(collection_id)
        if return_dataframe:
            return client.get_dataframe(
                *ids, bbox=bbox, datetime=datetime, max_items=max_items
            )
        else:
            return client.get_items(
                *ids, bbox=bbox, datetime=datetime, max_items=max_items
            )