v0.2
Zeutschler committed Sep 22, 2024
1 parent 0f2a189 commit fcb3103
Showing 21 changed files with 450 additions and 250 deletions.
152 changes: 36 additions & 116 deletions README.md
@@ -1,133 +1,53 @@
# DateSpanLib
![GitHub license](https://img.shields.io/github/license/Zeutschler/datespanlib?color=A1C547)
![PyPI version](https://img.shields.io/pypi/v/datespanlib?logo=pypi&logoColor=979DA4&color=A1C547)
![Python versions](https://img.shields.io/badge/dynamic/toml?url=https%3A%2F%2Fraw.githubusercontent.com%2FZeutschler%2Fdatespanlib%2Fmaster%2Fpyproject.toml&query=%24%5B'project'%5D%5B'requires-python'%5D&color=A1C547)
![PyPI Downloads](https://img.shields.io/pypi/dm/datespanlib.svg?logo=pypi&logoColor=979DA4&label=PyPI%20downloads&color=A1C547)
![GitHub last commit](https://img.shields.io/github/last-commit/Zeutschler/datespanlib?logo=github&logoColor=979DA4&color=A1C547)
![unit tests](https://img.shields.io/github/actions/workflow/status/zeutschler/datespanlib/python-package.yml?logo=GitHub&logoColor=979DA4&label=unit%20tests&color=A1C547)
![build](https://img.shields.io/github/actions/workflow/status/zeutschler/datespanlib/python-package.yml?logo=GitHub&logoColor=979DA4&color=A1C547)
![documentation](https://img.shields.io/github/actions/workflow/status/zeutschler/datespanlib/static-site-upload.yml?logo=GitHub&logoColor=979DA4&label=docs&color=A1C547&link=https%3A%2F%2Fzeutschler.github.io%2Fcubedpandas%2F)
![codecov](https://codecov.io/github/Zeutschler/datespanlib/graph/badge.svg?token=B12O0B6F10)
# datespan - convenient date span parsing & handling

**UNDER CONSTRUCTION** - The DateSpanLib library is under active development and in a pre-alpha state, not
suitable for production use or even testing. The library is expected to be released in a first alpha version
in the coming weeks.
![GitHub license](https://img.shields.io/github/license/Zeutschler/datespan?color=A1C547)
![PyPI version](https://img.shields.io/pypi/v/datespan?logo=pypi&logoColor=979DA4&color=A1C547)
![PyPI Downloads](https://img.shields.io/pypi/dm/datespan.svg?logo=pypi&logoColor=979DA4&label=PyPI%20downloads&color=A1C547)
![GitHub last commit](https://img.shields.io/github/last-commit/Zeutschler/datespan?logo=github&logoColor=979DA4&color=A1C547)
![unit tests](https://img.shields.io/github/actions/workflow/status/zeutschler/datespan/python-package.yml?logo=GitHub&logoColor=979DA4&label=unit%20tests&color=A1C547)
![build](https://img.shields.io/github/actions/workflow/status/zeutschler/datespan/python-package.yml?logo=GitHub&logoColor=979DA4&color=A1C547)
![codecov](https://codecov.io/github/Zeutschler/datespan/graph/badge.svg?token=B12O0B6F10)

-----------------
A Python library for handling and using date and time spans.
A Python package for convenient **date span** parsing and handling.
Aimed at data analysis and processing, useful in any context requiring date & time spans.

```python
from datespanlib import DateSpan

ds = DateSpan("January to March 2024")
print("2024-04-15" in ds + "1 month") # returns True
```

The DateSpanLib library is designed to be used for data analysis and data processing,
where date and time spans are often used to filter, aggregate or join data. But it
should also be valuable in any other context where date and time spans are used.

It provides dependency free integrations with Pandas, Numpy, Spark and others, can
generate Python code artefacts, either as source text or as precompiled (lambda)
functions and can also generate SQL fragments for filtering in SQL WHERE clauses.
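
As a rough illustration of what generating an SQL fragment from date spans can look like, here is a minimal sketch using only the standard library. The helper `spans_to_sql` and the `BETWEEN ... OR ...` output format are assumptions made for this example; the actual output of the library's `to_sql` is not shown here and may differ.

```python
from datetime import datetime


def spans_to_sql(column, spans):
    """Build a WHERE-clause fragment from (start, end) datetime pairs.

    Hypothetical helper for illustration only, not part of the library.
    The column name is inserted as-is, so it must come from trusted code.
    """
    parts = [
        f"({column} BETWEEN '{start.isoformat(sep=' ')}' AND '{end.isoformat(sep=' ')}')"
        for start, end in spans
    ]
    # An empty set of spans should match nothing.
    return " OR ".join(parts) if parts else "1 = 0"


january = (datetime(2024, 1, 1), datetime(2024, 1, 31, 23, 59, 59, 999999))
print(f"SELECT * FROM sales WHERE {spans_to_sql('date', [january])}")
```

Each span becomes one inclusive range test, and several non-consecutive spans are combined with `OR`.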

#### Background
The DateSpanLib library has been carved out from the
[CubedPandas](https://github.com/Zeutschler/cubedpandas) project - a library for
intuitive data analysis with Pandas dataframes - as it serves a broader purpose and
can be used independently of CubedPandas.

For internal DateTime parsing and manipulation,
the great [dateutil](https://github.com/dateutil/dateutil) library is used. The
DateSpanLib library has no other dependencies (like Pandas, Numpy, Spark etc.),
so it is lightweight and easy to install.

## Installation
The library can be installed via pip or is available as a download on [PyPi.org](https://pypi.org/datespanlib/).
```bash
pip install datespanlib
pip install datespan
```

## Usage

The library provides the following methods and classes:

### Method parse()
The `parse` method converts an arbitrary string into a `DateSpanSet` object. The string can be a simple date
like '2021-01-01' or a complex date span expression like 'Mondays to Wednesday last month'.

### Class DateSpan
`DateSpan` objects represent a single span of time, typically represented by a `start` and `end` datetime.
The `DateSpan` object provides methods to compare, merge, split, shift, expand, intersect etc. with other
`DateSpan` or Python datetime objects.

`DateSpan` objects are 'expansive' in the sense that they resolve the widest possible time span
for the
, e.g. if a `DateSpan` object is created with a start date of '2021-01-01' and an end date of '2021-01-31',




### DateSpanSet - represents an ordered set of DateSpan objects
`DateSpanSet` is an ordered and redundancy free collection of `DateSpan` objects. If e.g. two `DateSpan`
objects in the set would overlap or are contiguous, they are merged into one `DateSpan` object. Aside
from set-related operations, the `DateSpanSet` comes with two special capabilities worth mentioning:

* A built-in **interpreter for arbitrary date, time and date span strings**, ranging from simple dates
like '2021-01-01' up to complex date span expressions like 'Mondays to Wednesday last month'.

* Provides methods and can create **artefacts and callables for data processing** with Python, SQL, Pandas,
Numpy, Spark and other compatible libraries.




## Basic Usage
```python
from datespanlib import parse, DateSpanSet, DateSpan

# Create a DateSpan object
jan = DateSpan(start='2024-01-01', end='2024-01-31')
feb = DateSpan("February 2024")

jan_feb = DateSpanSet([jan, feb]) # Create a DateSpanSet object
assert(len(jan_feb) == 1) # returns 1, as the consecutive or overlapping DateSpan objects get merged.

assert (jan_feb == parse("January, February 2024")) # Compare DateSpan objects

# Set operations
jan_feb_mar = jan_feb + "1 month"
assert(jan_feb_mar == parse("first 3 month of 2024"))
jan_mar = jan_feb_mar - "February 2024"
assert len(jan_mar) == 2 # the single DateSpan gets split into two DateSpans.
assert(jan_mar.contains("2024-01-15"))

# Use DateSpanSet to filter Pandas DataFrame
import pandas as pd
from datespan import parse, DateSpan
df = pd.DataFrame({"date": pd.date_range("2024-01-01", "2024-12-31")})
result = df[df["date"].apply(jan_mar.contains)] # don't use this, slow
result = jan_mar.filter(df, "date") # fast vectorized operation

# Use DateSpanSet to filter Spark DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame({"date": pd.date_range("2024-01-01", "2024-12-31")}))
result = jan_mar.filter(df, "date") # fast vectorized/distributed operation

# Use DateSpanSet to filter Numpy array
import numpy as np
arr = np.arange(np.datetime64("2024-01-01"), np.datetime64("2024-12-31"))
result = jan_mar.filter(arr) # fast vectorized operation
dss = parse("April 2024 ytd") # Create a DateSpanSet object
dss.add("May") # Add a full month of the current year (e.g. 2024 in 2024)
dss.add("today") # Add the current day from 00:00:00 to 23:59:59
dss += "previous week" # Add a full week from Monday 00:00:00 to Sunday 23:59
dss -= "January" # Remove the full month of January 2024

# Use DateSpanSet to create an SQL WHERE statement
sql = f"SELECT * FROM table WHERE {jan_mar.to_sql('date')}"
print(len(dss)) # returns the number of nonconsecutive DateSpans
print(dss.to_sql("date")) # returns an SQL WHERE clause fragment
print(dss.filter(df, "date")) # vectorized filtering of column 'date', returns the filtered DataFrame
```

### Classes
`DateSpan` represents a single date or time span, defined by a start and an end datetime.
Provides methods to create, compare, merge, parse, split, shift, expand & intersect
`DateSpan` objects and/or `datetime`, `date` or `time` objects.
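
To make the comparison and intersection semantics concrete, here is a minimal sketch on plain `(start, end)` tuples. The `Span` alias and the `overlaps` / `intersect` helpers are stand-ins invented for this example and are not the library's API; the overlap test uses the same `max(start) <= min(end)` idea as `overlaps_with` in this commit.

```python
from __future__ import annotations
from datetime import datetime

Span = tuple[datetime, datetime]  # hypothetical stand-in for a DateSpan: (start, end)


def overlaps(a: Span, b: Span) -> bool:
    """Closed spans overlap when the later start is not after the earlier end."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def intersect(a: Span, b: Span) -> Span | None:
    """Return the part shared by both spans, or None if they do not overlap."""
    if not overlaps(a, b):
        return None
    return max(a[0], b[0]), min(a[1], b[1])


jan = (datetime(2024, 1, 1), datetime(2024, 1, 31, 23, 59, 59, 999999))
mid_jan_to_feb = (datetime(2024, 1, 15), datetime(2024, 2, 15))
mar = (datetime(2024, 3, 1), datetime(2024, 3, 31, 23, 59, 59, 999999))

print(intersect(jan, mid_jan_to_feb))  # the second half of January
print(intersect(jan, mar))             # no shared time
```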

`DateSpanSet` represents an ordered and redundancy free collection of `DateSpan` objects,
where consecutive or overlapping `DateSpan` objects get automatically merged into a single `DateSpan`
object. Required for fragmented date span expressions like `every 2nd Friday of next month`.
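
The automatic merging described above can be sketched with the standard library alone. The function `merge_spans` is a hypothetical helper operating on `(start, end)` tuples, not the library's implementation; the 0.1 second tolerance mirrors the `TIME_EPSILON_MICROSECONDS` constant in `date_span.py`, and how the library applies that constant internally is an assumption here.

```python
from datetime import datetime, timedelta


def merge_spans(spans, epsilon=timedelta(microseconds=100_000)):
    """Sort spans by start and merge those that overlap or lie at most `epsilon` apart."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= epsilon:
            # Overlapping or consecutive: extend the previous span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


jan = (datetime(2024, 1, 1), datetime(2024, 1, 31, 23, 59, 59, 999999))
feb = (datetime(2024, 2, 1), datetime(2024, 2, 29, 23, 59, 59, 999999))
apr = (datetime(2024, 4, 1), datetime(2024, 4, 30, 23, 59, 59, 999999))

# January and February are one microsecond apart and collapse into one span; April stays separate.
print(len(merge_spans([feb, jan, apr])))
```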

`DateSpanParser` provides parsing for arbitrary date, time and date span strings in English,
ranging from simple dates like '2021-01-01' up to complex date span expressions like
'Mondays to Wednesday last month'. For internal DateTime parsing and manipulation, the
[dateutil](https://github.com/dateutil/dateutil) library is used.
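
To show what resolving a relative expression involves, here is a minimal sketch that computes the span for 'last month' from a reference date using only the standard library. The function `last_month` is invented for this example and says nothing about how `DateSpanParser` works internally; the result matches the span given in the `parse` docstring for a reference date in February 2024.

```python
from datetime import datetime, timedelta


def last_month(today: datetime) -> tuple[datetime, datetime]:
    """Return the first and last moment of the month before `today`."""
    first_of_this_month = today.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # One microsecond before this month begins is the last moment of the previous month.
    end = first_of_this_month - timedelta(microseconds=1)
    start = end.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return start, end


print(last_month(datetime(2024, 2, 10)))
```

Stepping back from the first day of the current month avoids any special handling of month lengths or year boundaries.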






### Background
The 'datespan' package has been carved out from the
[CubedPandas](https://github.com/Zeutschler/cubedpandas) project - a library for
data analysis with Pandas dataframes - as datespan serves a broader purpose and
can be used independently of CubedPandas.
12 changes: 6 additions & 6 deletions datespanlib/__init__.py → datespan/__init__.py
@@ -1,13 +1,13 @@
# DateSpanLib - Copyright (c)2024, Thomas Zeutschler, MIT license
# datespan - Copyright (c)2024, Thomas Zeutschler, MIT license

from __future__ import annotations
from dateutil.parser import parserinfo

from datespanlib.date_span import DateSpan
from datespanlib.date_span_set import DateSpanSet
from datespan.date_span import DateSpan
from datespan.date_span_set import DateSpanSet

__author__ = "Thomas Zeutschler"
__version__ = "0.1.8"
__version__ = "0.2.0"
__license__ = "MIT"
VERSION = __version__

@@ -20,7 +20,7 @@
]


def parse(datespan_text: str, language: str | None = "en", parser_info: parserinfo | None = None) -> DateSpanSet:
def parse(datespan_text: str, parser_info: parserinfo | None = None) -> DateSpanSet:
"""
Creates a new DateSpanSet instance and parses the given text into a set of DateSpan objects.
@@ -37,4 +37,4 @@ def parse(datespan_text: str, language: str | None = "en", parser_info: parserin
>>> DateSpanSet.evaluate('last month') # if today would be in February 2024
DateSpanSet([DateSpan(datetime.datetime(2024, 1, 1, 0, 0), datetime.datetime(2024, 1, 31, 23, 59, 59, 999999))])
"""
return DateSpanSet(datespan_text, language, parser_info)
return DateSpanSet(definition=datespan_text, parser_info=parser_info)
127 changes: 92 additions & 35 deletions datespanlib/date_span.py → datespan/date_span.py
@@ -1,11 +1,10 @@
# DateSpanLib - Copyright (c)2024, Thomas Zeutschler, MIT license
# datespan - Copyright (c)2024, Thomas Zeutschler, MIT license

from __future__ import annotations
from datetime import datetime, time, timedelta
from dateutil.relativedelta import relativedelta
from dateutil.relativedelta import MO


class DateSpan:
"""
Represents a time span with a start and end date. The DateSpan can be used to compare, merge, intersect, subtract
@@ -14,18 +13,44 @@ class DateSpan:
The DateSpan is immutable, all methods that change the DateSpan will return a new DateSpan.
"""
TIME_EPSILON_MICROSECONDS = 100_000 # 0.1 seconds
"""The time epsilon in microseconds used for comparison of time deltas."""
MIN_YEAR = 1700
"""The time epsilon in microseconds used for detecting overlapping or consecutive date time spans."""
MIN_YEAR = datetime.min.year
"""The minimum year that can be represented by the DateSpan."""
MAX_YEAR = 2300
MAX_YEAR = datetime.max.year
"""The maximum year that can be represented by the DateSpan."""

def __init__(self, start: datetime | None = None, end: datetime | None = None, message: str | None = None):
self._start: datetime | None = start
self._end: datetime | None = end if end is not None else start
self._start, self._end = self._swap()
def __init__(self, start = None, end = None, message: str | None = None):
"""
Initializes a new DateSpan with the given start and end date. If only one date is given, the DateSpan will
represent a single point in time. If no date is given, the DateSpan will be undefined.
If `start` and `end` are datetime objects, the DateSpan will be initialized with these datetimes.
If `start` is larger than `end`, the dates will be automatically swapped.
If `start` and/or `end` contains arbitrary date span text, the text will be parsed into a DateSpan.
If both `start` and `end` contain text that refers/resolves to distinct date spans, then the resulting
DateSpan will start at the beginning of the first date span defined by `start` and end at the end of the
second date span defined by `end`.
Raises:
ValueError: If arguments of the DateSpan are invalid, the DateSpan could not be parsed or the
parsing of the DateSpan would result in more than one DateSpan. For such cases use the DateSpanSet
class to parse multipart date spans.
"""
self._arg_start = start
self._arg_end = end
self._message: str | None = message

if isinstance(start, datetime | None) and isinstance(end, datetime | None):
self._start: datetime | None = start
self._end: datetime | None = end if end is not None else start
self._start, self._end = self._swap()
else:
try:
self._start, self._end = self._parse(start, end)
except ValueError as e:
raise e

@property
def message(self) -> str | None:
"""Returns the message of the DateSpan."""
@@ -77,9 +102,7 @@ def overlaps_with(self, other: DateSpan) -> bool:
"""
if self.is_undefined or other.is_undefined:
return False
if self._start >= other._start:
return self._start <= other._end
return self._end >= other._start
return max(self._start, other._start) <= min(self._end, other._end)

def consecutive_with(self, other: DateSpan) -> bool:
"""
@@ -118,6 +141,17 @@ def merge(self, other: DateSpan) -> DateSpan:
return DateSpan(min(self._start, other._start), max(self._end, other._end))
raise ValueError("Cannot merge DateSpans that do not overlap or are not consecutive.")

def can_merge(self, other: DateSpan) -> bool:
"""
Returns True if the DateSpan can be merged with the given DateSpan.
"""
if self.is_undefined or other.is_undefined:
return True
return self.overlaps_with(other) or self.consecutive_with(other)




def intersect(self, other: DateSpan) -> DateSpan:
"""
Returns a new DateSpan that is the intersection of the DateSpan with the given DateSpan.
@@ -170,7 +204,7 @@ def subtract(self, other: DateSpan, allow_split: bool = False) -> DateSpan | (Da
return self.clone()

if other._start < self._start:
# overalap at the start
# overlap at the start
return DateSpan(other._end + timedelta(microseconds=1), self._end)
# overlap at the end
return DateSpan(self._start, other._start - timedelta(microseconds=1))
@@ -449,18 +483,6 @@ def is_full_day(self) -> bool:
return (self._start == self._begin_of_day(self._start) and
self._end == self._end_of_day(self._end))


def _swap(self) -> DateSpan:
"""Swap start and end date if start is greater than end."""
if self._start is None or self._end is None:
return self

if self._start > self._end:
tmp = self._start
self._start = self._end
self._end = tmp
return self

def replace(self, year: int | None = None, month: int | None = None, day: int | None = None,
hour: int | None = None,
minute: int | None = None, second: int | None = None, microsecond: int | None = None) -> DateSpan:
@@ -935,16 +957,9 @@ def __str__(self):
if self.is_undefined:
return "DateSpan(undefined)"

if self._start.microsecond != 0:
start = f"{self._start.strftime('%a %Y-%m-%d %H:%M:%S.%f')}"
else:
start = f"{self._start.strftime('%a %Y-%m-%d %H:%M:%S')}"

if self._end.microsecond != 0:
end = f"{self._end.strftime('%a %Y-%m-%d %H:%M:%S.%f')})"
else:
end = f"{self._end.strftime('%a %Y-%m-%d %H:%M:%S')})"
return (f"DateSpan({start} <-> {end})")
start = f"'{self._arg_start}'" if isinstance(self._arg_start, str) else str(self._arg_start)
end = f"'{self._arg_end}'" if isinstance(self._arg_end, str) else str(self._arg_end)
return f"DateSpan({start}, {end})" # -> ('start': {self._start}, 'end': {self._end})"

def __repr__(self):
return self.__str__()
@@ -1024,3 +1039,45 @@ def __le__(self, other):
def __hash__(self):
return hash((self._start, self._end))
# endregion

# region private methods
def _swap(self) -> DateSpan:
"""Swap start and end date if start is greater than end."""
if self._start is None or self._end is None:
return self

if self._start > self._end:
tmp = self._start
self._start = self._end
self._end = tmp
return self

def _parse(self, start, end = None) -> (datetime, datetime):
"""Parse a date span string."""
if end is None:
expected_spans = 1
text = start
else:
expected_spans = 2
text = f"{start}; {end}" # merge start and end into a single date span statement

self._message = None
try:
from datespan.parser.datespanparser import DateSpanParser # overcome circular import
date_span_parser: DateSpanParser = DateSpanParser(text)
expressions = date_span_parser.parse() # todo: inject self.parser_info
if len(expressions) != expected_spans:
raise ValueError(f"The date span expression '{text}' resolves to "
f"more than just a single date span. "
f"Use 'DateSpanSet('{text}')' to parse multi-part date spans.")
if expected_spans == 2:
start = expressions[0][0][0]
end = expressions[1][0][1]
else:
start = expressions[0][0][0]
end = expressions[0][0][1]

return start, end
except Exception as e:
self._message = str(e)
raise ValueError(str(e))