Allow disable cache #234

Merged 65 commits on Jul 1, 2024
cb8e025
Merge pull request #219 from Quantco/docs_improvements
MariusMerkleQC Apr 3, 2024
246bb47
Edit. (#220)
kklein Apr 3, 2024
723c46d
Pre-commit autoupdate (#221)
quant-ranger[bot] May 6, 2024
17b8ab2
Bump google-github-actions/auth from 2.1.2 to 2.1.3 (#222)
dependabot[bot] May 21, 2024
efc54e3
Bump mamba-org/setup-micromamba from 1.8.1 to 1.9.0 (#223)
dependabot[bot] May 27, 2024
9ea8866
adding configuration options to uniques functionality
SimonLangerQC Jun 3, 2024
92f2933
improve docstrings
SimonLangerQC Jun 4, 2024
63e9634
move util_ functions to datajudge.utils
SimonLangerQC Jun 4, 2024
3782955
updates following comments
SimonLangerQC Jun 5, 2024
6cad13e
add configuration options to functional dependency checks, and utilit…
SimonLangerQC Jun 5, 2024
7dec26e
fix typo in run_integration_tests_postgres.sh
SimonLangerQC Jun 5, 2024
6322310
rename to output_processor
SimonLangerQC Jun 5, 2024
308ff99
output_processor only
SimonLangerQC Jun 5, 2024
c1fec1a
allow for single output processor
SimonLangerQC Jun 5, 2024
91ea51c
add output_processor_limit
SimonLangerQC Jun 6, 2024
0f94589
Docs update
SimonLangerQC Jun 7, 2024
693b29b
Docs update
SimonLangerQC Jun 7, 2024
5c0c03b
Docs update
SimonLangerQC Jun 7, 2024
52f993d
Docs update
SimonLangerQC Jun 7, 2024
887b0e6
Docs update
SimonLangerQC Jun 7, 2024
0699ddb
Docs update
SimonLangerQC Jun 7, 2024
3ef980e
Docs update
SimonLangerQC Jun 7, 2024
13d866f
Docs update
SimonLangerQC Jun 7, 2024
582af61
Docs update
SimonLangerQC Jun 7, 2024
a522268
Docs update
SimonLangerQC Jun 7, 2024
0c87d34
Docs update
SimonLangerQC Jun 7, 2024
cdd6e1f
Docs update
SimonLangerQC Jun 7, 2024
658c8ac
Docs update
SimonLangerQC Jun 7, 2024
b5c1a1f
Docs update
SimonLangerQC Jun 7, 2024
f40c5c0
Docs update
SimonLangerQC Jun 7, 2024
151d53b
Docs update
SimonLangerQC Jun 7, 2024
2f99478
Docs update
SimonLangerQC Jun 7, 2024
ea326ad
Docs update
SimonLangerQC Jun 7, 2024
c712205
Docs update
SimonLangerQC Jun 7, 2024
9eb3433
update doc string on null columns everywhere and fix typo
SimonLangerQC Jun 7, 2024
e6c396a
Update docs
SimonLangerQC Jun 7, 2024
3ca2003
Update docs
SimonLangerQC Jun 7, 2024
4ddda10
Update docs
SimonLangerQC Jun 7, 2024
0502720
docs updates
SimonLangerQC Jun 7, 2024
cf42e38
update docs
SimonLangerQC Jun 7, 2024
409b611
filternull docs clarification
SimonLangerQC Jun 7, 2024
536096a
replace assert by raise ValueError
SimonLangerQC Jun 7, 2024
0067b84
shorten name to apply_output_formatting
SimonLangerQC Jun 10, 2024
143f0f9
add unit tests for new utils functions
SimonLangerQC Jun 10, 2024
b8842a7
set default to limit 100 elements
SimonLangerQC Jun 10, 2024
0041e99
ensure all relevant tests run for impala and ensure they pass
SimonLangerQC Jun 11, 2024
cb63bef
disable extralong test for bigquery due to slow speed
SimonLangerQC Jun 11, 2024
62f6877
capitalization test handle parallel if table already created
SimonLangerQC Jun 11, 2024
72ffd10
Merge pull request #224 from Quantco/uniques_improvements
SimonLangerQC Jun 12, 2024
6f433e2
add lru_cache_maxsize parameter to each constraint
SimonLangerQC Jun 27, 2024
fbea40e
add optional wrappers
SimonLangerQC Jun 27, 2024
a310c51
Merge branch 'main' into allow-disable-cache
SimonLangerQC Jun 27, 2024
83c72ac
small fixes
SimonLangerQC Jun 28, 2024
4cbf913
update docs
SimonLangerQC Jun 28, 2024
38f73cc
remove further merge issues
SimonLangerQC Jun 28, 2024
f14a54b
rename to cache_size and add memray dependencies to pixi.toml
SimonLangerQC Jun 28, 2024
c5e51d1
memray only for test envs
SimonLangerQC Jun 28, 2024
51b2752
Update src/datajudge/constraints/base.py
SimonLangerQC Jun 28, 2024
bd6c577
Update src/datajudge/constraints/base.py
SimonLangerQC Jun 28, 2024
72da33c
Update tests/integration/test_integration.py
SimonLangerQC Jun 28, 2024
ab0946b
make memory testcase easier to understand
SimonLangerQC Jun 28, 2024
d097a4d
add query collector to memory test case
SimonLangerQC Jun 28, 2024
6e91b19
Update src/datajudge/constraints/row.py
SimonLangerQC Jun 28, 2024
a5c8d81
Update src/datajudge/constraints/row.py
SimonLangerQC Jun 28, 2024
eb6fb9a
add comment to output processor limit unittests
SimonLangerQC Jun 28, 2024
4 changes: 2 additions & 2 deletions .github/actions/pytest/action.yml
@@ -17,9 +17,9 @@ runs:
       run: |
         flit install -s
         if [[ "${{ inputs.backend }}" != "" ]]; then
-          pytest --cov=datajudge --cov-report=xml --cov-append --backend=${{ inputs.backend }} ${{ inputs.args }}
+          pytest --verbose --cov=datajudge --cov-report=xml --cov-append --backend=${{ inputs.backend }} ${{ inputs.args }}
         else
-          pytest --cov=datajudge --cov-report=xml --cov-append ${{ inputs.args }}
+          pytest --verbose --cov=datajudge --cov-report=xml --cov-append ${{ inputs.args }}
         fi
     - name: Generate code coverage report
       uses: codecov/codecov-action@v3.1.3
1,189 changes: 1,189 additions & 0 deletions pixi.lock

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions pixi.toml
@@ -88,6 +88,10 @@ sqlalchemy = "2.*"
 pytest-cov = "*"
 pytest-xdist = "*"
 
+[feature.test.target.unix.dependencies]
+pytest-memray = "*"
+memray = "*"
+
 [feature.mypy.dependencies]
 mypy = "*"
 types-setuptools = "*"
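The `memray` and `pytest-memray` test dependencies added above support the memory test case mentioned in the commit list. As a rough, stdlib-only illustration of why an unbounded cache matters for memory, the sketch below uses `tracemalloc` as a stand-in (it is not memray, and it is not the project's test). `make_payload` and `peak_memory` are hypothetical names introduced for this example.

```python
import tracemalloc
from functools import lru_cache


def make_payload(key: int) -> bytes:
    # Hypothetical stand-in for a large query result (about 100 kB each).
    return bytes(100_000)


def peak_memory(cache_size, n_calls: int = 50) -> int:
    """Return the peak traced memory in bytes while calling a cached function."""
    cached = lru_cache(maxsize=cache_size)(make_payload)
    tracemalloc.start()
    for key in range(n_calls):
        cached(key)  # the return value is discarded by the caller
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    cached.cache_clear()
    return peak


# maxsize=None keeps every payload alive inside the cache.
unbounded_peak = peak_memory(cache_size=None)
# maxsize=0 disables caching, so each payload can be freed right away.
disabled_peak = peak_memory(cache_size=0)
```

Under these assumptions the unbounded run holds all 50 payloads at once, while the disabled run only ever holds about one.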
19 changes: 17 additions & 2 deletions src/datajudge/constraints/base.py
@@ -111,6 +111,10 @@ class Constraint(abc.ABC):
     In order to obtain such values, the `retrieve` method defines a mapping from DataReference,
     be it the DataReference of primary interest, `ref`, or a baseline DataReference, `ref2`, to
     value. If `ref_value` is already provided, usually no further mapping needs to be taken care of.
+
+    By default, retrieved arguments are cached indefinitely ``@lru_cache(maxsize=None)``.
+    This can be controlled by setting the `cache_size` argument to a different value.
+    ``0`` disables caching.
     """
 
     def __init__(
@@ -123,6 +127,7 @@ def __init__(
         output_processors: Optional[
             Union[OutputProcessor, List[OutputProcessor]]
         ] = output_processor_limit,
+        cache_size=None,
     ):
         self._check_if_valid_between_or_within(ref2, ref_value)
         self.ref = ref
@@ -140,6 +145,16 @@ def __init__(
             output_processors = [output_processors]
         self.output_processors = output_processors
 
+        self.cache_size = cache_size
+        self._setup_caching()
+
+    def _setup_caching(self):
+        # this has an added benefit of allowing the class to be garbage collected
+        # according to https://rednafi.com/python/lru_cache_on_methods/
+        # and https://docs.astral.sh/ruff/rules/cached-instance-method/
+        self.get_factual_value = lru_cache(self.cache_size)(self.get_factual_value)  # type: ignore[method-assign]
+        self.get_target_value = lru_cache(self.cache_size)(self.get_target_value)  # type: ignore[method-assign]
+
     def _check_if_valid_between_or_within(
         self, ref2: Optional[DataReference], ref_value: Optional[Any]
     ):
@@ -156,13 +171,13 @@ def _check_if_valid_between_or_within(
                 f"{class_name}. Use exactly either of them."
             )
 
-    @lru_cache(maxsize=None)
+    # @lru_cache(maxsize=None), see _setup_caching()
     def get_factual_value(self, engine: sa.engine.Engine) -> Any:
         factual_value, factual_selections = self.retrieve(engine, self.ref)
         self.factual_selections = factual_selections
         return factual_value
 
-    @lru_cache(maxsize=None)
+    # @lru_cache(maxsize=None), see _setup_caching()
     def get_target_value(self, engine: sa.engine.Engine) -> Any:
         if self.ref2 is None:
             return self.ref_value
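The change above replaces the class-level `@lru_cache` decorators with per-instance wrapping in `_setup_caching()`. The following is a minimal sketch of that idea, not the datajudge implementation: `ClassLevelCache`, `PerInstanceCache`, and `is_collected_after_use` are hypothetical names, and the string return value stands in for a database query. It shows that `cache_size=0` disables caching and that a per-instance cache does not keep the instance alive.

```python
import gc
import weakref
from functools import lru_cache


class ClassLevelCache:
    """Hypothetical example: one cache shared by all instances, keyed on `self`."""

    @lru_cache(maxsize=None)
    def get_value(self, engine):
        return f"value from {engine}"


class PerInstanceCache:
    """Hypothetical, simplified stand-in for the per-instance wrapping above."""

    def __init__(self, cache_size=None):
        self.n_retrievals = 0
        # Wrap the bound method; the cache now belongs to this instance.
        self.get_value = lru_cache(cache_size)(self.get_value)

    def get_value(self, engine):
        self.n_retrievals += 1  # counts how often the "query" actually runs
        return f"value from {engine}"


def is_collected_after_use(cls) -> bool:
    """Create an instance, use its cache, drop it, and check whether it was freed."""
    obj = cls()
    obj.get_value("engine")
    ref = weakref.ref(obj)
    del obj
    gc.collect()  # the per-instance wrapper forms a reference cycle
    return ref() is None


cached = PerInstanceCache()  # cache_size=None: unbounded cache
cached.get_value("engine")
cached.get_value("engine")  # served from the cache

uncached = PerInstanceCache(cache_size=0)  # cache_size=0: caching disabled
uncached.get_value("engine")
uncached.get_value("engine")  # computed again
```

With the class-level decorator, the shared cache holds a strong reference to every `self` it has seen, so instances stay alive until the cache is cleared. With per-instance wrapping, the cache is dropped together with its instance.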
17 changes: 14 additions & 3 deletions src/datajudge/constraints/column.py
@@ -22,9 +22,13 @@ def retrieve(
 
 class ColumnExistence(Column):
     def __init__(
-        self, ref: DataReference, columns: List[str], name: Optional[str] = None
+        self,
+        ref: DataReference,
+        columns: List[str],
+        name: Optional[str] = None,
+        cache_size=None,
     ):
-        super().__init__(ref, ref_value=columns, name=name)
+        super().__init__(ref, ref_value=columns, name=name, cache_size=cache_size)
 
     def compare(
         self, column_names_factual: List[str], column_names_target: List[str]
@@ -86,8 +90,15 @@ def __init__(
         ref2: Optional[DataReference] = None,
         column_type: Optional[Union[str, sa.types.TypeEngine]] = None,
         name: Optional[str] = None,
+        cache_size=None,
     ):
-        super().__init__(ref, ref2=ref2, ref_value=column_type, name=name)
+        super().__init__(
+            ref,
+            ref2=ref2,
+            ref_value=column_type,
+            name=name,
+            cache_size=cache_size,
+        )
         self.column_type = column_type
 
     def retrieve(
21 changes: 18 additions & 3 deletions src/datajudge/constraints/date.py
@@ -39,6 +39,7 @@ def __init__(
         use_lower_bound_reference: bool,
         column_type: str,
         name: Optional[str] = None,
+        cache_size=None,
         *,
         ref2: Optional[DataReference] = None,
         min_value: Optional[str] = None,
@@ -48,7 +49,13 @@ def __init__(
         min_date: Optional[dt.date] = None
         if min_value is not None:
             min_date = dt.datetime.strptime(min_value, INPUT_DATE_FORMAT).date()
-        super().__init__(ref, ref2=ref2, ref_value=min_date, name=name)
+        super().__init__(
+            ref,
+            ref2=ref2,
+            ref_value=min_date,
+            name=name,
+            cache_size=cache_size,
+        )
 
     def retrieve(
         self, engine: sa.engine.Engine, ref: DataReference
@@ -85,6 +92,7 @@ def __init__(
         use_upper_bound_reference: bool,
         column_type: str,
         name: Optional[str] = None,
+        cache_size=None,
         *,
         ref2: Optional[DataReference] = None,
         max_value: Optional[str] = None,
@@ -94,7 +102,13 @@ def __init__(
         max_date: Optional[dt.date] = None
         if max_value is not None:
             max_date = dt.datetime.strptime(max_value, INPUT_DATE_FORMAT).date()
-        super().__init__(ref, ref2=ref2, ref_value=max_date, name=name)
+        super().__init__(
+            ref,
+            ref2=ref2,
+            ref_value=max_date,
+            name=name,
+            cache_size=cache_size,
+        )
 
     def retrieve(
         self, engine: sa.engine.Engine, ref: DataReference
@@ -133,8 +147,9 @@ def __init__(
         lower_bound: str,
         upper_bound: str,
         name: Optional[str] = None,
+        cache_size=None,
     ):
-        super().__init__(ref, ref_value=min_fraction, name=name)
+        super().__init__(ref, ref_value=min_fraction, name=name, cache_size=cache_size)
         self.lower_bound = lower_bound
         self.upper_bound = upper_bound
 
1 change: 1 addition & 0 deletions src/datajudge/constraints/groupby.py
@@ -14,6 +14,7 @@ def __init__(
         aggregation_column: str,
         start_value: int = 0,
         name: Optional[str] = None,
+        cache_size=None,
         *,
         tolerance: float = 0,
         ref2: Optional[DataReference] = None,
5 changes: 5 additions & 0 deletions src/datajudge/constraints/interval.py
@@ -19,6 +19,7 @@ def __init__(
         end_columns: List[str],
         max_relative_n_violations: float,
         name: Optional[str] = None,
+        cache_size=None,
     ):
         super().__init__(ref, ref_value=object(), name=name)
         self.key_columns = key_columns
@@ -74,6 +75,7 @@ def __init__(
         max_relative_n_violations: float,
         end_included: bool,
         name: Optional[str] = None,
+        cache_size=None,
     ):
         self.end_included = end_included
         super().__init__(
@@ -83,6 +85,7 @@ def __init__(
             end_columns,
             max_relative_n_violations,
             name=name,
+            cache_size=cache_size,
         )
 
     def select(self, engine: sa.engine.Engine, ref: DataReference):
@@ -113,6 +116,7 @@ def __init__(
         max_relative_n_violations: float,
         legitimate_gap_size: float,
         name: Optional[str] = None,
+        cache_size=None,
     ):
         self.legitimate_gap_size = legitimate_gap_size
         super().__init__(
@@ -122,6 +126,7 @@ def __init__(
             end_columns,
             max_relative_n_violations,
             name=name,
+            cache_size=cache_size,
         )
 
     @abc.abstractmethod
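Most constructors in this PR accept `cache_size` and pass it on to `super().__init__`. The sketch below, using hypothetical classes `Base`, `Forwarding`, and `NotForwarding`, illustrates why that forwarding matters in a multi-level hierarchy such as the interval constraints: an intermediate class that accepts the argument without passing it on silently falls back to the base default. In the first hunk above, the visible `super().__init__` call is unchanged; whether the value reaches the base class some other way is not visible in this diff.

```python
from functools import lru_cache


class Base:
    """Hypothetical base class that owns the cache configuration."""

    def __init__(self, name=None, cache_size=None):
        self.name = name
        self.cache_size = cache_size
        self.n_calls = 0
        self.compute = lru_cache(cache_size)(self.compute)

    def compute(self, key):
        self.n_calls += 1  # counts uncached computations
        return key


class Forwarding(Base):
    def __init__(self, name=None, cache_size=None):
        # The setting reaches Base and takes effect.
        super().__init__(name=name, cache_size=cache_size)


class NotForwarding(Base):
    def __init__(self, name=None, cache_size=None):
        # cache_size is accepted but dropped, so Base uses its default (None).
        super().__init__(name=name)


forwarding = Forwarding(cache_size=0)
forwarding.compute("a")
forwarding.compute("a")  # recomputed, caching is disabled

not_forwarding = NotForwarding(cache_size=0)
not_forwarding.compute("a")
not_forwarding.compute("a")  # served from the unbounded default cache
```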
20 changes: 17 additions & 3 deletions src/datajudge/constraints/miscs.py
@@ -9,7 +9,13 @@
 
 
 class PrimaryKeyDefinition(Constraint):
-    def __init__(self, ref, primary_keys: List[str], name: Optional[str] = None):
+    def __init__(
+        self,
+        ref,
+        primary_keys: List[str],
+        name: Optional[str] = None,
+        cache_size=None,
+    ):
         super().__init__(ref, ref_value=set(primary_keys), name=name)
 
     def retrieve(
@@ -56,6 +62,7 @@ def __init__(
         max_absolute_n_duplicates: int = 0,
         infer_pk_columns: bool = False,
         name: Optional[str] = None,
+        cache_size=None,
     ):
         if max_duplicate_fraction != 0 and max_absolute_n_duplicates != 0:
             raise ValueError(
@@ -71,7 +78,7 @@ def __init__(
             ref_value = ("relative", 0)
 
         self.infer_pk_columns = infer_pk_columns
-        super().__init__(ref, ref_value=ref_value, name=name)
+        super().__init__(ref, ref_value=ref_value, name=name, cache_size=cache_size)
 
     def test(self, engine: sa.engine.Engine) -> TestResult:
         if self.infer_pk_columns and db_access.is_bigquery(engine):
@@ -152,8 +159,15 @@ def __init__(
         max_null_fraction: Optional[float] = None,
         max_relative_deviation: float = 0,
         name: Optional[str] = None,
+        cache_size=None,
     ):
-        super().__init__(ref, ref2=ref2, ref_value=max_null_fraction, name=name)
+        super().__init__(
+            ref,
+            ref2=ref2,
+            ref_value=max_null_fraction,
+            name=name,
+            cache_size=cache_size,
+        )
         if max_null_fraction is not None and not (0 <= max_null_fraction <= 1):
             raise ValueError(
                 f"max_null_fraction was expected to lie within [0, 1] but is "
18 changes: 14 additions & 4 deletions src/datajudge/constraints/nrows.py
@@ -17,8 +17,15 @@ def __init__(
         ref2: Optional[DataReference] = None,
         n_rows: Optional[int] = None,
         name: Optional[str] = None,
+        cache_size=None,
     ):
-        super().__init__(ref, ref2=ref2, ref_value=n_rows, name=name)
+        super().__init__(
+            ref,
+            ref2=ref2,
+            ref_value=n_rows,
+            name=name,
+            cache_size=cache_size,
+        )
 
     def retrieve(
         self, engine: sa.engine.Engine, ref: DataReference
@@ -85,8 +92,9 @@ def __init__(
         ref2: DataReference,
         max_relative_loss_getter: ToleranceGetter,
         name: Optional[str] = None,
+        cache_size=None,
     ):
-        super().__init__(ref, ref2=ref2, name=name)
+        super().__init__(ref, ref2=ref2, name=name, cache_size=cache_size)
         self.max_relative_loss_getter = max_relative_loss_getter
 
     def compare(self, n_rows_factual: int, n_rows_target: int) -> Tuple[bool, str]:
@@ -116,8 +124,9 @@ def __init__(
         ref2: DataReference,
         max_relative_gain_getter: ToleranceGetter,
         name: Optional[str] = None,
+        cache_size=None,
     ):
-        super().__init__(ref, ref2=ref2, name=name)
+        super().__init__(ref, ref2=ref2, name=name, cache_size=cache_size)
         self.max_relative_gain_getter = max_relative_gain_getter
 
     def compare(self, n_rows_factual: int, n_rows_target: int) -> Tuple[bool, str]:
@@ -147,8 +156,9 @@ def __init__(
         ref2: DataReference,
         min_relative_gain_getter: ToleranceGetter,
         name: Optional[str] = None,
+        cache_size=None,
     ):
-        super().__init__(ref, ref2=ref2, name=name)
+        super().__init__(ref, ref2=ref2, name=name, cache_size=cache_size)
         self.min_relative_gain_getter = min_relative_gain_getter
 
     def compare(self, n_rows_factual: int, n_rows_target: int) -> Tuple[bool, str]:
18 changes: 16 additions & 2 deletions src/datajudge/constraints/numeric.py
@@ -13,11 +13,18 @@ def __init__(
         self,
         ref: DataReference,
         name: Optional[str] = None,
+        cache_size=None,
         *,
         ref2: Optional[DataReference] = None,
         min_value: Optional[float] = None,
     ):
-        super().__init__(ref, ref2=ref2, ref_value=min_value, name=name)
+        super().__init__(
+            ref,
+            ref2=ref2,
+            ref_value=min_value,
+            name=name,
+            cache_size=cache_size,
+        )
 
     def retrieve(
         self, engine: sa.engine.Engine, ref: DataReference
@@ -46,6 +53,7 @@ def __init__(
         self,
         ref: DataReference,
         name: Optional[str] = None,
+        cache_size=None,
         *,
         ref2: Optional[DataReference] = None,
         max_value: Optional[float] = None,
@@ -55,6 +63,7 @@ def __init__(
             ref2=ref2,
             ref_value=max_value,
             name=name,
+            cache_size=cache_size,
         )
 
     def retrieve(
@@ -87,8 +96,9 @@ def __init__(
         lower_bound: float,
         upper_bound: float,
         name: Optional[str] = None,
+        cache_size=None,
     ):
-        super().__init__(ref, ref_value=min_fraction, name=name)
+        super().__init__(ref, ref_value=min_fraction, name=name, cache_size=cache_size)
         self.lower_bound = lower_bound
         self.upper_bound = upper_bound
 
@@ -123,6 +133,7 @@ def __init__(
         ref: DataReference,
         max_absolute_deviation: float,
         name: Optional[str] = None,
+        cache_size=None,
         *,
         ref2: Optional[DataReference] = None,
         mean_value: Optional[float] = None,
@@ -132,6 +143,7 @@ def __init__(
             ref2=ref2,
             ref_value=mean_value,
             name=name,
+            cache_size=cache_size,
         )
         self.max_absolute_deviation = max_absolute_deviation
 
@@ -169,6 +181,7 @@ def __init__(
         max_absolute_deviation: Optional[float] = None,
         max_relative_deviation: Optional[float] = None,
         name: Optional[str] = None,
+        cache_size=None,
         *,
         ref2: Optional[DataReference] = None,
         expected_percentile: Optional[float] = None,
@@ -178,6 +191,7 @@ def __init__(
             ref2=ref2,
             ref_value=expected_percentile,
             name=name,
+            cache_size=cache_size,
         )
         if not (0 <= percentage <= 100):
             raise ValueError(
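In several of the signatures above, `cache_size` is inserted before the bare `*`, so it can be passed positionally or by keyword, while the parameters after `*` remain keyword-only. The sketch below checks this with `inspect.signature` on a hypothetical function, `init_like`, that only mirrors the parameter order of those hunks; it is not one of the datajudge constructors.

```python
import inspect
from typing import Optional


def init_like(
    ref,
    name: Optional[str] = None,
    cache_size=None,
    *,
    ref2=None,
    min_value: Optional[float] = None,
):
    """Hypothetical function mirroring the parameter order in the hunks above."""
    return {
        "name": name,
        "cache_size": cache_size,
        "ref2": ref2,
        "min_value": min_value,
    }


# Map each parameter name to its kind, e.g. POSITIONAL_OR_KEYWORD or KEYWORD_ONLY.
kinds = {
    param.name: param.kind.name
    for param in inspect.signature(init_like).parameters.values()
}

# cache_size may be given positionally (third argument) or by keyword.
positional_call = init_like("ref", "my_name", 0)
keyword_call = init_like("ref", cache_size=0, min_value=1.0)
```

Passing `cache_size` by keyword, as in the second call, is the less fragile choice, since it does not depend on the parameter's position.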