To get started, we will create a sample database using sqlite that contains a list of companies.
The table "companies_archive" contains three entries:
companies_archive
id | name | num_employees |
---|---|---|
1 | QuantCo | 90 |
2 | 140,000 | |
3 | BMW | 110,000 |
While "companies" contains an additional entry:
companies
id | name | num_employees |
---|---|---|
1 | QuantCo | 100 |
2 | 150,000 | |
3 | BMW | 120,000 |
4 | Apple | 145,000 |
import sqlalchemy as sa
eng = sa.create_engine('sqlite:///example.db')
with eng.connect() as con:
con.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, num_employees INTEGER)")
con.execute("INSERT INTO companies (name, num_employees) VALUES ('QuantCo', 100), ('Google', 150000), ('BMW', 120000), ('Apple', 145000)")
con.execute("CREATE TABLE companies_archive (id INTEGER PRIMARY KEY, name TEXT, num_employees INTEGER)")
con.execute("INSERT INTO companies_archive (name, num_employees) VALUES ('QuantCo', 90), ('Google', 140000), ('BMW', 110000)")
As an example, we will run 4 tests on this table:
- Does the table "companies" contain a column named "name"?
- Does the table "companies" contain at least 1 entry with the name "QuantCo"?
- Does the column "num_employees" of the "companies" table have all positive values?
- Does the column "name" of the table "companies" contain at least all the values of the corresponding column in "companies_archive"?
import pytest
import sqlalchemy as sa
from datajudge import (
Condition,
WithinRequirement,
BetweenRequirement,
)
from datajudge.pytest_integration import collect_data_tests
# We create a Requirement, within a table. This object will contain
# all the constraints we want to test on the specified table.
# To test another table or test the same table against another table,
# we would create another Requirement object.
companies_req = WithinRequirement.from_table(
db_name="example", schema_name=None, table_name="companies"
)
# Constraint 1: Does the table "companies" contain a column named "name"?
companies_req.add_column_existence_constraint(columns=["name"])
# Constraint 2: Does the table "companies" contain at least 1 entry with the name "QuantCo"?
condition = Condition(raw_string="name = 'QuantCo'")
companies_req.add_n_rows_min_constraint(n_rows_min=1, condition=condition)
# Constraint 3: Does the column "num_employees" of the "companies" table have all
# positive values?
companies_req.add_numeric_min_constraint(column="num_employees", min_value=1)
# We create a new Requirement, this time between different tables.
# Concretely, we intent to test constraints between the table "companies"
# and the table "companies_archive".
companies_between_req = BetweenRequirement.from_tables(
db_name1="example",
schema_name1=None,
table_name1="companies",
db_name2="example",
schema_name2=None,
table_name2="companies_archive",
)
# Constraint 4: Does the column "name" of the table "companies" contain at least all
# the values of the corresponding column in "companies_archive"?
companies_between_req.add_row_superset_constraint(
columns1=['name'], columns2=['name'], constant_max_missing_fraction=0
)
# collect_data_tests expects a pytest fixture with the name
# "datajudge_engine" that is a SQLAlchemy engine
@pytest.fixture()
def datajudge_engine():
return sa.create_engine("sqlite:///example.db")
# We gather our distinct Requirements in a list.
requirements = [companies_req, companies_between_req]
# "collect_data_tests" takes all requirements and turns their respective
# Constraints into individual tests. pytest will be able to pick
# up these tests.
test_constraint = collect_data_tests(requirements)
Saving this file as specification.py
and running $ pytest specification.py
will verify that all constraints are satisfied. The output you see in the terminal
should be similar to this:
=================================== test session starts ===================================
...
collected 4 items
specification.py::test_constraint[ColumnExistence::companies] PASSED [ 25%]
specification.py::test_constraint[NRowsMin::companies] PASSED [ 50%]
specification.py::test_constraint[NumericMin::companies] PASSED [ 75%]
specification.py::test_constraint[RowSuperset::companies|companies_archive] PASSED [100%]
==================================== 4 passed in 0.31s ====================================
You can also use a formatted html report using the --html=report.html
flag.