Update QC town close export script to prep for automation #626

jeancochrane · 2024-10-23T15:16:01Z

This PR makes a few quality-of-life improvements to the QC town close export script to enable automation via any distribution mechanism (VM, OneDrive, or S3). The changes include:

Define a new Township class that owns a few different configurations for each township:
- Township code and human-readable name
- Tri code and human-readable name
- Start/end dates for "active" periods, indicating periods when towns are eligible for QC
Support for a config file at scripts/utils/town_active_schedule.csv that can optionally configure the list of available Townships
Change the --township option so that it accepts one or more township codes
- When the caller provides multiple codes, the script will output one workbook for each report and workbook readers can filter the township_code column for the township they are interested in
- When the caller provides one code, the script will prepend the human-readable township name to the filename for each output workbook (e.g. "Res edit.xlsx" will become "Jefferson Res edit.xlsx"). This feature automates the renaming operation that we perform when exporting these reports manually
Change the --township option so that it is optional, and when omitted, falls back to all towns that are active per the town_active_schedule
Add an --output-dir option that controls the directory where the script will save reports

With these changes, we can follow these steps to define an automated process to export QC reports on a regular schedule:

Create a scripts/utils/town_active_schedule.csv config file on the machine that will run the process (or download it from S3 in an ephemeral environment like a GitHub workflow) with a defined activity schedule
Schedule a process to run the export_qc_town_close.py script at regular intervals
- Omit the --township argument so that the script will export all active towns
- Optionally set the --output-dir option on environments like the VM that need to write to specific locations but cannot chain an mv call to the scheduled export_qc_town_close.py call

… scripts

…n_close_reports

…s.py

dbt/scripts/export_qc_town_close_reports.py

jeancochrane · 2024-10-23T15:32:56Z

dbt/scripts/export_qc_town_close_reports.py

+CURRENT_YEAR = datetime.datetime.now().year
+CURRENT_DATE = datetime.datetime.now().date()


Defining module-level constants for "current time" is not usually a best-practice, but since these constants are only ever used by scripts that are manually run and short lived, it shouldn't be any different than loading the date/year dynamically in functions that need it.

dbt/README.md

dbt/scripts/utils/town_active_schedule.csv.example

dfsnow

I tested everything and it all works great. Think we could simplify the township instantiation a little bit but nothing major/blocking.

dbt/scripts/utils/town_active_schedule.csv.example

dbt/scripts/utils/townships.py

…s.py

…for-automation

jeancochrane · 2024-11-07T18:27:32Z

dbt/scripts/export_qc_town_close_reports.py

+                # Filter the query results for only this town, but only if the
+                # query results are not empty, since otherwise they will have
+                # no columns
+                town_df = (
+                    model.df
+                    if model.df.empty
+                    else model.df[
+                        model.df["township_code"] == town.township_code
+                    ]
+                )


There were two ways I thought about filtering results by township:

As part of the Athena query that pulls results, as the previous iteration of this PR did

After pulling data from Athena, as this code block does

Option 2 allows us to more easily export a model for each town, since filtering in Pandas allows us to avoid incurring the overhead of a bunch of extra Athena queries.

jeancochrane · 2024-11-07T18:29:13Z

dbt/scripts/utils/export.py

+    models_for_export = query_models_for_export(
+        select=select,
+        selector=selector,
+        rebuild=rebuild,
+        where=where,
+    )
+    output_paths: list[pathlib.Path] = []
+    for model in models_for_export:
+        output_path = save_model_to_workbook(model, output_dir)
+        output_paths.append(output_path)
+    return output_paths


I don't feel 100% great about the way that I've cut up this function to expose internals that export_qc_town_close_reports can use to modify the query data. In particular, I think it's kind of cumbersome to deal with the ModelForExport abstraction that acts as the interface between the query and save functions. I'm curious if you have thoughts on a better way to separate the concerns.

Kind of a bizarre idea, but maybe something like:

Have a query for each model

Have a filter for each model

Create a unique set of all the query values across models, run them, then populate the models using the results (and apply the filter to each one).

Might be more complicated? Maybe save it for the next incremental improvement.

I like this idea! I'll open up an issue witih this design as an iterative improvement, and then pick it up once we've validated that the ability to run reports for all towns is an important feature for the project.

Issue open here: #637

jeancochrane · 2024-11-07T18:30:39Z

dbt/scripts/utils/export.py

+def save_model_to_workbook(
+    model: ModelForExport,
+    output_dir: str | None = None,
+) -> pathlib.Path:


This interface in particular feels a bit brittle to me, since in order to add or change a parameter you have to edit the ModelForExport interface as well. A caller also has to understand ModelForExport and construct if not calling query_models_for_export first or if modifying query results between the query and the save state. Not perfect, but it was the best incremental design improvement I could think of.

jeancochrane · 2024-11-07T18:31:41Z

dbt/scripts/utils/export.py

+    # Compute the full filepath for the output
+    output_dirpath = (
+        pathlib.Path(output_dir)
+        if output_dir
+        else pathlib.Path("export/output")
+    )
+    output_path = output_dirpath / model.export_filename


Beyond this step, very little of the logic of this function has changed, it's just present in the diff because we've extracted it out from the old export_models() implementation. I'll leave a comment on anything major that needs attention.

jeancochrane · 2024-11-07T18:32:22Z

dbt/scripts/utils/townships.py

+@dataclasses.dataclass
+class Township:
+    """Class that represents a township for the purposes of QC reporting"""
+
+    township_code: str
+    township_name: str
+    tri_code: str
+    tri_name: str
+
+    def is_reassessed_during(self, year: int) -> bool:
+        """Helper function to determine if a town in a given year is undergoing
+        triennial reassessment"""
+        # 2024 is the City reassessment year (tri code 1), so
+        # ((2024 - 2024) % 3) + 1 == 1, and so on for the other two tris
+        return str(((year - 2024) % 3) + 1) == self.tri_code


This abstraction really seems like it should live in a ccao package, but absent that package, I think it's useful for the purposes of the export code.

Yep! We can move it there once that package is finished.

jeancochrane · 2024-11-07T18:33:14Z

dbt/scripts/utils/townships.py

+# Instantiate a canonical list of townships
+township_attrs = [
+    # township_code, township_name, tri_code, tri_name
+    ("10", "Barrington", "2", "North"),
+    ("11", "Berwyn", "3", "South"),
+    ("12", "Bloom", "3", "South"),
+    ("13", "Bremen", "3", "South"),
+    ("14", "Calumet", "3", "South"),
+    ("15", "Cicero", "3", "South"),
+    ("16", "Elk Grove", "2", "North"),
+    ("17", "Evanston", "2", "North"),
+    ("18", "Hanover", "2", "North"),
+    ("19", "Lemont", "3", "South"),
+    ("20", "Leyden", "2", "North"),
+    ("21", "Lyons", "3", "South"),
+    ("22", "Maine", "2", "North"),
+    ("23", "New Trier", "2", "North"),
+    ("24", "Niles", "2", "North"),
+    ("25", "Northfield", "2", "North"),
+    ("26", "Norwood Park", "2", "North"),
+    ("27", "Oak Park", "3", "South"),
+    ("28", "Orland", "3", "South"),
+    ("29", "Palatine", "2", "North"),
+    ("30", "Palos", "3", "South"),
+    ("31", "Proviso", "3", "South"),
+    ("32", "Rich", "3", "South"),
+    ("33", "River Forest", "3", "South"),
+    ("34", "Riverside", "3", "South"),
+    ("35", "Schaumburg", "2", "North"),
+    ("36", "Stickney", "3", "South"),
+    ("37", "Thornton", "3", "South"),
+    ("38", "Wheeling", "2", "North"),
+    ("39", "Worth", "3", "South"),
+    ("70", "Hyde Park", "1", "City"),
+    ("71", "Jefferson", "1", "City"),
+    ("72", "Lake", "1", "City"),
+    ("73", "Lake View", "1", "City"),
+    ("74", "North Chicago", "1", "City"),
+    ("75", "Rogers Park", "1", "City"),
+    ("76", "South Chicago", "1", "City"),
+    ("77", "West Chicago", "1", "City"),
+]


I wonder if this data would make sense as a dbt seed? Not something that I think we need to do now, but something I'm thinking about for the future.

We should make it a seed and have it as a built-in dataset in the ccao package once that's setup.

jeancochrane · 2024-11-07T18:36:00Z

@dfsnow I finished a pretty major refactor of the PR, so I think it's worth taking a fresh look at everything. Instead of the no-filter default being a set of active towns determined by a schedule, the script will now default to exporting reports for all towns, using pandas to speed up the township filtering operation. Testing this locally, it takes about 30 minutes to run an export for all towns, and we scan the same amount of data in Athena since our tables are not partitioned by township anyway.

dfsnow

You were right, this setup is definitely a bit simpler, even if it requires splitting things into query and save components. I think this is fine for now, it's a good incremental improvement. If we need to refactor it later it should be straightforward.

…for-automation

jeancochrane and others added 4 commits September 26, 2024 17:18

Add --output-dir arg to export_models and export_qc_town_close_report…

74d9cb0

… scripts

Support multiple townships and active/inactive dates in export_qc_tow…

c6ea3e8

…n_close_reports

Clearer comments in town close export script

ce6186c

Simplify final query logic in dbt/scripts/export_qc_town_close_report…

a98327a

…s.py

jeancochrane commented Oct 23, 2024

View reviewed changes

jeancochrane added 2 commits October 23, 2024 16:20

Tweak comments for QC town close reports

7232e56

Resolve mypy errors in dbt/scripts/utils/townships.py

48edfbe

jeancochrane commented Oct 31, 2024

View reviewed changes

dbt/README.md Show resolved Hide resolved

dbt/scripts/utils/town_active_schedule.csv.example Outdated Show resolved Hide resolved

jeancochrane marked this pull request as ready for review October 31, 2024 18:03

jeancochrane requested a review from a team as a code owner October 31, 2024 18:03

jeancochrane requested a review from dfsnow October 31, 2024 18:04

dfsnow approved these changes Nov 5, 2024

View reviewed changes

dbt/scripts/utils/town_active_schedule.csv.example Outdated Show resolved Hide resolved

dbt/scripts/utils/town_active_schedule.csv.example Outdated Show resolved Hide resolved

dbt/scripts/utils/townships.py Outdated Show resolved Hide resolved

jeancochrane added 3 commits November 6, 2024 11:38

Simplilfy TOWNSHIPS constant definition in dbt/scripts/utils/township…

5773d0a

…s.py

Update town close export script to export all towns by default

0b73cae

Clean up docs for new QC town close reporting script design

ea8f25f

jeancochrane force-pushed the jeancochrane/updates-to-town-close-script-for-automation branch from b7d15ae to ea8f25f Compare November 7, 2024 17:35

Merge branch 'master' into jeancochrane/updates-to-town-close-script-…

40bc7cf

…for-automation

jeancochrane commented Nov 7, 2024

View reviewed changes

jeancochrane requested a review from dfsnow November 7, 2024 18:36

dfsnow approved these changes Nov 7, 2024

View reviewed changes

Merge branch 'master' into jeancochrane/updates-to-town-close-script-…

806ba0c

…for-automation

jeancochrane mentioned this pull request Nov 7, 2024

Improve design of export module interface #637

Open

jeancochrane merged commit 0a331b4 into master Nov 7, 2024
7 checks passed

jeancochrane deleted the jeancochrane/updates-to-town-close-script-for-automation branch November 7, 2024 21:24

jeancochrane mentioned this pull request Dec 26, 2024

Add models for exporting daily Commercial QC reference files #678

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update QC town close export script to prep for automation #626

Update QC town close export script to prep for automation #626

jeancochrane commented Oct 23, 2024

jeancochrane Oct 23, 2024

dfsnow left a comment

jeancochrane Nov 7, 2024

jeancochrane Nov 7, 2024

dfsnow Nov 7, 2024

jeancochrane Nov 7, 2024

jeancochrane Nov 7, 2024

jeancochrane Nov 7, 2024

jeancochrane Nov 7, 2024

jeancochrane Nov 7, 2024

dfsnow Nov 7, 2024

jeancochrane Nov 7, 2024

dfsnow Nov 7, 2024

jeancochrane commented Nov 7, 2024

dfsnow left a comment

		CURRENT_YEAR = datetime.datetime.now().year
		CURRENT_DATE = datetime.datetime.now().date()

Update QC town close export script to prep for automation #626

Update QC town close export script to prep for automation #626

Conversation

jeancochrane commented Oct 23, 2024

Choose a reason for hiding this comment

dfsnow left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeancochrane commented Nov 7, 2024

dfsnow left a comment

Choose a reason for hiding this comment