Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update QC town close export script to prep for automation #626

Merged

Conversation

jeancochrane
Copy link
Contributor

This PR makes a few quality-of-life improvements to the QC town close export script to enable automation via any distribution mechanism (VM, OneDrive, or S3). The changes include:

  • Define a new Township class that owns a few different configurations for each township:
    • Township code and human-readable name
    • Tri code and human-readable name
    • Start/end dates for "active" periods, indicating periods when towns are eligible for QC
  • Support for a config file at scripts/utils/town_active_schedule.csv that can optionally configure the list of available Townships
  • Change the --township option so that it accepts one or more township codes
    • When the caller provides multiple codes, the script will output one workbook for each report and workbook readers can filter the township_code column for the township they are interested in
    • When the caller provides one code, the script will prepend the human-readable township name to the filename for each output workbook (e.g. "Res edit.xlsx" will become "Jefferson Res edit.xlsx"). This feature automates the renaming operation that we perform when exporting these reports manually
  • Change the --township option so that it is optional, and when omitted, falls back to all towns that are active per the town_active_schedule
  • Add an --output-dir option that controls the directory where the script will save reports

With these changes, we can follow these steps to define an automated process to export QC reports on a regular schedule:

  • Create a scripts/utils/town_active_schedule.csv config file on the machine that will run the process (or download it from S3 in an ephemeral environment like a GitHub workflow) with a defined activity schedule
  • Schedule a process to run the export_qc_town_close.py script at regular intervals
    • Omit the --township argument so that the script will export all active towns
    • Optionally set the --output-dir option on environments like the VM that need to write to specific locations but cannot chain an mv call to the scheduled export_qc_town_close.py call

Comment on lines 20 to 21
CURRENT_YEAR = datetime.datetime.now().year
CURRENT_DATE = datetime.datetime.now().date()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defining module-level constants for "current time" is not usually a best-practice, but since these constants are only ever used by scripts that are manually run and short lived, it shouldn't be any different than loading the date/year dynamically in functions that need it.

@jeancochrane jeancochrane marked this pull request as ready for review October 31, 2024 18:03
@jeancochrane jeancochrane requested a review from a team as a code owner October 31, 2024 18:03
@jeancochrane jeancochrane requested a review from dfsnow October 31, 2024 18:04
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested everything and it all works great. Think we could simplify the township instantiation a little bit but nothing major/blocking.

@jeancochrane jeancochrane force-pushed the jeancochrane/updates-to-town-close-script-for-automation branch from b7d15ae to ea8f25f Compare November 7, 2024 17:35
Comment on lines +219 to +228
# Filter the query results for only this town, but only if the
# query results are not empty, since otherwise they will have
# no columns
town_df = (
model.df
if model.df.empty
else model.df[
model.df["township_code"] == town.township_code
]
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were two ways I thought about filtering results by township:

  1. As part of the Athena query that pulls results, as the previous iteration of this PR did
  2. After pulling data from Athena, as this code block does

Option 2 allows us to more easily export a model for each town, since filtering in Pandas allows us to avoid incurring the overhead of a bunch of extra Athena queries.

Comment on lines +57 to +67
models_for_export = query_models_for_export(
select=select,
selector=selector,
rebuild=rebuild,
where=where,
)
output_paths: list[pathlib.Path] = []
for model in models_for_export:
output_path = save_model_to_workbook(model, output_dir)
output_paths.append(output_path)
return output_paths
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel 100% great about the way that I've cut up this function to expose internals that export_qc_town_close_reports can use to modify the query data. In particular, I think it's kind of cumbersome to deal with the ModelForExport abstraction that acts as the interface between the query and save functions. I'm curious if you have thoughts on a better way to separate the concerns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of a bizarre idea, but maybe something like:

  • Have a query for each model
  • Have a filter for each model

Create a unique set of all the query values across models, run them, then populate the models using the results (and apply the filter to each one).

Might be more complicated? Maybe save it for the next incremental improvement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea! I'll open up an issue witih this design as an iterative improvement, and then pick it up once we've validated that the ability to run reports for all towns is an important feature for the project.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue open here: #637

Comment on lines +222 to +225
def save_model_to_workbook(
model: ModelForExport,
output_dir: str | None = None,
) -> pathlib.Path:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interface in particular feels a bit brittle to me, since in order to add or change a parameter you have to edit the ModelForExport interface as well. A caller also has to understand ModelForExport and construct if not calling query_models_for_export first or if modifying query results between the query and the save state. Not perfect, but it was the best incremental design improvement I could think of.

Comment on lines +241 to +247
# Compute the full filepath for the output
output_dirpath = (
pathlib.Path(output_dir)
if output_dir
else pathlib.Path("export/output")
)
output_path = output_dirpath / model.export_filename
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beyond this step, very little of the logic of this function has changed, it's just present in the diff because we've extracted it out from the old export_models() implementation. I'll leave a comment on anything major that needs attention.

Comment on lines +5 to +19
@dataclasses.dataclass
class Township:
"""Class that represents a township for the purposes of QC reporting"""

township_code: str
township_name: str
tri_code: str
tri_name: str

def is_reassessed_during(self, year: int) -> bool:
"""Helper function to determine if a town in a given year is undergoing
triennial reassessment"""
# 2024 is the City reassessment year (tri code 1), so
# ((2024 - 2024) % 3) + 1 == 1, and so on for the other two tris
return str(((year - 2024) % 3) + 1) == self.tri_code
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This abstraction really seems like it should live in a ccao package, but absent that package, I think it's useful for the purposes of the export code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! We can move it there once that package is finished.

Comment on lines +22 to +63
# Instantiate a canonical list of townships
township_attrs = [
# township_code, township_name, tri_code, tri_name
("10", "Barrington", "2", "North"),
("11", "Berwyn", "3", "South"),
("12", "Bloom", "3", "South"),
("13", "Bremen", "3", "South"),
("14", "Calumet", "3", "South"),
("15", "Cicero", "3", "South"),
("16", "Elk Grove", "2", "North"),
("17", "Evanston", "2", "North"),
("18", "Hanover", "2", "North"),
("19", "Lemont", "3", "South"),
("20", "Leyden", "2", "North"),
("21", "Lyons", "3", "South"),
("22", "Maine", "2", "North"),
("23", "New Trier", "2", "North"),
("24", "Niles", "2", "North"),
("25", "Northfield", "2", "North"),
("26", "Norwood Park", "2", "North"),
("27", "Oak Park", "3", "South"),
("28", "Orland", "3", "South"),
("29", "Palatine", "2", "North"),
("30", "Palos", "3", "South"),
("31", "Proviso", "3", "South"),
("32", "Rich", "3", "South"),
("33", "River Forest", "3", "South"),
("34", "Riverside", "3", "South"),
("35", "Schaumburg", "2", "North"),
("36", "Stickney", "3", "South"),
("37", "Thornton", "3", "South"),
("38", "Wheeling", "2", "North"),
("39", "Worth", "3", "South"),
("70", "Hyde Park", "1", "City"),
("71", "Jefferson", "1", "City"),
("72", "Lake", "1", "City"),
("73", "Lake View", "1", "City"),
("74", "North Chicago", "1", "City"),
("75", "Rogers Park", "1", "City"),
("76", "South Chicago", "1", "City"),
("77", "West Chicago", "1", "City"),
]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this data would make sense as a dbt seed? Not something that I think we need to do now, but something I'm thinking about for the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make it a seed and have it as a built-in dataset in the ccao package once that's setup.

@jeancochrane
Copy link
Contributor Author

@dfsnow I finished a pretty major refactor of the PR, so I think it's worth taking a fresh look at everything. Instead of the no-filter default being a set of active towns determined by a schedule, the script will now default to exporting reports for all towns, using pandas to speed up the township filtering operation. Testing this locally, it takes about 30 minutes to run an export for all towns, and we scan the same amount of data in Athena since our tables are not partitioned by township anyway.

@jeancochrane jeancochrane requested a review from dfsnow November 7, 2024 18:36
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right, this setup is definitely a bit simpler, even if it requires splitting things into query and save components. I think this is fine for now, it's a good incremental improvement. If we need to refactor it later it should be straightforward.

@jeancochrane jeancochrane merged commit 0a331b4 into master Nov 7, 2024
7 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/updates-to-town-close-script-for-automation branch November 7, 2024 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants