-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update QC town close export script to prep for automation #626
Update QC town close export script to prep for automation #626
Conversation
CURRENT_YEAR = datetime.datetime.now().year | ||
CURRENT_DATE = datetime.datetime.now().date() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defining module-level constants for "current time" is not usually a best-practice, but since these constants are only ever used by scripts that are manually run and short lived, it shouldn't be any different than loading the date/year dynamically in functions that need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested everything and it all works great. Think we could simplify the township instantiation a little bit but nothing major/blocking.
b7d15ae
to
ea8f25f
Compare
# Filter the query results for only this town, but only if the | ||
# query results are not empty, since otherwise they will have | ||
# no columns | ||
town_df = ( | ||
model.df | ||
if model.df.empty | ||
else model.df[ | ||
model.df["township_code"] == town.township_code | ||
] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were two ways I thought about filtering results by township:
- As part of the Athena query that pulls results, as the previous iteration of this PR did
- After pulling data from Athena, as this code block does
Option 2 allows us to more easily export a model for each town, since filtering in Pandas allows us to avoid incurring the overhead of a bunch of extra Athena queries.
models_for_export = query_models_for_export( | ||
select=select, | ||
selector=selector, | ||
rebuild=rebuild, | ||
where=where, | ||
) | ||
output_paths: list[pathlib.Path] = [] | ||
for model in models_for_export: | ||
output_path = save_model_to_workbook(model, output_dir) | ||
output_paths.append(output_path) | ||
return output_paths |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel 100% great about the way that I've cut up this function to expose internals that export_qc_town_close_reports
can use to modify the query data. In particular, I think it's kind of cumbersome to deal with the ModelForExport
abstraction that acts as the interface between the query
and save
functions. I'm curious if you have thoughts on a better way to separate the concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of a bizarre idea, but maybe something like:
- Have a
query
for each model - Have a
filter
for each model
Create a unique set of all the query
values across models, run them, then populate the models using the results (and apply the filter to each one).
Might be more complicated? Maybe save it for the next incremental improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea! I'll open up an issue witih this design as an iterative improvement, and then pick it up once we've validated that the ability to run reports for all towns is an important feature for the project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue open here: #637
def save_model_to_workbook( | ||
model: ModelForExport, | ||
output_dir: str | None = None, | ||
) -> pathlib.Path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This interface in particular feels a bit brittle to me, since in order to add or change a parameter you have to edit the ModelForExport
interface as well. A caller also has to understand ModelForExport
and construct if not calling query_models_for_export
first or if modifying query results between the query
and the save
state. Not perfect, but it was the best incremental design improvement I could think of.
# Compute the full filepath for the output | ||
output_dirpath = ( | ||
pathlib.Path(output_dir) | ||
if output_dir | ||
else pathlib.Path("export/output") | ||
) | ||
output_path = output_dirpath / model.export_filename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beyond this step, very little of the logic of this function has changed, it's just present in the diff because we've extracted it out from the old export_models()
implementation. I'll leave a comment on anything major that needs attention.
@dataclasses.dataclass | ||
class Township: | ||
"""Class that represents a township for the purposes of QC reporting""" | ||
|
||
township_code: str | ||
township_name: str | ||
tri_code: str | ||
tri_name: str | ||
|
||
def is_reassessed_during(self, year: int) -> bool: | ||
"""Helper function to determine if a town in a given year is undergoing | ||
triennial reassessment""" | ||
# 2024 is the City reassessment year (tri code 1), so | ||
# ((2024 - 2024) % 3) + 1 == 1, and so on for the other two tris | ||
return str(((year - 2024) % 3) + 1) == self.tri_code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This abstraction really seems like it should live in a ccao
package, but absent that package, I think it's useful for the purposes of the export
code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! We can move it there once that package is finished.
# Instantiate a canonical list of townships | ||
township_attrs = [ | ||
# township_code, township_name, tri_code, tri_name | ||
("10", "Barrington", "2", "North"), | ||
("11", "Berwyn", "3", "South"), | ||
("12", "Bloom", "3", "South"), | ||
("13", "Bremen", "3", "South"), | ||
("14", "Calumet", "3", "South"), | ||
("15", "Cicero", "3", "South"), | ||
("16", "Elk Grove", "2", "North"), | ||
("17", "Evanston", "2", "North"), | ||
("18", "Hanover", "2", "North"), | ||
("19", "Lemont", "3", "South"), | ||
("20", "Leyden", "2", "North"), | ||
("21", "Lyons", "3", "South"), | ||
("22", "Maine", "2", "North"), | ||
("23", "New Trier", "2", "North"), | ||
("24", "Niles", "2", "North"), | ||
("25", "Northfield", "2", "North"), | ||
("26", "Norwood Park", "2", "North"), | ||
("27", "Oak Park", "3", "South"), | ||
("28", "Orland", "3", "South"), | ||
("29", "Palatine", "2", "North"), | ||
("30", "Palos", "3", "South"), | ||
("31", "Proviso", "3", "South"), | ||
("32", "Rich", "3", "South"), | ||
("33", "River Forest", "3", "South"), | ||
("34", "Riverside", "3", "South"), | ||
("35", "Schaumburg", "2", "North"), | ||
("36", "Stickney", "3", "South"), | ||
("37", "Thornton", "3", "South"), | ||
("38", "Wheeling", "2", "North"), | ||
("39", "Worth", "3", "South"), | ||
("70", "Hyde Park", "1", "City"), | ||
("71", "Jefferson", "1", "City"), | ||
("72", "Lake", "1", "City"), | ||
("73", "Lake View", "1", "City"), | ||
("74", "North Chicago", "1", "City"), | ||
("75", "Rogers Park", "1", "City"), | ||
("76", "South Chicago", "1", "City"), | ||
("77", "West Chicago", "1", "City"), | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this data would make sense as a dbt seed? Not something that I think we need to do now, but something I'm thinking about for the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make it a seed and have it as a built-in dataset in the ccao
package once that's setup.
@dfsnow I finished a pretty major refactor of the PR, so I think it's worth taking a fresh look at everything. Instead of the no-filter default being a set of active towns determined by a schedule, the script will now default to exporting reports for all towns, using pandas to speed up the township filtering operation. Testing this locally, it takes about 30 minutes to run an export for all towns, and we scan the same amount of data in Athena since our tables are not partitioned by township anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You were right, this setup is definitely a bit simpler, even if it requires splitting things into query and save components. I think this is fine for now, it's a good incremental improvement. If we need to refactor it later it should be straightforward.
This PR makes a few quality-of-life improvements to the QC town close export script to enable automation via any distribution mechanism (VM, OneDrive, or S3). The changes include:
Township
class that owns a few different configurations for each township:scripts/utils/town_active_schedule.csv
that can optionally configure the list of availableTownship
s--township
option so that it accepts one or more township codestownship_code
column for the township they are interested in--township
option so that it is optional, and when omitted, falls back to all towns that are active per thetown_active_schedule
--output-dir
option that controls the directory where the script will save reportsWith these changes, we can follow these steps to define an automated process to export QC reports on a regular schedule:
scripts/utils/town_active_schedule.csv
config file on the machine that will run the process (or download it from S3 in an ephemeral environment like a GitHub workflow) with a defined activity scheduleexport_qc_town_close.py
script at regular intervals--township
argument so that the script will export all active towns--output-dir
option on environments like the VM that need to write to specific locations but cannot chain anmv
call to the scheduledexport_qc_town_close.py
call