Replace strata with time-weighted, leave-one-out, rolling building average #101
Conversation
98d6429 to c8dfdfd (Compare)
@wrridgeway @jeancochrane This is now ready for re-review! I greatly simplified the mean calculation logic, added better commenting, and fixed some counting issues. I recommend taking a look at just the ingest script and diffing against
Very nice work 👏🏻
pipeline/00-ingest.R (Outdated)
,
`:=`(
  wtd_mean = fifelse(
    wtd_mean <= 0 | is.nan(wtd_mean) | is.infinite(wtd_mean),
    NA_real_, # coerce degenerate means to NA (see discussion below)
    wtd_mean
  )
)
[Question, non-blocking] What are the different cases in which we might expect a mean less than zero, a null mean, or an infinite mean? Some that come to mind:
- Sales in the window have amounts that are incorrectly < 0 ➡️ Negative mean
- No sales in the window ➡️ Infinite mean
My thought is that we might want to fail loudly for cases that point to problematic data (case 1), whereas it could be fine to coerce to null in cases where we expect weirdness (case 2). This would point to splitting these cases out into separate tests.
No strong feelings here! Just thinking out loud. It might make sense instead to coerce all of these cases to null as the current code does and file a separate issue to add tests to make sure we're catching problematic data up front.
I didn't actually see any < 0 mean cases; I was just trying to be defensive here. But I think you're right that it's worth throwing a warning or error if any sales or means are < 0. I'll add a check just below here for any such cases.
There are a bunch of things that can produce Inf/NaN values here, including no (arms-length, market) sales in the window, being the very first sale in a building, being the only sale in the building, etc. I agree we're safe to coerce these cases to NA.
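For illustration, a minimal sketch of the kind of check described above, with made-up column names (`sale_price`, `wtd_mean`) and toy data rather than the pipeline's actual schema:

```r
library(data.table)

# Toy data standing in for the pipeline's sales table
sales <- data.table(
  sale_price = c(100000, 250000, 175000),
  wtd_mean   = c(NA_real_, 100000, 162500)
)

# Hypothetical guard: negative sale prices or rolling means point to bad
# input data, so fail loudly instead of silently coercing them to NA
stopifnot(
  "Negative sale prices detected"   = sales[, !any(sale_price < 0, na.rm = TRUE)],
  "Negative rolling means detected" = sales[, !any(wtd_mean < 0, na.rm = TRUE)]
)
```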
Just kidding, there's actually some floating point nastiness that causes a few negative means (in the output, the first two columns are the numerator and denominator of the mean, respectively).
However, it seems like the issue is totally resolved by just setting the `algo` argument in the `froll*` functions to `"exact"`, which I did in e4a82b8.
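For reference, a minimal sketch of what that change looks like, with illustrative values (in the pipeline the numerator and denominator come from time-weighted sale prices). `algo = "exact"` recomputes each window from scratch rather than using the default online update, which is where the rounding residue comes from:

```r
library(data.table)

# Illustrative inputs only; not the pipeline's actual columns
prices  <- c(150000, 200000, 250000, 300000)
weights <- c(1, 2, 3, 4)

# algo = "exact" computes each rolling window from scratch instead of using
# the default online update, avoiding the floating point residue above
num <- frollsum(prices * weights, n = 3, algo = "exact")
den <- frollsum(weights, n = 3, algo = "exact")
wtd_mean <- num / den
```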
pipeline/00-ingest.R (Outdated)
# To demo what's actually going on here with `findInterval()`:
#
# Given the following sales Y:
# 2015-12-01 2018-01-01 2022-06-15 2025-01-01
#
# And their 5-year offset X:
# 2010-12-01 2013-01-01 2017-06-15 2020-01-01
#
# For each element of X, find the _index position_ of the interval in Y that
# contains that element, e.g. for the first element of X:
# 2015-12-01 2018-01-01 2022-06-15 2025-01-01
# └── 2010-12-01 is outside any of the cuts, so the index is 0
#
# Or for the fourth element of X:
# 2015-12-01 2018-01-01 2022-06-15 2025-01-01
#            └── 2020-01-01 is between these two, so the index is 2
#
# Using this technique, we can determine how many index positions we need to
# move before the offset current date (sale date - N years) finds prior sales
# that are within the N-year time window. We then subtract that number of
# index positions from the window size, effectively shrinking the front of the
# window by that many positions and excluding sales that are outside the
# N-year offset.
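For reference, the comment's example is runnable as-is (the 5-year offset below is approximated with 365.25-day years, which may differ slightly from the pipeline's actual offset logic):

```r
y <- as.Date(c("2015-12-01", "2018-01-01", "2022-06-15", "2025-01-01"))
x <- y - round(5 * 365.25) # approximate 5-year offset of each sale date

findInterval(x, y)
#> [1] 0 0 1 2

# Window size for each sale: its position index minus its interval index,
# so the window only reaches back to sales within the 5-year offset
seq_along(y) - findInterval(x, y)
#> [1] 1 2 2 2
```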
[Suggestion, non-blocking] Thank you for the clear example of the interval math! I was tripped up for a bit thinking there was an off-by-one error, because I'd forgotten that we exclude the target sale price later on in the transformation. It might be helpful to hint at that here by adding one more paragraph at the end:
#
# In the case of the 4th element of Y, we end up with a window size of 4 - 2 == 2.
# This means that our rolling mean will include the last two sales in Y, the sales
# at positions 3 and 4. Since position 4 is the target sale, we will avoid data
# leakage later on in the pipeline by subtracting the target sale price from the
# mean (or subtracting 0 in the case of the assessment set, which does not
# have a sale price).
Good thinking. Added in 50f2358!
Thanks for updating the comments. Super clear what's going on now.
This PR completely replaces our building strata features with a more performant, more predictive, more empirically sound rolling mean feature. Specifically, we construct the 5-year rolling average for each condo building, excluding the current sale and weighting by time, then use it as a predictor in the primary unit-level model.
It improves performance, simplifies the code, and aligns more closely with how most people assume their condo unit is valued.
Closes #7, #21, #73, #93, #94.
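For intuition, a highly simplified sketch of the feature construction, with illustrative column names and a toy recency weight. The actual pipeline uses a 5-year adaptive window and its own time-weighting scheme, so this is not the real implementation:

```r
library(data.table)

# Toy data; the pipeline's real schema and weights differ
sales <- data.table(
  bldg_id    = c(1L, 1L, 1L, 1L),
  sale_date  = as.Date(c("2015-12-01", "2018-01-01", "2022-06-15", "2025-01-01")),
  sale_price = c(200000, 220000, 260000, 300000)
)
setorder(sales, bldg_id, sale_date)

# Illustrative time weight: more recent sales count more
sales[, wt := as.numeric(sale_date - min(sale_date)) + 1, by = bldg_id]

# Leave-one-out, time-weighted mean: subtract the current sale from both the
# numerator and denominator so no sale ever "sees" its own price
sales[, loo_mean := {
  num <- cumsum(sale_price * wt) - sale_price * wt
  den <- cumsum(wt) - wt
  m   <- num / den
  fifelse(m <= 0 | is.nan(m) | is.infinite(m), NA_real_, m) # first sale -> NA
}, by = bldg_id]
```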