Replace strata with time-weighted, leave-one-out, rolling building average #101
Conversation
98d6429 to c8dfdfd (Compare)
@wrridgeway @jeancochrane This is now ready for re-review! I greatly simplified the mean calculation logic, added better commenting, and fixed some counting issues. I recommend taking a look at just the ingest script and diffing against
Very nice work 👏🏻
pipeline/00-ingest.R (Outdated)
,
`:=`(
  wtd_mean = fifelse(
    wtd_mean <= 0 | is.nan(wtd_mean) | is.infinite(wtd_mean),
    NA_real_, # coerce degenerate means to NA (see discussion below)
    wtd_mean
  )
)
[Question, non-blocking] What are the different cases in which we might expect a mean less than zero, a null mean, or an infinite mean? Some that come to mind:
- Sales in the window have amounts that are incorrectly < 0 ➡️ Negative mean
- No sales in the window ➡️ Infinite mean
My thought is that we might want to fail loudly for cases that point to problematic data (case 1), whereas it could be fine to coerce to null in cases where we expect weirdness (case 2). This would point to splitting these cases out into separate tests.
No strong feelings here! Just thinking out loud. It might make sense instead to coerce all of these cases to null as the current code does and file a separate issue to add tests to make sure we're catching problematic data up front.
I didn't actually see any < 0 mean cases; I was just trying to be defensive here. But I think you're right that it's worth throwing a warning or error if any sales or means are < 0. I'll add a check just below here for any such cases.
There are a bunch of things that can produce Inf/NaN values here, including no (arms-length, market) sales in the window, being the very first sale in a building, being the only sale in the building, etc. I agree we're safe to coerce these cases to NA.
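For illustration, a minimal sketch of the kind of check described above, with made-up column names (`sale_price`, `wtd_mean`) and toy data rather than the pipeline's actual schema:

```r
library(data.table)

# Toy data standing in for the pipeline's sales table
sales <- data.table(
  sale_price = c(100000, 250000, 175000),
  wtd_mean   = c(NA_real_, 100000, 162500)
)

# Hypothetical guard: negative sale prices or rolling means point to bad
# input data, so fail loudly instead of silently coercing them to NA
stopifnot(
  "Negative sale prices detected"   = sales[, !any(sale_price < 0, na.rm = TRUE)],
  "Negative rolling means detected" = sales[, !any(wtd_mean < 0, na.rm = TRUE)]
)
```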
Just kidding, there's actually some floating point nastiness that causes a few negative means (in the output, the first two columns are the numerator and denominator of the mean, respectively).
However, it seems like the issue is totally resolved by just setting the `algo` argument in the `froll*` functions to `"exact"`, which I did in e4a82b8.
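For reference, a minimal sketch of what that change looks like, with illustrative values (in the pipeline the numerator and denominator come from time-weighted sale prices). `algo = "exact"` recomputes each window from scratch rather than using the default online update, which is where the rounding residue comes from:

```r
library(data.table)

# Illustrative inputs only; not the pipeline's actual columns
prices  <- c(150000, 200000, 250000, 300000)
weights <- c(1, 2, 3, 4)

# algo = "exact" computes each rolling window from scratch instead of using
# the default online update, avoiding the floating point residue above
num <- frollsum(prices * weights, n = 3, algo = "exact")
den <- frollsum(weights, n = 3, algo = "exact")
wtd_mean <- num / den
```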
pipeline/00-ingest.R (Outdated)
# To demo what's actually going on here with `findInterval()`:
#
# Given the following sales Y:
# 2015-12-01 2018-01-01 2022-06-15 2025-01-01
#
# And their 5-year offset X:
# 2010-12-01 2013-01-01 2017-06-15 2020-01-01
#
# For each element of X, find the _index position_ of the interval in Y that
# contains that element, e.g. for the first element of X:
# 2015-12-01 2018-01-01 2022-06-15 2025-01-01
# └── 2010-12-01 is outside any of the cuts, so the index is 0
#
# Or for the fourth element of X:
# 2015-12-01 2018-01-01 2022-06-15 2025-01-01
#            └── 2020-01-01 is between these two, so the index is 2
#
# Using this technique, we can determine how many index positions we need to
# move before the offset current date (sale date - N years) finds prior sales
# that are within the N-year time window. We then subtract that number of
# index positions from the window size, effectively shrinking the front of the
# window by that many positions and excluding sales that are outside the
# N-year offset.
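For reference, the comment's example is runnable as-is (the 5-year offset below is approximated with 365.25-day years, which may differ slightly from the pipeline's actual offset logic):

```r
y <- as.Date(c("2015-12-01", "2018-01-01", "2022-06-15", "2025-01-01"))
x <- y - round(5 * 365.25) # approximate 5-year offset of each sale date

findInterval(x, y)
#> [1] 0 0 1 2

# Window size for each sale: its position index minus its interval index,
# so the window only reaches back to sales within the 5-year offset
seq_along(y) - findInterval(x, y)
#> [1] 1 2 2 2
```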
[Suggestion, non-blocking] Thank you for the clear example of the interval math! I was tripped up for a bit thinking there was an off-by-one error, because I'd forgotten that we exclude the target sale price later on in the transformation. It might be helpful to hint at that here by adding one more paragraph at the end:
#
# In the case of the 4th element of Y, we end up with a window size of 4 - 2 == 2.
# This means that our rolling mean will include the last two sales in Y, the sales
# at positions 3 and 4. Since position 4 is the target sale, we will avoid data
# leakage later on in the pipeline by subtracting the target sale price from the
# mean (or subtracting 0 in the case of the assessment set, which does not
# have a sale price).
Good thinking. Added in 50f2358!
Thanks for updating the comments. Super clear what's going on now.
This PR completely replaces our building strata features with a more performant, more predictive, more empirically sound rolling mean feature. Specifically, we construct the 5-year rolling average for each condo building, excluding the current sale and weighting by time, then use it as a predictor in the primary unit-level model.
It improves performance, simplifies the code, and aligns more closely with how most people assume their condo unit is valued.
Closes #7, #21, #73, #93, #94.
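For intuition, a highly simplified sketch of the feature construction, with illustrative column names and a toy recency weight. The actual pipeline uses a 5-year adaptive window and its own time-weighting scheme, so this is not the real implementation:

```r
library(data.table)

# Toy data; the pipeline's real schema and weights differ
sales <- data.table(
  bldg_id    = c(1L, 1L, 1L, 1L),
  sale_date  = as.Date(c("2015-12-01", "2018-01-01", "2022-06-15", "2025-01-01")),
  sale_price = c(200000, 220000, 260000, 300000)
)
setorder(sales, bldg_id, sale_date)

# Illustrative time weight: more recent sales count more
sales[, wt := as.numeric(sale_date - min(sale_date)) + 1, by = bldg_id]

# Leave-one-out, time-weighted mean: subtract the current sale from both the
# numerator and denominator so no sale ever "sees" its own price
sales[, loo_mean := {
  num <- cumsum(sale_price * wt) - sale_price * wt
  den <- cumsum(wt) - wt
  m   <- num / den
  fifelse(m <= 0 | is.nan(m) | is.infinite(m), NA_real_, m) # first sale -> NA
}, by = bldg_id]
```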