Replace strata with time-weighted, leave-one-out, rolling building average (#101)

* Swap mean for leave-one-out mean
* Finalize leave-one-out mean code
* Add improved time-based weighting
* Fix sale price agg
* Update input data
* Add initial building mean construction
* Finalize building mean construction
* Update input data with rolling means
* Update training and imputation with roll means
* Update linear model recipe
* Update imputation vars
* Drop temp strata mapping
* Drop strata report sections
* Finalize ingest stage changes
* Remove strata from README
* Update training and recipes with building mean feature
* Finalize ingest for building mean
* Drop strata from pipeline stages
* Freeze ingest stage
* Update Desk Review template
* Save data at the end of ingest (doh)
* Update input data
* Update dvc lockfile
* Bump ccao vars_dict
* Fix README typos
* Reformat DVC yaml
* Fill counts with 0
* Update input data
* Fix NA, NaN, Inf filling for rolling values
* Remove dupe param
* Update note about imputing
* Fix filling in training data
* Add pct sold feature
* Edit README language
* Fix count off by one errors
* Update input data
* Simplify leave-one-out mean construction by removing lags
* Drop extra slice
* Simplify and comment building mean construction
* Cleanup docs and use NA in assessment data outputs
* Update input data
* Rename CV tuning param to avoid conflict with existing trees()
* Rename tunable param
* Rename tunable param
* Add more interval comments
* Add checks for negative building means
* Use exact algo in froll functions

ccao-data · Feb 7, 2025 · e642379
1 parent 4a8d9a0
Showing 15 changed files with 488 additions and 887 deletions.
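
The feature named in the commit title can be illustrated with a small base R sketch. This is not the repository's implementation (the commit message mentions `froll` functions, which this sketch does not use). The helper `building_loo_mean()`, the toy `sales` table, its column names, and the linear recency weighting are all assumptions made for illustration. For a target unit, the sketch averages the sale prices of other units in the same building over the prior 5 years, giving more weight to recent sales.

```r
# Toy sales data (fabricated for illustration only)
sales <- data.frame(
  building = c("A", "A", "A", "A", "B"),
  unit = c("A1", "A2", "A3", "A1", "B1"),
  sale_date = as.Date(c(
    "2021-06-01", "2023-01-15", "2024-03-10", "2024-09-01", "2022-05-05"
  )),
  price = c(200000, 220000, 260000, 300000, 150000)
)

# Hypothetical helper: time-weighted mean sale price of a building, leaving
# out every sale of the target unit. Returns NA when no other sales exist
# inside the window.
building_loo_mean <- function(sales, target_building, target_unit, as_of,
                              window_years = 5) {
  # Start of the look-back window, e.g. 5 years before the as-of date
  window_start <- seq(
    as_of,
    by = paste0("-", window_years, " years"),
    length.out = 2
  )[2]

  # Same building, different unit, sale inside [window_start, as_of)
  keep <- sales$building == target_building &
    sales$unit != target_unit &
    sales$sale_date >= window_start &
    sales$sale_date < as_of
  if (!any(keep)) {
    return(NA_real_)
  }

  # Assumed weighting: linear in recency and strictly positive, so a sale on
  # the first day of the window still contributes a small amount
  age_days <- as.numeric(as_of - sales$sale_date[keep])
  window_days <- as.numeric(as_of - window_start)
  w <- (window_days - age_days + 1) / (window_days + 1)

  sum(w * sales$price[keep]) / sum(w)
}

# Unit A1: uses only the A2 and A3 sales, weighted toward the newer A3 sale
building_loo_mean(sales, "A", "A1", as.Date("2025-01-01"))

# Unit B1: no other sales in the building, so the result is NA and the
# feature would need to be imputed
building_loo_mean(sales, "B", "B1", as.Date("2025-01-01"))
```

Leaving out the target unit's own sales keeps the feature from leaking the outcome (`meta_sale_price`) into its own predictor during training.
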
diff --git a/R/recipes.R b/R/recipes.R
@@ -10,17 +10,18 @@
 #'   will be the right-hand side of the regression AKA predictors.
 #' @param cat_vars Character vector of categorical column names. These will be
 #'   integer-encoded (base 0).
-#' @param knn_vars Character vector of column names. These columns will have
-#'   missing values imputed via KNN.
-#' @param knn_imp_vars Character vector of column names. These columns will be
-#'   used to impute the columns in knn_vars.
+#' @param imp Character vector of column names. These columns will have
+#'   missing values imputed.
+#' @param imp_vars Character vector of column names. These columns will be
+#'   used to impute the columns in imp.
 #' @param id_vars Character vector of ID variables. These can be kept in "baked"
 #'   data without being treated as predictors.
+#' @param seed Integer seed value for reproducibility.
 #'
 #' @return A recipe object that can be used to clean model input data.
 #'
 model_main_recipe <- function(data, pred_vars, cat_vars,
-                              knn_vars, knn_imp_vars, id_vars) {
+                              imp, imp_vars, id_vars, seed) {
   recipe(data) %>%
     # Set the role of each variable in the input data
     update_role(meta_sale_price, new_role = "outcome") %>%
@@ -30,19 +31,12 @@ model_main_recipe <- function(data, pred_vars, cat_vars,
     update_role_requirements("NA", bake = FALSE) %>%
     # Remove any variables not an outcome var or in the pred_vars vector
     step_rm(-all_outcomes(), -all_predictors(), -has_role("ID")) %>%
-    # Impute missing values using KNN. Specific to condo model, usually used to
-    # impute missing condo building strata. Within step_impute_knn, an estimated
-    # node value is called with the sample(). This is not deterministic, meaning
-    # different runs of the model will have different imputed values, and thus
-    # different FMVs.
-    step_impute_knn(
-      all_of(knn_vars),
-      neighbors = tune(),
-      impute_with = imp_vars(all_of(knn_imp_vars)),
-      options = list(
-        nthread = parallel::detectCores(logical = FALSE),
-        eps = 1e-08
-      )
+    # Impute missing values using a separate tree model
+    step_impute_bag(
+      all_of(imp),
+      trees = tune("imp_trees"),
+      impute_with = imp_vars(all_of(imp_vars)),
+      seed_val = seed
     ) %>%
     # Replace novel levels with "new"
     step_novel(all_of(cat_vars), -has_role("ID")) %>%
@@ -66,17 +60,18 @@ model_main_recipe <- function(data, pred_vars, cat_vars,
 #'   will be the right-hand side of the regression AKA predictors.
 #' @param cat_vars Character vector of categorical column names. These will be
 #'   transformed/encoded using embeddings.
-#' @param knn_vars Character vector of column names. These columns will have
-#'   missing values imputed via KNN.
-#' @param knn_imp_vars Character vector of column names. These columns will be
-#'   used to impute the columns in knn_vars.
+#' @param imp Character vector of column names. These columns will have
+#'   missing values imputed.
+#' @param imp_vars Character vector of column names. These columns will be
+#'   used to impute the columns in imp.
 #' @param id_vars Character vector of ID variables. These can be kept in "baked"
 #'   data without being treated as predictors.
+#' @param seed Integer seed value for reproducibility.
 #'
 #' @return A recipe object that can be used to clean model input data.
 #'
 model_lin_recipe <- function(data, pred_vars, cat_vars,
-                             knn_vars, knn_imp_vars, id_vars) {
+                             imp, imp_vars, id_vars, seed) {
   recipe(data) %>%
     # Set the role of each variable in the input data
     update_role(meta_sale_price, new_role = "outcome") %>%
@@ -89,16 +84,12 @@ model_lin_recipe <- function(data, pred_vars, cat_vars,
     step_rm(-all_outcomes(), -all_predictors(), -has_role("ID")) %>%
     # Drop extra location predictors that aren't nbhd or township
     step_rm(starts_with("loc_"), -all_numeric_predictors()) %>%
-    # Impute missing values using KNN. Specific to condo model, usually used to
-    # impute missing condo building strata
-    step_impute_knn(
-      all_of(knn_vars),
-      neighbors = tune(),
-      impute_with = imp_vars(all_of(knn_imp_vars)),
-      options = list(
-        nthread = parallel::detectCores(logical = FALSE),
-        eps = 1e-08
-      )
+    # Impute missing values using a separate tree model
+    step_impute_bag(
+      all_of(imp),
+      trees = tune("imp_trees"),
+      impute_with = imp_vars(all_of(imp_vars)),
+      seed_val = seed
     ) %>%
     # Transforms and imputations
     step_mutate(

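The `step_impute_bag()` call introduced in both recipes above can be tried in isolation with a minimal sketch. The toy data, the column names `bldg_mean`, `year_built`, and `num_units`, and the fixed `trees = 25` are assumptions for illustration; the recipes in this diff instead tune the tree count via `tune("imp_trees")` and take the columns from the `imp` and `imp_vars` arguments. Newer versions of `recipes` may warn that `imp_vars()` is deprecated.

```r
library(recipes)

# Toy data (fabricated for illustration): one row per unit, with a building
# mean sale price that is missing for a few rows
toy <- data.frame(
  meta_sale_price = 100 + 5 * (1:40),
  bldg_mean = 90 + 5 * (1:40),
  year_built = 1980 + (1:40),
  num_units = rep(c(10, 20, 40, 80), times = 10)
)
toy$bldg_mean[c(3, 17, 29)] <- NA

rec <- recipe(meta_sale_price ~ ., data = toy) %>%
  # Impute the missing building means with bagged trees fit on the two
  # helper columns; seed_val makes the bootstrap samples reproducible
  step_impute_bag(
    all_of("bldg_mean"),
    trees = 25,
    impute_with = imp_vars(all_of(c("year_built", "num_units"))),
    seed_val = 2025
  )

# Fit the imputation model and apply it to the training data
baked <- bake(prep(rec, training = toy), new_data = NULL)

# Inspect the rows that were originally missing
baked[c(3, 17, 29), c("bldg_mean", "year_built")]
```

Passing a fixed `seed_val` addresses the concern in the removed KNN comment, where imputed values (and therefore FMVs) could differ between runs.
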
diff --git a/README.Rmd b/README.Rmd
@@ -54,13 +54,14 @@ Like most assessors nationwide, our office staff cannot enter buildings to obser
 The only _complete_ information our office currently has about individual condominium units is their age, location, sale date/price, and percentage of ownership. This makes modeling condos particularly challenging, as the number of usable features is quite small. Fortunately, condos have two qualities which make modeling a bit easier:
 
 1. Condos are more homogeneous than single/multi-family properties, i.e. the range of potential condo sale prices is much narrower.
-2. Condo are pre-grouped into clusters of like units (buildings), and units within the same building usually have similar sale prices.
+2. Condos are pre-grouped into clusters of like units (buildings), and units within the same building usually have similar sale prices.
 
-We leverage these qualities to produce what we call ***strata***, a feature unique to the condo model. See [Condo Strata](#condo-strata) for more information about how strata is used and calculated.
+We leverage these qualities to produce a time-weighted, rolling average sale price for
+each building which is then used as a predictor in the unit-level model.
 
 ### Features Used
 
-Because our individual condo unit characteristics are sparse and incomplete, we primarily must rely on aggregate geospatial features, economic features, [strata](#condo-strata), and time of sale to determine condo assessed values. The features in the table below are the ones used in the most recent assessment model.
+Because our individual condo unit characteristics are sparse and incomplete, we primarily must rely on aggregate geospatial features, economic features, and time of sale to determine condo assessed values. The features in the table below are the ones used in the most recent assessment model.
 
 ```{r features_used, message=FALSE, echo=FALSE}
 library(dplyr)
@@ -87,11 +88,7 @@ hardcoded_descriptions <- tribble(
   "sale_day_of_year", "Numeric encoding of day of year (1 - 365)",
   "sale_day_of_month", "Numeric encoding of day of month (1 - 31)",
   "sale_day_of_week", "Numeric encoding of day of week (1 - 7)",
-  "sale_post_covid", "Indicator for whether sale occurred after COVID-19 was widely publicized (around March 15, 2020)",
-  "strata_1",
-  glue("Condominium Building Strata - {condo_params$input$strata$k_1} Levels"),
-  "strata_2",
-  glue("Condominium Building Strata - {condo_params$input$strata$k_2} Levels")
+  "sale_post_covid", "Indicator for whether sale occurred after COVID-19 was widely publicized (around March 15, 2020)"
 )
 # nolint end
 
@@ -209,7 +206,7 @@ We maintain a few useful resources for working with these features:
 
 - Once you've [pulled the input data](#getting-data), you can inner join the data to the CSV version of the data dictionary ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the features that we use in the model.
 - You can browse our [data catalog](https://ccao-data.github.io/data-architecture/#!/overview) to see more details about these features, in particular the [condo model input view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_pin_condo_input) which is the source of our training data.
-- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically-encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html). The [`ccao::vars_dict` object](https://ccao-data.github.io/ccao/reference/vars_dict.html) is also useful for inspecting the raw crosswalk that powers the rename and recode functions.
+- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically-encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)). The [`ccao::vars_dict` object](https://ccao-data.github.io/ccao/reference/vars_dict.html) is also useful for inspecting the raw crosswalk that powers the rename and recode functions.
 
 ### Valuation
 
@@ -236,34 +233,6 @@ The condo model is trained on a select number of "multi-PIN sales" (or "multi-sa
 
 $$\frac{0.04}{0.04 + 0.01} * \$100,000 = \$80,000$$
 
-## Condo Strata
-
-The condo model uses an engineered feature called *strata* to deliver much of its predictive power. Strata is the binned, time-weighted, 5-year average sale price of the building. There are two strata features used in the model, one with `r condo_params$input$strata$k_1` bins and one with `r condo_params$input$strata$k_2` bins. Buildings are binned across each triad using either quantiles or 1-dimensional k-means. A visual representation of quantile-based strata binning looks like:
-
-![](docs/figures/strata.png)
-
-To put strata in more concrete terms, the table below shows a sample 5-level strata. Each condominium unit would be assigned a strata from this table (Strata 1, Strata 2, etc.) based on the 5-year weighted average sale price of its building. All units in a building will have the same strata.
-
-```{r strata, echo=FALSE}
-library(tibble)
-
-tribble(
-  ~"Strata", ~"Range of 5-year Average Sale Price",
-  "Strata 1", "$0 - $121K",
-  "Strata 2", "$121K - $149K",
-  "Strata 3", "$149K - $199K",
-  "Strata 4", "$199K - $276K",
-  "Strata 5", "$276K+"
-) %>%
-  knitr::kable(format = "markdown")
-```
-
-Some additional notes on strata:
-
-- Strata is calculated in the [ingest stage](./pipeline/00-ingest.R) of this repository.
-- Calculating the 5-year average sale price of a building requires at least 1 sale. Buildings with no sales have their strata imputed via KNN (using year built, number of units, and location as features).
-- Number of bins (`r condo_params$input$strata$k_1` and `r condo_params$input$strata$k_2`) was chosen based on model performance. These numbers yielded the lowest root mean-squared error (RMSE).
-
 # Ongoing Issues
 
 The CCAO faces a number of ongoing issues specific to condominium modeling. We are currently working on processes to fix these issues. We list the issues here for the sake of transparency and to provide a sense of the challenges we face.
@@ -272,24 +241,24 @@ The CCAO faces a number of ongoing issues specific to condominium modeling. We a
 
 The current modeling methodology for condominiums makes two assumptions:
 
-1. Condos units within the same building are similar and will sell for similar amounts.
+1. Condo units within the same building are similar and will sell for similar amounts.
 2. If units are not similar, the percentage of ownership will accurately reflect and be proportional to any difference in value between units.
 
-The model process works even in heterogeneous buildings as long as assumption 2 is met. For example, imagine a building with 8 identical units and 1 penthouse unit. This building violates assumption 1 because the penthouse unit is likely larger and worth more than the other 10. However, if the percentage of ownership of each unit is roughly proportional to its value, then each unit will still receive a fair assessment.
+The model process works even in heterogeneous buildings as long as assumption 2 is met. For example, imagine a building with 8 identical units and 1 penthouse unit. This building violates assumption 1 because the penthouse unit is likely larger and worth more than the other 8. However, if the percentage of ownership of each unit is roughly proportional to its value, then each unit will still receive a fair assessment.
 
 However, the model can produce poor results when both of these assumptions are violated. For example, if a building has an extreme mix of different units, each with the same percentage of ownership, then smaller, less expensive units will be overvalued and larger, more expensive units will be undervalued.
 
 This problem is rare, but does occur in certain buildings with many heterogeneous units. Such buildings typically go through a process of secondary review to ensure the accuracy of the individual unit values.
 
 ### Buildings With Few Sales
 
-The condo model relies on sales within the same building to calculate [strata](#condo-strata). This method works well for large buildings with many sales, but can break down when there are only 1 or 2 sales in a building. The primary danger here is _unrepresentative_ sales, i.e. sales that deviate significantly from the real average value of a building's units. When this happens, buildings can have their average unit sale value pegged too high or low.
+The condo model relies on sales within the same building to calculate a weighted, rolling average building sale price. This method works well for large buildings with many sales, but can break down when there are only 1 or 2 sales in a building. The primary danger here is _unrepresentative_ sales, i.e. sales that deviate significantly from the real average value of a building's units. When this happens, buildings can have their average unit sale value pegged too high or low.
 
 Fortunately, buildings without any recent sales are relatively rare, as condos have a higher turnover rate than single and multi-family property. Smaller buildings with low turnover are the most likely to not have recent sales.
 
 ### Buildings Without Sales
 
-When no sales have occurred in a building in the 5 years prior to assessment, the building's strata features are imputed. The model will look at nearby buildings that have similar unit counts/age and then try to assign an appropriate strata to the target building.
+When no sales have occurred in a building in the 5 years prior to assessment, the building's mean sale price feature is imputed. The model will look at nearby buildings that have similar unit counts, age, and other features, then try to assign an appropriate average to the target building.
 
 Most of the time, this technique produces reasonable results. However, buildings without sales still go through an additional round of review to ensure the accuracy of individual unit values.
 
@@ -303,11 +272,7 @@ As with the [residential model](https://github.com/ccao-data/model-res-avm), the
 
 * Location, location, location. Location is the largest driver of county-wide variation in condo value. We account for location using [geospatial features like neighborhood](#features-used).
 * Condo percentage of ownership, which determines the intra-building variation in unit price.
-* [Condo building strata](#condo-strata). Strata provides us with a good estimate of the average sale price of a building's units.
-
-**Q: How do I see my condo building's strata?**
-
-Individual building [strata](#condo-strata) are not included with assessment notices or shown on the CCAO's website. However, strata *are* stored in the sample data included in this repository. You can load the data ([`input/condo_strata_data.parquet`](./input/condo_strata_data.parquet)) using R and the `read_parquet()` function from the `arrow` library.
+* Other sales in the building. This is captured by a rolling average of sales in the building over the past 5 years, excluding any sales of the target condo unit.
 
 **Q: How do I see the assessed value of other units in my building?**