It took some time to figure out how to use frequency weights when fitting a model in the tidymodels framework. Here is the code to do that.
digital transformation
Author
Piet Stam
Published
September 19, 2022
Use case
The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. The get started case study helps to take the first steps. Another helpful source is lesson 10 of an R tutorial from a data mining course at George Mason University.
Building on these basics, my next step is to apply frequency weights when estimating a linear regression model in the tidymodels way of coding. However, according to a blog post on the topic, case weights are a feature still under development, and some of my first attempts to create a reproducible example failed.
The tidymodels how-to add case weights to a workflow gives some examples with code that helps to crack the case. Below I give the code for two reproducible examples, one example of model estimation without using weights and one with using weights.
Data and method
The models that I estimate are linear regression models with a set of predictors and one numeric outcome variable. The parameters of this model are estimated by ordinary least squares.
I use the car_prices data set for the examples and try to predict the car prices with the car brands as predictors. Note that, as a consequence, in my examples the outcome variable is non-negative and the predictors are mutually exclusive (0/1) dummy variables. This makes the examples easy to understand, but the code may apply to a wider range of variables nonetheless. I use mileage as the weighting variable.
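To make concrete what a frequency weight does in least squares estimation, here is a minimal sketch in base R. It assumes a small made-up data set: the names `toy`, `price`, `buick` and `w` and all values are hypothetical and are not taken from car_prices. An integer frequency weight of k on a row should give the same coefficient estimates as repeating that row k times.

```r
# Hypothetical toy data: an outcome, one 0/1 dummy and an integer frequency weight
toy <- data.frame(
  price = c(10, 12, 20, 23, 19),
  buick = c(1, 1, 0, 0, 0),
  w     = c(2L, 1L, 3L, 1L, 2L)
)

# Weighted least squares with the frequency weights
fit_weighted <- lm(price ~ buick, data = toy, weights = w)

# Unweighted least squares on data in which each row is repeated w times
toy_expanded <- toy[rep(seq_len(nrow(toy)), times = toy$w), ]
fit_expanded <- lm(price ~ buick, data = toy_expanded)

# Compare the two sets of coefficient estimates
all.equal(coef(fit_weighted), coef(fit_expanded))
```

Only the coefficients are compared here. Standard errors generally differ between the two fits, because the expanded data set has more rows and therefore more residual degrees of freedom.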
Let us start with loading the data into memory.
# Load the recipes, parsnip, workflows and hardhat packages, along with the rest of tidymodels
library(tidymodels)
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ recipes::step() masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
Now we select only the relevant variables. Although the weights are not yet used in the first example, mileage is already defined as the weighting variable in the data set.
# Create a data set with one non-negative continuous variable and mutually exclusive dummy variables as predictors
db <- select(car_prices, Price, Buick:Saturn, Mileage) %>%
  mutate(Mileage = frequency_weights(Mileage))
str(db)
Now on with the first example. In the code below we define the recipe, define the model and set mode and engine. These are combined into a workflow. Afterwards we look at the properties of these objects to check if these are as expected. Note that Saturn is the reference dummy variable of my choice (i.e. in effect its coefficient is set to zero by default) and is thus excluded from the regression.
# Get data ready for modeling with recipe package
recipe1 <- db %>%
  recipe(Price ~ 1 + Buick + Cadillac + Chevy + Pontiac + Saab) # add all dummy variables but one

# Define model, mode and engine with parsnip package
model1 <- linear_reg() %>% # adds the basic model type
  set_engine('lm') %>% # adds the computational engine to estimate the model parameters
  set_mode('regression') # adds the modeling context in which it will be used

# Bundle pre-processing, modeling, and post-processing with workflow package
workflow1 <- workflow() %>%
  add_recipe(recipe1) %>%
  add_model(model1)

# View object properties
recipe1
Recipe
Inputs:
role #variables
outcome 1
predictor 5
model1
Linear Regression Model Specification (regression)
Computational engine: lm
workflow1
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Now that the objects look alright, the model estimation can be performed and the parameter estimates are printed.
# Now estimate the model via a single call to fit()
fit1 <- fit(workflow1, data = db)

# View fit1 properties
tidy(fit1)
Then come the weights. The first thought is to update the current workflow with a line of code to make clear that weights should be used. However, this approach does not produce the desired result.
Therefore, an alternative approach is followed. Instead of building upon the blocks of the first example, we start with a new workflow() object and add an add_case_weights line of code to it. Next, one would expect a line of code with an add_recipe command, but for some reason this did not work after a “few” tries. Instead, we use add_formula with the regression formula as an argument. Lastly, surprisingly conventional, an add_model command is added.
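The steps just described can be sketched as follows. This is a reconstruction under assumptions, not necessarily the exact code that was run: the names `workflow2`, `fit2` and `lm_check` are hypothetical, the data set is loaded explicitly from the modeldata package, and the intercept is left implicit in the formula. The last lines cross-check the weighted workflow against a direct `lm()` call with the same weights.

```r
library(tidymodels)

# Assumption: car_prices is taken from the modeldata package
data(car_prices, package = "modeldata")

# Same data preparation as before: Mileage becomes a frequency weights column
db <- select(car_prices, Price, Buick:Saturn, Mileage) %>%
  mutate(Mileage = frequency_weights(Mileage))

# Same model specification as in the unweighted example
model1 <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# New workflow: declare the weights column, then the formula, then the model
workflow2 <- workflow() %>%
  add_case_weights(Mileage) %>%
  add_formula(Price ~ Buick + Cadillac + Chevy + Pontiac + Saab) %>%
  add_model(model1)

# Estimate the weighted model and view the parameter estimates
fit2 <- fit(workflow2, data = db)
tidy(fit2)

# Cross-check: weighted lm() on the original data, with Mileage as plain integer weights
lm_check <- lm(Price ~ Buick + Cadillac + Chevy + Pontiac + Saab,
               data = car_prices, weights = Mileage)
all.equal(unname(coef(extract_fit_engine(fit2))), unname(coef(lm_check)))
```

If the comparison does not return TRUE, the workflow is probably not passing the weights to the engine in the way assumed here.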
This is a nice first try! With the two examples above it is possible to experiment further in search of alternative or shorter routes to the estimation results. In the meantime, we wait for tidymodels to support weights in all the relevant packages. If these two examples give you new ideas for progress, do not hesitate to give feedback to the Tidyverse developers.
Happy coding!
Citation
BibTeX citation:
@online{stam2022,
author = {Stam, Piet},
title = {Use Tidymodels with Weighted and Unweighted Data},
date = {2022-09-19},
url = {https://www.pietstam.nl/posts/2022-09-19-tidymodels-weighted-data},
langid = {en}
}