library(readtext)
library(dplyr)
<- readtext("https://cdn-links.lww.com/permalink/ede/a/ede_25_6_2014_07_23_urquia_ede14-408_sdc1.doc") %>%
data subset(select = "text") %>%
read.table(text=sub(".*4) individual-level dataset", "", .), header=TRUE) } {
Introduction
When dealing with large data sets or sensitive individual-level information, researchers often aggregate the data vertically: individual records are combined into larger groups to conserve computer resources or to preserve privacy.
But a key question arises: can statistical analyses based on this aggregated data still yield accurate results? The answer, at least for regression analysis, is a resounding yes. In fact, we can use the estimation results based on the aggregate data to derive the estimates of a linear regression based on the original, individual-level data.
To help illustrate this concept, I’ve written R code to reproduce an example from a scientific publication where the authors provided the data and code in SAS. With this example, you can see how to aggregate the data and derive the original regression results.
The example
Rahim Moineddin and Marcelo Luis Urquia have shown that the estimated coefficients of an individual-level regression model and an aggregate-level regression model are identical in the case of a continuous outcome variable and categorical predictors.1 Although this equality does not hold for the standard errors of these point estimates, they show how to derive the individual-level standard errors from the aggregate data. Furthermore, they added SAS code and the example data set to show us the way.2 In this post, I provide the R code to reproduce this example.
1 They are the authors of the publication “Regression analysis of aggregate continuous data”.
2 These can be found in their Supplemental Digital Content document.
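The intuition behind the equality of the coefficients can be sketched as follows (in generic notation of my own, not taken from the publication): because all predictors are categorical, every individual i in cell g shares the same design vector x_g, so the ordinary least-squares normal equations over individuals collapse into cell-level sums, which are exactly the normal equations of a weighted least-squares fit of the cell means with the cell sizes N_g as weights:

\[
\hat{\beta}_{\text{indiv}}
= \Big(\sum_{g}\sum_{i=1}^{N_g} x_g x_g^{\top}\Big)^{-1}\sum_{g}\sum_{i=1}^{N_g} x_g\, y_{gi}
= \Big(\sum_{g} N_g\, x_g x_g^{\top}\Big)^{-1}\sum_{g} N_g\, x_g\, \bar{y}_g
= \hat{\beta}_{\text{aggr}},
\]

where \(\bar{y}_g\) denotes the mean outcome in cell g.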
Read data
The example data that I will use are in the file called wic.txt. These are the example data that the aforementioned authors use to illustrate the application of their method in SAS. In this example, they focus on the association between receipt of WIC (The Special Supplemental Nutrition Program for Women, Infants, and Children) food by the mother during the pregnancy and gestational weight gain. The example data are scraped from the fourth section of their Supplemental Digital Content document.
It contains four variables:
- Weight gain during pregnancy in pounds (lb) (wtgain)
- Receipt of WIC food (wic)
- Race/ethnicity (mracethn)
- Late or no prenatal care (latecare)
Let’s read the data; the code doing so is shown at the top of this post, together with the libraries it needs.
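For a quick impression of the raw data, one could, for example, inspect the structure of the resulting data frame:

# Peek at the individual-level example data scraped into the data frame `data` at the top of this post
glimpse(data)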
Apply transformations
Before we perform the regression, we apply data transformations in the same way that they are applied in the second section of the Supplemental Digital Content document. First, the weight gain during pregnancy in kilograms (wtgaink) is calculated by multiplying the weight gain during pregnancy in pounds (lb) (wtgain) by the conversion factor 0.453592. Second, the categorical variables wic, mracethn and latecare are converted to the factor type. Third, the reference level of the categorical variable mracethn is set to the last category, called ‘Non-Hispanic Blacks’. Finally, the wtgain variable is no longer needed and is therefore dropped.
db_indiv <- data %>%
  rename_all(tolower) %>%
  mutate(
    wtgaink = wtgain * 0.453592,
    wic = as.factor(wic),
    mracethn = as.factor(mracethn),
    mracethn = relevel(mracethn, ref = 3),
    latecare = as.factor(latecare)
  ) %>%
  select(wtgaink, wic, mracethn, latecare)
Fit regression model
The table presented in the publication by Moineddin and Urquia shows the estimation results of three models. The first model has a single predictor, the second model has three predictors, and the third model includes interaction effects among some of the aforementioned predictors. Note that in all three models, the continuous outcome wtgaink is predicted by a set of variables that are exclusively of the categorical type.
For brevity, let’s do the exercise for one of the three models. We choose the second model, as this is also the model that is coded in SAS in the aforementioned Supplemental Digital Content document. Let’s first estimate the model parameters based on the individual data. The results appear to be the same as those in the aforementioned table, with one notable exception: the estimates of the constant term.3
3 I did a quick check and got the authors’ estimate of the constant term after some recoding of the predictor variables. You can do this quick check yourself with this gist.
model_indiv <-
  lm(
    formula = wtgaink ~ wic + mracethn + latecare,
    data = db_indiv
  )
summ_indiv <- summary(model_indiv)
summ_indiv
Call:
lm(formula = wtgaink ~ wic + mracethn + latecare, data = db_indiv)
Residuals:
Min 1Q Median 3Q Max
-16.2466 -4.1155 -0.6376 3.4447 28.6329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.5874 0.2620 55.680 < 2e-16 ***
wicY 0.5652 0.2073 2.726 0.00642 **
mracethn1 -0.4928 0.2443 -2.017 0.04374 *
mracethn2 1.0940 0.2232 4.902 9.76e-07 ***
latecare1 -0.2014 0.1670 -1.206 0.22773
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.942 on 5265 degrees of freedom
Multiple R-squared: 0.01552, Adjusted R-squared: 0.01477
F-statistic: 20.75 on 4 and 5265 DF, p-value: < 2.2e-16
Aggregate data
Next let’s aggregate the data set. The aggregate data set contains averages of the outcome variable and frequencies of the categorical variables for all combinations of the set of categorical predictors. Thus, for all unique combinations of the categorical variables, we calculate the number of individual records N, the mean meany of the continuous variable wtgaink, and its standard deviation stdy. In our example this yields 12 records, one for each of the 2 × 3 × 2 combinations of wic, mracethn and latecare. These aggregates suffice for our exercise to derive the individual-level point estimates and their standard errors afterwards.
db_aggr <-
  db_indiv %>%
  group_by(across(c(wic, mracethn, latecare))) %>%
  summarise(
    meany = mean(wtgaink),
    stdy = sd(wtgaink),
    N = length(wtgaink)
  ) %>%
  ungroup()
Derive original estimates
Now it’s time to derive the individual-level estimates of the coefficients and standard errors based on these aggregate results. Based on the aggregate data, a weighted regression model for the meany variable is fitted with N being the weight variable. Note that the variable stdy is not used in this step.
model_aggr <-
  lm(
    formula = meany ~ wic + mracethn + latecare,
    data = db_aggr,
    weights = N
  )
summ_aggr <- summary(model_aggr)
summ_aggr
Call:
lm(formula = meany ~ wic + mracethn + latecare, data = db_aggr,
weights = N)
Weighted Residuals:
Min 1Q Median 3Q Max
-7.2204 -3.7597 -0.2019 4.1956 7.4323
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.5874 0.2682 54.380 1.86e-10 ***
wicY 0.5652 0.2122 2.663 0.03234 *
mracethn1 -0.4928 0.2502 -1.970 0.08949 .
mracethn2 1.0940 0.2285 4.788 0.00199 **
latecare1 -0.2014 0.1710 -1.178 0.27721
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.084 on 7 degrees of freedom
Multiple R-squared: 0.9188, Adjusted R-squared: 0.8723
F-statistic: 19.79 on 4 and 7 DF, p-value: 0.0006444
Deriving the coefficients appears to be very easy, because their estimates at the aggregate level are identical to those at the individual level and therefore do not need any adjustment. A quick check confirms this equality.
all.equal(coef(model_indiv), coef(model_aggr))
[1] TRUE
In order to derive the standard errors, however, a number of data transformations are necessary. To understand the purpose of these transformations, notice that there exists a fixed ratio between the individual-level standard errors and those at the aggregate level.4 With our data this fixed ratio is equal to 0.9766648 for all estimated coefficients:
4 This observation was also made here, using example data with both categorical outcome and predictor variables.
summ_indiv$coef[, 2] / summ_aggr$coef[, 2]
(Intercept) wicY mracethn1 mracethn2 latecare1
0.9766648 0.9766648 0.9766648 0.9766648 0.9766648
Thus, given the aggregate data and estimation results, it seems a logical strategy to first calculate this fixed ratio and then multiply it by the aggregate-level standard errors in order to arrive at the individual-level standard errors. This fixed ratio is called an inflation factor. The original SAS code for this procedure can be found in the Supplemental Digital Content document.
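In formula form, the procedure boils down to the following (sketched in generic notation of my own, not the authors’):

\[
\widehat{SE}_{\text{indiv},j} = \text{factor} \times \widehat{SE}_{\text{aggr},j},
\qquad
\text{factor} = \sqrt{\frac{s^2_{\text{pooled}}}{\hat{\sigma}^2_{\text{aggr}}}},
\qquad
s^2_{\text{pooled}} = \frac{\sum_g (N_g - 1)\, s_g^2}{\sum_g (N_g - 1)},
\]

where \(s_g\) denotes the within-cell standard deviation stdy, \(N_g\) the cell size N, and \(\hat{\sigma}^2_{\text{aggr}}\) the residual variance of the weighted aggregate-level regression (errorms in the code below).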
My R code for deriving the individual-level standard errors is given below. For clarity, I added a short description of what’s being calculated above each line of code. We estimate the inflation factor factor by the ratio of the square roots of the so-called pooled variance p_v and the aggregate-level residual variance errorms.5 This ratio is an estimate of the inflation factor, because p_v is an estimate of the variance of the individual-level outcome variable. Finally, the aggregate-level standard errors called StdErr are multiplied by this factor to derive the individual-level standard errors called standard_error that we are looking for.
5 See Wikipedia for a definition of pooled variance (assuming non-uniform sample sizes).
standard_errors <-
  db_aggr %>%
  mutate(
    # degrees of freedom used to calculate stdy
    n_1 = N - 1,
    # total sum of squares derived from stdy
    S_n_1 = stdy ^ 2 * n_1
  ) %>%
  summarise(
    # degrees of freedom
    n_k = sum(n_1),
    # total sum of squares
    S2_k = sum(S_n_1),
    # pooled variance (i.e. weighted average of group variances)
    p_v = S2_k / n_k,
    # aggregate-level residual variance
    errorms = summ_aggr$sigma ^ 2,
    # inflation factor
    factor = sqrt(p_v / errorms),
    # aggregate-level standard errors
    StdErr = summ_aggr$coef[, 2],
    # individual-level standard errors
    standard_error = StdErr * factor
  )
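As an extra sanity check, one could compare the derived standard errors directly with the individual-level ones, for example as sketched below; the tolerance accommodates the slight approximation error in the inflation factor discussed in the next section.

# Compare the derived standard errors with the individual-level ones;
# they should agree up to the small approximation error in `factor`
all.equal(
  unname(summ_indiv$coef[, 2]),
  unname(standard_errors$standard_error),
  tolerance = 1e-4
)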
Show table
The variable factor has 5 identical elements equal to 0.9766333, which differs only very slightly from the true inflation factor. The resulting standard errors are reported in the table below. These match the standard errors estimated by the individual-level regression, as desired. In addition to the standard errors, the table also shows the estimated coefficients, the lower and upper bounds of the confidence intervals, the z-scores and the p-values. As a validity check, compare the estimates reported here with the individual-level regression estimates shown above and with the estimates reported in the Moineddin and Urquia publication.
library(insight)

standard_errors %>%
  mutate(
    estimate = coefficients(model_aggr),
    z = estimate / standard_error,
    p = 2 * (1 - pnorm(abs(z))),
    pvalue = format_p(p, stars = TRUE),
    CI_low = estimate - 1.96 * standard_error,
    CI_high = estimate + 1.96 * standard_error
  ) %>%
  bind_cols("predictors" = names(standard_errors$standard_error)) %>%
  select(predictors, estimate, standard_error, CI_low, CI_high, z, pvalue) %>%
  format_table(digits = 3, ci_digits = 3)
predictors estimate standard_error CI z pvalue
1 (Intercept) 14.587 0.262 [14.074, 15.101] 55.681 p < .001***
2 wicY 0.565 0.207 [ 0.159, 0.971] 2.726 p = 0.006**
3 mracethn1 -0.493 0.244 [-0.972, -0.014] -2.017 p = 0.044*
4 mracethn2 1.094 0.223 [ 0.657, 1.531] 4.902 p < .001***
5 latecare1 -0.201 0.167 [-0.529, 0.126] -1.206 p = 0.228
What’s next?
In the above example, the ordinary least-squares algorithm was used to estimate the parameters of a linear regression. With the R code presented here, one can apply this algorithm to aggregate data and afterwards derive both the point estimates and the standard errors, thereby reducing the use of scarce computer resources.6 As this procedure is especially useful for doing linear regression based on BIG data, a logical next question is whether vertical aggregation is also helpful with other algorithms usually applied to BIG data, such as random forest. This may be the subject of another blog post.
6 Here is a gist with the complete example code.
Happy coding!
Citation
@online{stam2023,
author = {Stam, Piet},
title = {Regression {Analysis} of {Aggregate} {Data} in {R}},
date = {2023-02-26},
url = {https://www.pietstam.nl/posts/2023-02-26-ols-estimates-aggregate-data},
langid = {en}
}