library(readtext)
library(dplyr)
<- readtext("https://cdn-links.lww.com/permalink/ede/a/ede_25_6_2014_07_23_urquia_ede14-408_sdc1.doc") %>%
data subset(select = "text") %>%
read.table(text=sub(".*4) individual-level dataset", "", .), header=TRUE) } {
Introduction
When dealing with large data sets or sensitive individual-level information, researchers often aggregate the data vertically: individual records are combined into larger groups to conserve computer resources or to preserve privacy.
But a key question arises: can statistical analyses based on this aggregated data still yield accurate results? The answer, at least for regression analysis, is a resounding yes. In fact, we can use the estimation results based on the aggregate data to derive the estimates of a linear regression based on the original, individual-level data.
To help illustrate this concept, I’ve written R code to reproduce an example from a scientific publication where the authors provided the data and code in SAS. With this example, you can see how to aggregate the data and derive the original regression results.
The example
Rahim Moineddin and Marcelo Luis Urquia have shown that the estimated coefficients of an individual-level regression model and an aggregate-level regression model are identical in the case of a continuous outcome variable and categorical predictors.1 Although this equality does not hold for the standard errors of these point estimates, they show how to derive the individual-level standard errors from the aggregate data. Furthermore, they added SAS code and the example data set to show us the way.2 In this post, I provide the R code to reproduce this example.
1 They are the authors of the publication “Regression analysis of aggregate continuous data”.
2 These can be found in their Supplemental Digital Content document.
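The intuition behind the equality of the coefficients can be sketched as follows (in generic notation of my own, not taken from the publication): because all predictors are categorical, every individual i in cell g shares the same design vector x_g, so the ordinary least-squares normal equations over individuals collapse into cell-level sums, which are exactly the normal equations of a weighted least-squares fit of the cell means with the cell sizes N_g as weights:

\[
\hat{\beta}_{\text{indiv}}
= \Big(\sum_{g}\sum_{i=1}^{N_g} x_g x_g^{\top}\Big)^{-1}\sum_{g}\sum_{i=1}^{N_g} x_g\, y_{gi}
= \Big(\sum_{g} N_g\, x_g x_g^{\top}\Big)^{-1}\sum_{g} N_g\, x_g\, \bar{y}_g
= \hat{\beta}_{\text{aggr}},
\]

where \(\bar{y}_g\) denotes the mean outcome in cell g.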
Read data
The example data that I will use are in the file called wic.txt. These are the example data that the aforementioned authors use to illustrate the application of their method in SAS. In this example, they focus on the association between receipt of WIC (The Special Supplemental Nutrition Program for Women, Infants, and Children) food by the mother during the pregnancy and gestational weight gain. The example data are scraped from the fourth section of their Supplemental Digital Content document.
It contains four variables:
- Weight gain during pregnancy in pounds (lb) (wtgain)
- Receipt of WIC food (wic)
- Race/ethnicity (mracethn)
- Late or no prenatal care (latecare)
Let’s read the data; the code doing so is shown at the top of this post, together with the libraries it needs.
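For a quick impression of the raw data, one could, for example, inspect the structure of the resulting data frame:

# Peek at the individual-level example data scraped into the data frame `data` at the top of this post
glimpse(data)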
Apply transformations
Before we perform the regression, we apply data transformations in the same way that they are applied in the second section of the Supplemental Digital Content document. First, the weight gain during pregnancy in kilograms (wtgaink) is calculated by multiplying the weight gain during pregnancy in pounds (lb) (wtgain) by the conversion factor 0.453592. Second, the categorical variables wic, mracethn and latecare are converted to the factor type. Third, the reference level of the categorical variable mracethn is set to the last category, called ‘Non-Hispanic Blacks’. Finally, the wtgain variable is no longer needed and is therefore dropped.
db_indiv <- data %>%
  rename_all(tolower) %>%
  mutate(
    wtgaink = wtgain * 0.453592,
    wic = as.factor(wic),
    mracethn = as.factor(mracethn),
    mracethn = relevel(mracethn, ref = 3),
    latecare = as.factor(latecare)
  ) %>%
  select(wtgaink, wic, mracethn, latecare)
Fit regression model
The table presented in the publication by Moineddin and Urquia shows the estimation results of three models. The first model has a single predictor, the second model has three predictors, and the third model includes interaction effects among some of the aforementioned predictors. Note that in all three models, the continuous outcome wtgaink is predicted by a set of variables that are exclusively of the categorical type.
For brevity, let’s do the exercise for one of the three models. We choose the second model, as this is also the model that is coded in SAS in the aforementioned Supplemental Digital Content document. Let’s first estimate the model parameters based on the individual data. The results appear to be the same as those in the aforementioned table, with one notable exception: the estimates of the constant term.3
3 I did a quick check and got the authors’ estimate of the constant term after some recoding of the predictor variables. You can do this quick check yourself with this gist.
model_indiv <-
  lm(
    formula = wtgaink ~ wic + mracethn + latecare,
    data = db_indiv
  )
summ_indiv <- summary(model_indiv)
summ_indiv
Call:
lm(formula = wtgaink ~ wic + mracethn + latecare, data = db_indiv)
Residuals:
Min 1Q Median 3Q Max
-16.2466 -4.1155 -0.6376 3.4447 28.6329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.5874 0.2620 55.680 < 2e-16 ***
wicY 0.5652 0.2073 2.726 0.00642 **
mracethn1 -0.4928 0.2443 -2.017 0.04374 *
mracethn2 1.0940 0.2232 4.902 9.76e-07 ***
latecare1 -0.2014 0.1670 -1.206 0.22773
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.942 on 5265 degrees of freedom
Multiple R-squared: 0.01552, Adjusted R-squared: 0.01477
F-statistic: 20.75 on 4 and 5265 DF, p-value: < 2.2e-16
Aggregate data
Next let’s aggregate the data set. The aggregate data set contains averages of the outcome variable and frequencies of the categorical variables for all combinations of the set of categorical predictors. Thus, for all unique combinations of the categorical variables, we calculate the number of individual records N, the mean meany of the continuous variable wtgaink, and its standard deviation stdy. In our example this yields 12 records, one for each of the 2 × 3 × 2 combinations of wic, mracethn and latecare. These aggregates suffice for our exercise to derive the individual-level point estimates and their standard errors afterwards.
db_aggr <-
  db_indiv %>%
  group_by(across(c(wic, mracethn, latecare))) %>%
  summarise(
    meany = mean(wtgaink),
    stdy = sd(wtgaink),
    N = length(wtgaink)
  ) %>%
  ungroup()
Derive original estimates
Now it’s time to derive the individual-level estimates of the coefficients and standard errors based on these aggregate results. Based on the aggregate data, a weighted regression model for the meany variable is fitted with N being the weight variable. Note that the variable stdy is not used in this step.
model_aggr <-
  lm(
    formula = meany ~ wic + mracethn + latecare,
    data = db_aggr,
    weights = N
  )
summ_aggr <- summary(model_aggr)
summ_aggr
Call:
lm(formula = meany ~ wic + mracethn + latecare, data = db_aggr,
weights = N)
Weighted Residuals:
Min 1Q Median 3Q Max
-7.2204 -3.7597 -0.2019 4.1956 7.4323
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.5874 0.2682 54.380 1.86e-10 ***
wicY 0.5652 0.2122 2.663 0.03234 *
mracethn1 -0.4928 0.2502 -1.970 0.08949 .
mracethn2 1.0940 0.2285 4.788 0.00199 **
latecare1 -0.2014 0.1710 -1.178 0.27721
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.084 on 7 degrees of freedom
Multiple R-squared: 0.9188, Adjusted R-squared: 0.8723
F-statistic: 19.79 on 4 and 7 DF, p-value: 0.0006444
Deriving the coefficients appears to be very easy, because their estimates at the aggregate level are identical to those at the individual level and therefore do not need any adjustment. A quick check confirms this equality.
all.equal(coef(model_indiv), coef(model_aggr))
[1] TRUE
In order to derive the standard errors, however, a number of data transformations are necessary. To understand the purpose of these transformations, notice that there exists a fixed ratio between the individual-level standard errors and those at the aggregate level.4 With our data this fixed ratio is equal to 0.9766648 for all estimated coefficients:
4 This observation was also made here, using example data with both categorical outcome and predictor variables.
summ_indiv$coef[, 2] / summ_aggr$coef[, 2]
(Intercept) wicY mracethn1 mracethn2 latecare1
0.9766648 0.9766648 0.9766648 0.9766648 0.9766648
Thus, given the aggregate data and estimation results, it seems a logical strategy to first calculate this fixed ratio and then multiply it by the aggregate-level standard errors in order to arrive at the individual-level standard errors. This fixed ratio is called an inflation factor. The original SAS code for this procedure can be found in the Supplemental Digital Content document.
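In formula form, the procedure boils down to the following (sketched in generic notation of my own, not the authors’):

\[
\widehat{SE}_{\text{indiv},j} = \text{factor} \times \widehat{SE}_{\text{aggr},j},
\qquad
\text{factor} = \sqrt{\frac{s^2_{\text{pooled}}}{\hat{\sigma}^2_{\text{aggr}}}},
\qquad
s^2_{\text{pooled}} = \frac{\sum_g (N_g - 1)\, s_g^2}{\sum_g (N_g - 1)},
\]

where \(s_g\) denotes the within-cell standard deviation stdy, \(N_g\) the cell size N, and \(\hat{\sigma}^2_{\text{aggr}}\) the residual variance of the weighted aggregate-level regression (errorms in the code below).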
My R code for deriving the individual-level standard errors is given below. For clarity, I added a short description of what’s being calculated above each line of code. We estimate the inflation factor factor by the ratio of the square roots of the so-called pooled variance p_v and the aggregate-level residual variance errorms.5 This ratio is an estimate of the inflation factor, because p_v is an estimate of the variance of the individual-level outcome variable. Finally, the aggregate-level standard errors called StdErr are multiplied by this factor to derive the individual-level standard errors called standard_error that we are looking for.
5 See Wikipedia for a definition of pooled variance (assuming non-uniform sample sizes).
standard_errors <-
  db_aggr %>%
  mutate(
    # degrees of freedom used to calculate stdy
    n_1 = N - 1,
    # total sum of squares derived from stdy
    S_n_1 = stdy ^ 2 * n_1
  ) %>%
  summarise(
    # degrees of freedom
    n_k = sum(n_1),
    # total sum of squares
    S2_k = sum(S_n_1),
    # pooled variance (i.e. weighted average of group variances)
    p_v = S2_k / n_k,
    # aggregate-level residual variance
    errorms = summ_aggr$sigma ^ 2,
    # inflation factor
    factor = sqrt(p_v / errorms),
    # aggregate-level standard errors
    StdErr = summ_aggr$coef[, 2],
    # individual-level standard errors
    standard_error = StdErr * factor
  )
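As an extra sanity check, one could compare the derived standard errors directly with the individual-level ones, for example as sketched below; the tolerance accommodates the slight approximation error in the inflation factor discussed in the next section.

# Compare the derived standard errors with the individual-level ones;
# they should agree up to the small approximation error in `factor`
all.equal(
  unname(summ_indiv$coef[, 2]),
  unname(standard_errors$standard_error),
  tolerance = 1e-4
)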
Show table
The variable factor has 5 identical elements equal to 0.9766333, which differs only very slightly from the true inflation factor. The resulting standard errors are reported in the table below. These match the standard errors estimated by the individual-level regression, as desired. In addition to the standard errors, the table also shows the estimated coefficients, the lower and upper bounds of the confidence intervals, the z-scores and the p-values. As a validity check, compare the estimates reported here with the individual-level regression estimates shown above and with the estimates reported in the Moineddin and Urquia publication.
library(insight)

standard_errors %>%
  mutate(
    estimate = coefficients(model_aggr),
    z = estimate / standard_error,
    p = 2 * (1 - pnorm(abs(z))),
    pvalue = format_p(p, stars = TRUE),
    CI_low = estimate - 1.96 * standard_error,
    CI_high = estimate + 1.96 * standard_error
  ) %>%
  bind_cols("predictors" = names(standard_errors$standard_error)) %>%
  select(predictors, estimate, standard_error, CI_low, CI_high, z, pvalue) %>%
  format_table(digits = 3, ci_digits = 3)
predictors estimate standard_error CI z pvalue
1 (Intercept) 14.587 0.262 [14.074, 15.101] 55.681 p < .001***
2 wicY 0.565 0.207 [ 0.159, 0.971] 2.726 p = 0.006**
3 mracethn1 -0.493 0.244 [-0.972, -0.014] -2.017 p = 0.044*
4 mracethn2 1.094 0.223 [ 0.657, 1.531] 4.902 p < .001***
5 latecare1 -0.201 0.167 [-0.529, 0.126] -1.206 p = 0.228
What’s next?
In the above example, the ordinary least-squares algorithm was used to estimate the parameters of a linear regression. With the R code presented here, one can apply this algorithm to aggregate data and afterwards derive both the point estimates and the standard errors, thereby reducing the use of scarce computer resources.6 As this procedure is especially useful for doing linear regression based on BIG data, a logical next question is whether vertical aggregation is also helpful with other algorithms usually applied to BIG data, such as random forest. This may be the subject of another blog post.
6 Here is a gist with the complete example code.
Happy coding!
Citation
@online{stam2023,
author = {Stam, Piet},
title = {Regression {Analysis} of {Aggregate} {Data} in {R}},
date = {2023-02-26},
url = {https://www.pietstam.nl/posts/2023-02-26-ols-estimates-aggregate-data},
langid = {en}
}