Author: Maciej Nasinski
See the miceFast website for more details.
miceFast provides fast methods for imputing missing data, leveraging an object-oriented programming paradigm and optimized linear algebra routines.
The package includes convenient helper functions compatible with data.table, dplyr, and other popular R packages.
Major speed improvements occur when:
- Using a grouping variable, where the data is automatically sorted by group, significantly reducing computation time.
- Performing multiple imputations, by evaluating the underlying quantitative model only once for multiple draws.
- Running Predictive Mean Matching (PMM), thanks to presorting and binary search.
For performance details, see performance_validity.R in the extdata folder.
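To make the "presorting and binary search" idea concrete, here is a minimal base R sketch of a simplified PMM step. It is an illustration only, not miceFast's internal code: the function name `pmm_sketch` and its arguments are hypothetical, and the donor predictions come from a plain `lm()` fit rather than a Bayesian draw.

```r
# Illustrative PMM sketch in base R; NOT the package's implementation.
# pmm_sketch and its arguments are hypothetical names.
pmm_sketch <- function(y_obs, pred_obs, pred_mis, k = 5) {
  # Presort the donors by predicted value once
  ord <- order(pred_obs)
  pred_sorted <- pred_obs[ord]
  y_sorted <- y_obs[ord]
  n <- length(pred_sorted)
  k <- min(k, n)

  vapply(pred_mis, function(p) {
    # Binary search: how many sorted donor predictions are <= p
    pos <- findInterval(p, pred_sorted)
    # The k nearest donors must lie in a window of at most 2k around pos
    lo <- max(1L, pos - k + 1L)
    hi <- min(n, pos + k)
    window <- lo:hi
    # Keep the k closest donors inside the window
    nearest <- window[order(abs(pred_sorted[window] - p))[seq_len(k)]]
    # Draw one donor at random and use its observed value
    y_sorted[nearest[sample.int(k, 1L)]]
  }, numeric(1))
}

set.seed(42)
x <- rnorm(200)
y <- 2 + 3 * x + rnorm(200)
y[sample(200, 40)] <- NA
obs <- !is.na(y)

fit <- lm(y ~ x)                                   # rows with NA in y are dropped
pred <- predict(fit, newdata = data.frame(x = x))  # predictions for all rows

y_imp <- y
y_imp[!obs] <- pmm_sketch(y[obs], pred[obs], pred[!obs], k = 5)
```

Because the donors are sorted once, each missing value costs a binary search plus a scan over at most 2k candidates, instead of a scan over every donor.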
You can install miceFast from CRAN:
install.packages("miceFast")
Or install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("polkas/miceFast")
Below is a short demonstration. See the vignette for advanced usage and best practices.
library(miceFast)
set.seed(1234)
data(air_miss)
# Visualize the NA structure
upset_NA(air_miss, 6)
# Simple and naive fill
imputed_data <- naive_fill_NA(air_miss)
# Compare with other packages:
# Hmisc
library(Hmisc)
data.frame(Map(function(x) Hmisc::impute(x, "random"), air_miss))
# mice
library(mice)
mice::complete(mice::mice(air_miss, printFlag = FALSE))
Multiple imputations are performed in a loop where a continuous variable is imputed using a Bayesian linear model (lm_bayes) that incorporates relevant predictors and weights for robust estimation. Simultaneously, a categorical variable is imputed using linear discriminant analysis (LDA) augmented with a randomly generated ridge penalty.
library(dplyr)
# Define a function that performs the imputation on the dataset
impute_data <- function(data) {
  data %>%
    mutate(
      # Impute the continuous variable using lm_bayes
      Solar_R_imp = fill_NA(
        x = .,
        model = "lm_bayes",
        posit_y = "Solar.R",
        posit_x = c("Wind", "Temp", "Intercept"),
        w = weights # assuming 'weights' is a column in data
      ),
      # Impute the categorical variable using lda with a random ridge parameter
      Ozone_chac_imp = fill_NA(
        x = .,
        model = "lda",
        posit_y = "Ozone_chac",
        posit_x = c("Wind", "Temp"),
        ridge = runif(1, 0, 50)
      )
    )
}
# Set seed for reproducibility
set.seed(123456)
# Run the imputation process 3 times using replicate()
# This returns a list of imputed datasets.
res <- replicate(n = 3, expr = impute_data(air_miss), simplify = FALSE)
# Check results: Calculate the mean of the imputed Solar.R values in each dataset
means_imputed <- lapply(res, function(x) mean(x$Solar_R_imp, na.rm = TRUE))
print(means_imputed)
# Check results: Tabulate the imputed categorical variable for each dataset
tables_imputed <- lapply(res, function(x) table(x$Ozone_chac_imp))
print(tables_imputed)
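As an alternative to looping over `fill_NA()`, the same continuous variable can be imputed with `fill_NA_N()`, which performs the repeated draws in one call. The sketch below is a hedged example: the column name `Solar_R_imp_n` and the choice `k = 10` are illustrative, and the `compare_imp()` arguments (`origin`, `target`) are written as I understand the package documentation, so check `?fill_NA_N` and `?compare_imp` before relying on them.

```r
library(miceFast)
library(dplyr)

data(air_miss)
set.seed(2024)

air_imp <- air_miss %>%
  mutate(
    # k is the number of repeated draws for lm_bayes (assumed value)
    Solar_R_imp_n = fill_NA_N(
      x = .,
      model = "lm_bayes",
      posit_y = "Solar.R",
      posit_x = c("Wind", "Temp", "Intercept"),
      w = weights, # 'weights' is a column of air_miss
      k = 10
    )
  )

# Compare the distribution of original and imputed values
compare_imp(air_imp, origin = "Solar.R", target = "Solar_R_imp_n")
```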
- miceFast objects (Rcpp modules).
- fill_NA(): Single imputation (lda, lm_pred, lm_bayes, lm_noise).
- fill_NA_N(): Multiple imputations (pmm, lm_bayes, lm_noise).
- VIF(): Variance Inflation Factor calculations.
- naive_fill_NA(): Automatic naive imputations.
- compare_imp(): Compare original vs. imputed values.
- upset_NA(): Visualize NA structure using UpSetR.
Quick Reference Table:
Function | Description |
---|---|
new(miceFast) | Creates an OOP instance with numerous imputation methods (see the vignette). |
fill_NA() | Single imputation: lda, lm_pred, lm_bayes, lm_noise. |
fill_NA_N() | Multiple imputations (N repeats): pmm, lm_bayes, lm_noise. |
VIF() | Computes Variance Inflation Factors. |
naive_fill_NA() | Performs automatic, naive imputations. |
compare_imp() | Compares imputations vs. original data. |
upset_NA() | Visualizes NA structure using an UpSet plot. |
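The `new(miceFast)` entry is the least self-explanatory one, so here is a short sketch of the object-oriented interface. It follows the pattern I recall from the package vignette (`set_data`, `impute`, `impute_N`, `update_var`, `get_data`); treat the method names and the `$imputations` element as assumptions to verify against the vignette. Columns are addressed by position in a numeric matrix.

```r
library(miceFast)
set.seed(123)

# Numeric matrix: four airquality columns plus an intercept column (position 5)
data <- cbind(as.matrix(airquality[, 1:4]), intercept = 1)

model <- new(miceFast)
model$set_data(data)

# Column 1 (Ozone): single imputation predicted from the intercept only
model$update_var(1, model$impute("lm_pred", 1, 5)$imputations)

# Column 2 (Solar.R): repeated Bayesian draws (10 here) using columns 1, 3, 4, 5
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5), 10)$imputations)

head(model$get_data())
```

Imputing Ozone first matters in this sketch: it makes column 1 complete, so it can serve as a predictor when Solar.R is imputed.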
Benchmark testing (on R 4.4.3, macOS M3 Pro, optimized BLAS and LAPACK) shows miceFast can significantly reduce computation time, especially in these scenarios:
- Multiple imputations: roughly x * (number of multiple imputations) faster, since the model is computed only once.
For performance details, see performance_validity.R in the extdata folder.
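The "model is computed only once" point can be illustrated with a small base R sketch. This is not the package's code: it simply fits a linear model once on the observed rows and then reuses the fitted values for every draw, adding fresh noise each time. The simulated data and all variable names are assumptions made for the example.

```r
# Illustrative only: fit once, draw many times.
set.seed(1)
n <- 500
x <- cbind(1, rnorm(n), rnorm(n))          # intercept plus two predictors
y <- drop(x %*% c(1, 2, -1)) + rnorm(n)
miss <- sample(n, 100)                     # rows treated as missing
obs <- setdiff(seq_len(n), miss)
n_imp <- 10                                # number of imputations

# Fit the linear model a single time on the observed rows
fit <- lm.fit(x[obs, ], y[obs])
sigma <- sqrt(sum(fit$residuals^2) / fit$df.residual)
pred_mis <- drop(x[miss, ] %*% fit$coefficients)

# Each imputation reuses pred_mis and only adds new residual noise
draws <- replicate(n_imp, pred_mis + rnorm(length(miss), sd = sigma))
dim(draws) # one row per missing value, one column per imputation
```

A naive loop would refit the model `n_imp` times, which is where a speed-up proportional to the number of imputations comes from.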