miceFast - Introduction • miceFast

Overview

This vignette introduces the miceFast package, which provides fast imputations in an object-oriented programming style.
The underlying methods rely on quantitative models with closed-form solutions implemented via linear algebra operations.

Additional utility functions are provided for easy integration with popular R packages such as dplyr and data.table.

Installation and Setup

Before you begin, ensure the miceFast package is installed. You may also want to install a high-performance BLAS library for significant speedups.

BLAS

Recommended: Install an optimized BLAS library (potentially up to 100x faster):

Linux: sudo apt-get install libopenblas-dev

macOS (Accelerate/vecLib BLAS):

cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib

Loading the Package and Setting a Seed

pkgs <- c("miceFast", "mice", "ggplot2", "dplyr", "data.table")
inst <- lapply(pkgs, library, character.only = TRUE)
set.seed(123456)

Description

The miceFast package accelerates imputations for quantitative models (with closed-form solutions) by using efficient linear algebra routines. It offers substantial speed improvements in:

Linear Discriminant Analysis (up to 5x faster),
Grouping-based calculations (often 10x or more, depending on data size, number of groups, etc.),
Multiple imputations (faster by approximately the number of imputations, since the core model is computed only once),
Variance Inflation Factor (VIF) computations (around 5x faster by skipping redundant steps),
Predictive Mean Matching (PMM) (around 3x faster thanks to pre-sorting and binary search).

Note: For details on the performance-testing procedure, see the file performance_validity.R in the extdata folder. Additional simulation plots (which you are free to modify) are located in extdata/images.
Environment: R 4.2 on Mac M1.

Motivations

Missing data is a common issue in data analysis. One approach is to delete any row containing missing observations, though this can degrade the quality or representativeness of your dataset. Another, often better, approach is to use imputation methods (such as multiple or regular imputations) to fill in missing data, leveraging observed independent variables.

Since R and Python are user-friendly but can be slower for heavy computation, miceFast uses efficient C++/RcppArmadillo routines under the hood to speed up the imputation process.

Introduction for dplyr/data.table Users

The miceFast package provides additional functions (fill_NA and fill_NA_N) that are designed to be stable in the presence of user-level mistakes or complex data structures. Below are minimal examples using dplyr and data.table.

Example Data

data(air_miss)  # airquality with additional variables

Examining Missingness

upset_NA(air_miss, 6)

Using dplyr

Below we show how to:

Calculate VIF.
Perform various imputations with and without grouping.
Safely handle collinearity or small group sizes by wrapping calls in tryCatch.

# VIF (Variance Inflation Factor) > ~10 suggests collinearity problems.
VIF(
  air_miss,
  posit_y = "Ozone",
  posit_x = c("Solar.R", "Wind", "Temp", "x_character",
              "Day", "weights", "groups")
)

##           [,1]
## [1,] 24.978996
## [2,]  1.445308
## [3,]  3.077776
## [4,] 42.248792
## [5,]  1.083795
## [6,]  1.100853
## [7,]  2.954588

air_miss <- air_miss %>%
  # Impute with grouping and weights
  group_by(groups) %>%
  do(mutate(., 
    Solar_R_imp = fill_NA(
      x = .,
      model = "lm_pred",
      posit_y = "Solar.R",
      posit_x = c("Wind", "Temp", "Intercept"),
      w = .[["weights"]]
    )
  )) %>%
  ungroup() %>%
  # Impute a discrete variable
  mutate(x_character_imp = fill_NA(
    x = .,
    model = "lda",
    posit_y = "x_character",
    posit_x = c("Wind", "Temp")
  )) %>%
  # Log regression because Ozone is roughly log-normal
  mutate(Ozone_imp1 = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Intercept"),
    logreg = TRUE
  )) %>%
  # Impute by position indices
  mutate(Ozone_imp2 = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = 1,
    posit_x = c(4, 6),
    logreg = TRUE
  )) %>%
  # Imputations (mean of 30 draws)
  mutate(Ozone_imp3 = fill_NA_N(
    x = .,
    model = "lm_noise",
    posit_y = "Ozone",
    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
    w = .[["weights"]],
    logreg = TRUE,
    k = 30
  )) %>%
  mutate(Ozone_imp4 = fill_NA_N(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
    w = .[["weights"]],
    logreg = TRUE,
    k = 30
  )) %>%
  group_by(groups) %>%
  # Single evaluation model
  do(mutate(.,
    Ozone_imp5 = fill_NA(
      x = .,
      model = "lm_pred",
      posit_y = "Ozone",
      posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
      w = .[["weights"]],
      logreg = TRUE
    )
  )) %>%
  do(mutate(.,
    Ozone_imp6 = fill_NA_N(
      x = .,
      model = "pmm",
      posit_y = "Ozone",
      posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
      w = .[["weights"]],
      logreg = TRUE,
      k = 20
    )
  )) %>%
  ungroup() %>%
  # Combine imputations
  mutate(Ozone_imp_mix = rowMeans(select(., starts_with("Ozone_imp")))) %>%
  # Protect against errors in small groups
  group_by(groups) %>%
  do(mutate(.,
    Ozone_chac_imp = tryCatch(
      fill_NA(
        x = .,
        model = "lda",
        posit_y = "Ozone_chac",
        posit_x = c("Intercept", "Month", "Day", "Temp", "x_character_imp"),
        w = .[["weights"]]
      ),
      error = function(e) .[["Ozone_chac"]]
    )
  )) %>%
  ungroup()

Visualizing Results

# Compare original vs. imputed distributions
compare_imp(air_miss, origin = "Ozone", target = "Ozone_imp_mix")

# Or compare multiple imputation columns
compare_imp(air_miss, origin = "Ozone", target = c("Ozone_imp2", "Ozone_imp_mix"))

Using data.table

Below is a similar approach using data.table syntax.

data(air_miss)
setDT(air_miss)

# Imputations
air_miss[, Solar_R_imp := fill_NA_N(
  x = .SD,
  model = "lm_bayes",
  posit_y = "Solar.R",
  posit_x = c("Wind","Temp","Intercept"),
  w = .SD[["weights"]],
  k = 100
), by = .(groups)] %>%
  .[, x_character_imp := fill_NA(
    x = .SD,
    model = "lda",
    posit_y = "x_character",
    posit_x = c("Wind","Temp","groups")
  )] %>%
  .[, Ozone_imp1 := fill_NA(
    x = .SD,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Intercept"),
    logreg = TRUE
  )] %>%
  .[, Ozone_imp2 := fill_NA(
    x = .SD,
    model = "lm_bayes",
    posit_y = 1,
    posit_x = c(4,6),
    logreg = TRUE
  )] %>%
  .[, Ozone_imp3 := fill_NA_N(
    x = .SD,
    model = "lm_noise",
    posit_y = "Ozone",
    posit_x = c("Intercept","x_character_imp","Wind","Temp"),
    w = .SD[["weights"]],
    logreg = TRUE,
    k = 30
  )] %>%
  .[, Ozone_imp4 := fill_NA_N(
    x = .SD,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Intercept","x_character_imp","Wind","Temp"),
    w = .SD[["weights"]],
    logreg = TRUE,
    k = 30
  )] %>%
  .[, Ozone_imp5 := fill_NA(
    x = .SD,
    model = "lm_pred",
    posit_y = "Ozone",
    posit_x = c("Intercept","x_character_imp","Wind","Temp"),
    w = .SD[["weights"]],
    logreg = TRUE
  ), by = .(groups)] %>%
  .[, Ozone_imp6 := fill_NA_N(
    x = .SD,
    model = "pmm",
    posit_y = "Ozone",
    posit_x = c("Intercept","x_character_imp","Wind","Temp"),
    w = .SD[["weights"]],
    logreg = TRUE,
    k = 10
  ), by = .(groups)] %>%
  # Average across a set of methods
  .[, Ozone_imp_mix := apply(.SD, 1, mean), .SDcols = Ozone_imp1:Ozone_imp6] %>%
  # tryCatch for small group issues or collinearity
  .[, Ozone_chac_imp := tryCatch(
    fill_NA(
      x = .SD,
      model = "lda",
      posit_y = "Ozone_chac",
      posit_x = c("Intercept","Month","Day","Temp","x_character_imp"),
      w = .SD[["weights"]]
    ),
    error = function(e) .SD[["Ozone_chac"]]
  ), by = .(groups)]

Generating Data with the corrData Module

You can generate correlated data via corrData:

# Constructors:
new(corrData, nr_cat, n_obs, means, cor_matrix)
new(corrData, n_obs, means, cor_matrix)

# Some relevant methods:
cd_obj$fill("type")  # "contin", "binom", or "discrete"

nr_cat: number of categories for a discrete dependent variable
n_obs: number of observations
means: centers for independent variables
cor_matrix: a positive-definite correlation matrix

Example usage:

Imputing Data with the miceFast Module

The miceFast module performs fast imputations via various built-in models. Below is a summary of its main methods:

set_data(x): Provide the data as a numeric matrix by reference.
set_g(g): Provide a grouping variable (numeric vector, must be positive, no NAs).
set_w(w): Provide a weighting variable (numeric vector, must be positive, no NAs).
impute(model, posit_y, posit_x): Single imputation according to the chosen model.
impute_N(model, posit_y, posit_x, k): Multiple imputations (averaged result) for certain models.
update_var(posit_y, imputations): Permanently update the object’s data for the specified column with new imputations.
vifs(posit_y, posit_x): Compute Variance Inflation Factors.
sort_byg() / is_sorted_byg(): Sort the data by the grouping variable (or check if sorted).
which_updated(): Check which variables have been permanently updated in the object.

Supported models:

"lda": Linear discriminant analysis (for categorical posit_y).
"lm_pred": Standard linear model prediction (including a mean-imputation approach if you only include an intercept).
"lm_bayes": Linear model with Bayesian noise.
"lm_noise": Similar to "lm_bayes" but with different noise assumptions.
"pmm": Predictive Mean Matching (used via impute_N).

Note: For purely mean imputation, add an intercept column and use "lm_pred". The LDA model requires at least 15 complete observations. Linear models require at least as many complete observations as the number of predictors.

Examples

Simple Usage

data <- cbind(as.matrix(mice::nhanes), intercept = 1, index = 1:nrow(mice::nhanes))
model <- new(miceFast)
model$set_data(data)

# Single imputation, linear model
imp_result <- model$impute("lm_pred", posit_y = 2, posit_x = 5)
model$update_var(2, imp_result$imputations)

# LDA imputation for a categorical variable
model$update_var(3, model$impute("lda", 3, c(1,2))$imputations)

# Multiple imputation example
multi_imp <- model$impute_N("lm_bayes", 4, c(1,2,3), k = 10)
model$update_var(4, multi_imp$imputations)

# Check which variables were updated
model$which_updated()

## [1] 2 3 4

# Always be cautious with 'update_var' 
# since it permanently alters the object and your data matrix.

Example with Weights and Groups (Pre-sorted)

data <- cbind(as.matrix(airquality[,-5]), intercept = 1, index = 1:nrow(airquality))
weights <- rgamma(nrow(data), 3, 3)
groups <- as.numeric(airquality[,5]) 

model <- new(miceFast)
model$set_data(data)
model$set_w(weights)
model$set_g(groups)

# Impute with group-wise weighting
model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1,3,4,5,6), k = 10)$imputations)

# Inspect updated data, returning to original order via model$get_index()
head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]      [,9]
## [1,]   41  190  7.4   67    1    1    1    5 1.2183624
## [2,]   36  118  8.0   72    2    1    2    5 1.2866076
## [3,]   12  149 12.6   74    3    1    3    5 0.7953631
## [4,]   18  313 11.5   62    4    1    4    5 0.7386522

Example with Unsorted Groups

data <- cbind(as.matrix(airquality[, -5]), intercept = 1, index = 1:nrow(airquality))
weights <- rgamma(nrow(data), 3, 3)
groups <- as.numeric(sample(1:8, nrow(data), replace = TRUE))

model <- new(miceFast)
model$set_data(data)
model$set_w(weights)
model$set_g(groups)

# The data will be automatically sorted by 'groups' during the first imputation:
model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)

## Warning in model$impute("lm_pred", 1, 6): 
##  Data was sorted by the grouping variable - use `get_index()` to retrieve an order

model$update_var(2, model$impute_N("lm_bayes", 2, c(1,3,4,5,6), 10)$imputations)

head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]      [,9]
## [1,]   41  190  7.4   67    1    1    1    3 0.8644072
## [2,]   36  118  8.0   72    2    1    2    7 0.5708125
## [3,]   12  149 12.6   74    3    1    3    6 0.3919961
## [4,]   18  313 11.5   62    4    1    4    8 1.0201001

Tips and Additional Notes

Creating matrices from data frames: Remember that standard R matrices can only hold one data type (often numeric). Factor and character columns must be transformed (e.g., via model.matrix).
```
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)
mtcars_mat <- model.matrix(~ ., data = mtcars, na.action = "na.pass")
```

Variance Inflation Factors (VIF): VIF measures how much the variance of a regression coefficient is inflated due to collinearity. High values (>10) may indicate problematic collinearity.

airquality2 <- airquality
airquality2$Temp2 <- airquality2$Temp^2
airquality2$Month <- factor(airquality2$Month)

# Using data.table:
VIF(
  airquality2,
  posit_y = "Ozone",
  posit_x = c("Solar.R", "Wind", "Temp", "Month", "Day", "Temp2"),
  correct = FALSE
)

References

Exposing C++ functions and classes with Rcpp modules
Dirk Eddelbuettel and Romain François (2018)
https://dirk.eddelbuettel.com/code/rcpp/Rcpp-modules.pdf
RcppArmadillo: Easily Extending R with High-Performance C++ Code
Dirk Eddelbuettel and Conrad Sanderson (2012)
https://dirk.eddelbuettel.com/papers/RcppArmadillo-intro.pdf
Extending R with C++: A Brief Introduction to Rcpp
Dirk Eddelbuettel and James Joseph Balamuta (2018)
https://dirk.eddelbuettel.com/code/rcpp/Rcpp-introduction.pdf
MICE: Multivariate Imputation by Chained Equations in R
Stef van Buuren (2013)
CSCE 666: Pattern Analysis
Ricardo Gutierrez-Osuna (Fall 2013)

miceFast - Introduction

Maciej Nasinski

2025-04-15