Evaluates whether machine-learning models used in the ml argument of
cat2cat actually improve category assignment over simpler
baselines. The function runs a per-group train/test split across every
mapping group defined by the transition table.
Usage
cat2cat_ml_run(mappings, ml, ...)
# S3 method for class 'cat2cat_ml_run'
print(x, ...)

Arguments
- mappings
`named list` with fields `trans` and `direction`, and optionally `freqs_df`.
- ml
`named list` (optional) with fields `data`, `cat_var`, `method`, `features` and optional `args`, `on_fail`, `fail_warn`.
- ...
other arguments; for cat2cat_ml_run() this includes test_prop, the proportion of each group's observations held out as the test set (0.2 in the Examples). Unused by the print method.
- x
a cat2cat_ml_run instance created with cat2cat_ml_run().
Value
An object of class "cat2cat_ml_run" (a named list). Each
element corresponds to one mapping group and contains:
- naive
numeric(1): random-guess baseline accuracy (1/k, where k is the number of categories). Theoretical lower bound.
- freq
numeric(1): most-frequent-category baseline accuracy. A simple but often strong baseline.
- acc
named numeric vector: test-set accuracy for each ML method. Higher is better; compare to freq.
- brier
named numeric vector: Brier score for each ML method, computed as mean((1 - P(true))^2). Lower is better; range is [0, 1].
- mean_prob
named numeric vector: average probability assigned to the true class. Higher is better. This directly measures the quality of the probability weights used by cat2cat.
- naive_brier
numeric(1): Brier score for the uniform baseline (= (1 - 1/k)^2). Serves as a calibration reference.
- naive_mean_prob
numeric(1): mean P(true) for the uniform baseline (= 1/k). Equals naive by definition.
- freq_brier
numeric(1): Brier score using training-set category frequencies as probabilities.
- freq_mean_prob
numeric(1): mean P(true) using training-set category frequencies.
The object also carries an "ml_models" attribute listing the
methods evaluated. Use print() for a human-readable summary.
The print method returns x invisibly after printing a summary with
average accuracy per method, baseline comparisons, and failure rates.
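For programmatic access (rather than print()), the documented fields can be read off the returned list directly. A minimal sketch, assuming res is an object returned by cat2cat_ml_run() as in the Examples:

attr(res, "ml_models")     # methods that were evaluated
str(res[[1]])              # metrics for one mapping group
sapply(res, `[[`, "freq")  # frequency-baseline accuracy per group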
Details
For each mapping group (set of candidate categories linked by the transition table) the function:
1. Collects all observations from ml$data whose category belongs to the group.
2. Randomly splits them into training (1 - test_prop) and test (test_prop) sets.
3. Computes two baselines on the test set:
   - naive: accuracy of a random guess (\(1 / k\), where \(k\) is the number of candidate categories).
   - freq: accuracy of always predicting the most frequent category in the training set.
4. Trains each ML model specified in ml$method on the training set and records its classification accuracy on the test set.
Groups with fewer than 5 observations or only one candidate category are
skipped (their accuracy is recorded as NA).
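The per-group procedure can be pictured with a short sketch. It is illustrative only, not the package internals; split_group_baselines is a hypothetical helper applied to one group's rows:

split_group_baselines <- function(df, cat_var, test_prop = 0.2) {
  test_id <- sample(nrow(df), size = max(1, round(test_prop * nrow(df))))
  train <- df[-test_id, ]
  test <- df[test_id, ]
  k <- length(unique(df[[cat_var]]))                # candidate categories
  top <- names(which.max(table(train[[cat_var]])))  # most frequent in training
  c(naive = 1 / k,                                  # random-guess accuracy
    freq = mean(test[[cat_var]] == top))            # most-frequent accuracy
}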
Baseline-Only Diagnostics
To inspect only baseline diagnostics (naive, freq, and their
Brier/mean-probability variants), pass empty model and feature vectors:
ml$method = character(0) and ml$features = character(0).
In this mode, no ML models are trained, but baseline diagnostics are still
computed for each mapping group.
Understanding the Metrics
Three complementary metrics evaluate model quality:
Accuracy measures how often the model's top prediction matches the true category. Use this when you only care about the single most likely assignment. Higher is better; theoretical maximum is 1.0.
Mean P(true class) measures the average probability the model assigns to the correct category. This evaluates the full probability distribution, not just the top prediction. For cat2cat, where weights ARE probabilities distributed across candidates, this metric directly measures weight quality. Higher is better; range is \([0, 1]\).
Brier score measures the squared error between predicted probability and the true outcome: \((1 - P(true))^2\). Unlike log-loss, Brier score is bounded \([0, 1]\) and does not explode when P(true) is near zero. Lower is better; 0 means perfect prediction. For k categories, the naive baseline (uniform 1/k) gives Brier = \((1 - 1/k)^2\).
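How the three metrics relate can be seen in a minimal sketch; p_true and hit are hypothetical per-observation results standing in for a model's output on a test set:

p_true <- c(0.9, 0.6, 0.2, 0.7)    # P(true class) for each test row
hit <- c(TRUE, TRUE, FALSE, TRUE)  # is the top prediction correct?
mean(hit)                          # accuracy: 0.75
mean(p_true)                       # mean P(true class): 0.6
mean((1 - p_true)^2)               # Brier score: 0.225
k <- 4                             # uniform baseline with k candidates
c(naive_acc = 1 / k, naive_brier = (1 - 1 / k)^2)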
Choosing a Method
- If accuracy and mean P(true) are similar across methods, prefer simpler methods (freq, lda) over complex ones (rf, knn).
- If ML methods rarely beat the frequency baseline, use wei_freq_c2c: ML adds complexity without benefit.
- If the Brier score for ML is similar to or worse than naive, the model is not well calibrated. Consider wei_freq_c2c or a different ML method.
- Use cross_c2c to ensemble multiple methods if no single method dominates.
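These rules can be checked programmatically by averaging the documented fields across groups. A minimal sketch, assuming res as returned in the Examples:

methods <- attr(res, "ml_models")
avg_acc <- sapply(methods, function(m)
  mean(sapply(res, function(g) g$acc[m]), na.rm = TRUE))
avg_freq <- mean(sapply(res, function(g) g$freq), na.rm = TRUE)
avg_acc - avg_freq  # positive values: the method beats the frequency baseline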
Because the split is random, results will vary between runs. For more
stable estimates, call the function several times (see the repeated-run
sketch in the Examples) or use a larger ml$data set (e.g. pool multiple
survey waves).
Examples
# \donttest{
library("cat2cat")
data("occup", package = "cat2cat")
data("trans", package = "cat2cat")
occup_2006 <- occup[occup$year == 2006, ]
occup_2008 <- occup[occup$year == 2008, ]
# Forward direction: old encoding -> new encoding
# Use data from OLD encoding periods (2006, 2008)
ml_setup <- list(
data = rbind(occup_2006, occup_2008),
cat_var = "code",
method = c("knn", "rf", "lda", "nb"),
features = c("age", "sex", "edu", "exp", "parttime", "salary"),
args = list(k = 10, ntree = 50)
)
mappings <- list(trans = trans, direction = "forward")
set.seed(1234)
res <- cat2cat_ml_run(mappings, ml_setup, test_prop = 0.2)
print(res)
#> === cat2cat ML Cross-Validation Results ===
#>
#> ACCURACY (higher is better):
#> naive (1/k): 0.4380
#> freq (most common): 0.6447
#> knn: accuracy = 0.6424
#> rf: accuracy = 0.7026
#> lda: accuracy = 0.7075
#> nb: accuracy = 0.6870
#>
#> BRIER SCORE (lower is better, range 0-1):
#> naive: 0.3277
#> freq: 0.2512
#> knn: brier = 0.2687
#> rf: brier = 0.2191
#> lda: brier = 0.2141
#> nb: brier = 0.2414
#>
#> MEAN P(TRUE CLASS) (higher is better):
#> naive: 0.4380
#> freq: 0.5866
#> knn: mean P(true) = 0.5919
#> rf: mean P(true) = 0.6430
#> lda: mean P(true) = 0.6588
#> nb: mean P(true) = 0.6478
#>
#> ACCURACY: ML vs BASELINES (percent of groups where ML wins):
#> knn > naive: 80.0%
#> rf > naive: 97.7%
#> lda > naive: 90.5%
#> nb > naive: 90.7%
#> knn > freq: 20.0%
#> rf > freq: 41.9%
#> lda > freq: 40.5%
#> nb > freq: 39.5%
#>
#> SKIPPED GROUPS (single category or <5 observations):
#> knn: 98.3%
#> rf: 98.3%
#> lda: 98.4%
#> nb: 98.3%
# Typical good results show:
# - ML accuracy > freq baseline (ML adds value)
# - ML Brier < naive (well-calibrated probabilities)
# - ML mean P(true) > freq (better probability weights)
#
# If Brier(ML) >= Brier(naive), the model is poorly calibrated
# and wei_freq_c2c may be safer. Use cross_c2c() to ensemble.
# High failure rate is normal - most groups have <5 observations
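# For more stable estimates, repeat the random split and average.
# A minimal sketch (5 runs; extracts rf accuracy via the documented
# `acc` field; skipped groups are NA):
runs <- lapply(1:5, function(i) cat2cat_ml_run(mappings, ml_setup, test_prop = 0.2))
rf_acc <- sapply(runs, function(r) mean(sapply(r, function(g) g$acc["rf"]), na.rm = TRUE))
mean(rf_acc)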
# Baseline-only diagnostics (no ML models):
ml_baseline <- list(
data = rbind(occup_2006, occup_2008),
cat_var = "code",
method = character(0),
features = character(0)
)
baseline_cv <- cat2cat_ml_run(mappings, ml_baseline)
print(baseline_cv)
#> === cat2cat ML Cross-Validation Results ===
#>
#> ACCURACY (higher is better):
#> naive (1/k): 0.4380
#> freq (most common): 0.7045
#>
#> BRIER SCORE (lower is better, range 0-1):
#> naive: 0.3277
#> freq: 0.2194
#>
#> MEAN P(TRUE CLASS) (higher is better):
#> naive: 0.4380
#> freq: 0.6136
#>
#> ACCURACY: ML vs BASELINES (percent of groups where ML wins):
#>
#> SKIPPED GROUPS (single category or <5 observations):
# }