Advanced Workflows
Maciej Nasinski
2026-05-08
Source: vignettes/cat2cat_advanced.Rmd
This vignette collects the advanced cat2cat workflows in
one place: ML weights, multi-period chaining, panel data with
identifiers, aggregated data, and regression on replicated data. It
assumes you’ve read Get Started.
Use this vignette by module:
- ML weights if you want feature-informed probabilities rather than only naive or frequency weights.
- Multi-period chaining if your harmonisation spans 3 or more waves.
- Panel data with subject identifiers if you have a rotational panel or other stable subject IDs.
- Aggregated data and special cases if you only observe category totals or need hierarchical-code mappings.
- Regression on replicated data if your main task is estimation and inference after harmonisation.
library(cat2cat)
library(dplyr)
library(tidyr)
library(fixest)
data(occup, package = "cat2cat")
data(occup_panel, package = "cat2cat")
data(trans, package = "cat2cat")
data(verticals, package = "cat2cat")
data(verticals2, package = "cat2cat")
occup_2006 <- occup[occup$year == 2006, ]
occup_2008 <- occup[occup$year == 2008, ]
occup_2010 <- occup[occup$year == 2010, ]
occup_2012 <- occup[occup$year == 2012, ]
ML weights
Machine-learning weights are useful when the mapping table alone is too coarse and observed features help distinguish which target category is most plausible for a replicated observation.
This section shows:
- when ML weights are worth trying,
- how to specify the ml argument in cat2cat(),
- how to validate whether ML improves on simpler baselines,
- and how to handle rows where ML probabilities cannot be produced.
In practice, ML is most helpful when replication is substantial and
the ambiguous categories differ systematically by observed
characteristics such as age, education, experience, or salary. If ML
does not improve on the frequency baseline in
cat2cat_ml_run(), the simpler wei_freq_c2c
weights are usually the better choice.
ml_setup <- list(
data = bind_rows(occup_2010, occup_2012),
cat_var = "code",
method = c("knn", "rf", "lda"),
features = c("age", "sex", "edu", "exp", "parttime", "salary"),
args = list(k = 10, ntree = 50),
on_fail = "freq",
fail_warn = TRUE
)
result_ml <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward"),
ml = ml_setup
)
Validate whether ML adds value before using it in production:
cv <- cat2cat_ml_run(
mappings = list(trans = trans, direction = "backward"),
ml = ml_setup
)
print(cv)
Baseline-only diagnostics are also available:
ml_baseline <- list(
data = bind_rows(occup_2008, occup_2010),
cat_var = "code",
method = character(0),
features = character(0)
)
cv_baseline <- cat2cat_ml_run(
mappings = list(trans = trans, direction = "forward"),
ml = ml_baseline
)
print(cv_baseline)
If ML probabilities cannot be produced for some replicated rows, use on_fail and fail_warn:
- on_fail = "freq" (default): replace failed ML rows with wei_freq_c2c
- on_fail = "naive": replace failed ML rows with wei_naive_c2c
- on_fail = "na": keep failed ML rows as NA
- on_fail = "error": stop immediately
- fail_warn = TRUE (default): warn with affected rows/observations per method
- fail_warn = FALSE: silence the warning
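The fallback options above can be pictured with a small base-R sketch. It does not call cat2cat: apply_on_fail() and the toy weight vectors are hypothetical names used only for illustration, and the sketch assumes a failed ML row shows up as NA. The package's internals may differ.

```r
# Hypothetical helper (not part of cat2cat): mimic the documented on_fail
# options for a vector of ML weights in which failed rows are NA.
apply_on_fail <- function(wei_ml, wei_freq, wei_naive,
                          on_fail = c("freq", "naive", "na", "error")) {
  on_fail <- match.arg(on_fail)
  failed <- is.na(wei_ml)
  if (!any(failed)) return(wei_ml)
  switch(on_fail,
    freq  = replace(wei_ml, failed, wei_freq[failed]),   # fall back to frequency weights
    naive = replace(wei_ml, failed, wei_naive[failed]),  # fall back to naive weights
    na    = wei_ml,                                      # leave the gaps in place
    error = stop("ML probabilities missing for ", sum(failed), " row(s)")
  )
}

# Toy weights for four replicated rows; the last two have no ML probability.
wei_ml    <- c(0.7, 0.3, NA, NA)
wei_freq  <- c(0.6, 0.4, 0.9, 0.1)
wei_naive <- c(0.5, 0.5, 0.5, 0.5)

filled_freq  <- apply_on_fail(wei_ml, wei_freq, wei_naive, "freq")
filled_naive <- apply_on_fail(wei_ml, wei_freq, wei_naive, "naive")
filled_na    <- apply_on_fail(wei_ml, wei_freq, wei_naive, "na")
```

Rows with a valid ML probability are left untouched in every mode; only the failed rows change.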
Multi-period chaining
Repeated cross-sections or longitudinal datasets often contain more
than two waves. To bring all waves onto the same encoding, apply
cat2cat() iteratively, feeding the output of one step into
the next.
Use this module when you need one harmonised categorical variable across 3 or more periods. The main design choice is whether to chain backward or forward, and whether truncating hierarchical codes can reduce replication before chaining.
In the examples below, occup is a repeated cross-section
dataset, not a panel. The chained outputs therefore combine harmonised
cross-sections from multiple years.
Optional: passing mappings$freqs_df
In most cases you do not need to pass
mappings$freqs_df. If it is omitted, cat2cat()
computes base-period frequencies internally from the provided data.
Pass mappings$freqs_df only when you need explicit
control over base frequencies.
The object should be a two-column data frame: category name in the first column and counts in the second.
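A minimal sketch of building such an object with base R follows. The codes and the column names (code, n) are invented for illustration; only the two-column layout comes from the description above.

```r
# Toy base-period data; the codes are made up for illustration.
base_period <- data.frame(
  code = c("1111", "1111", "1112", "2221", "2221", "2221"),
  stringsAsFactors = FALSE
)

# Two-column data frame: category name first, count second.
freqs_df <- as.data.frame(table(base_period$code), stringsAsFactors = FALSE)
names(freqs_df) <- c("code", "n")  # column names are an arbitrary choice here
freqs_df
```

The result would then be supplied as the freqs_df element of the mappings list alongside trans and direction.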
Mapping table stability and truncation
Before chaining, inspect how disruptive the transition table is:
max_digits <- max(nchar(as.character(trans[[1]])), nchar(as.character(trans[[2]])))
stability <- sapply(1:max_digits, function(d) {
old_trunc <- substr(as.character(trans[[1]]), 1, d)
new_trunc <- substr(as.character(trans[[2]]), 1, d)
mean(old_trunc != new_trunc) * 100
})
data.frame(
digits = 1:max_digits,
pct_changed = round(stability, 1)
)
#> digits pct_changed
#> 1 1 6.5
#> 2 2 40.4
#> 3 3 77.5
#> 4 4 85.7
#> 5 5 100.0
#> 6 6 100.0
Truncating codes to fewer digits can reduce replication, but the effect depends on the mapping direction. For backward mapping truncation usually helps; for forward mapping it can move in the opposite direction. Collapsing detailed codes into broader prefixes can create additional many-to-one and one-to-many ties on the old-code side, which may increase replication in the mapped period.
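The direction effect can be seen on a toy mapping table. This is a sketch with invented codes, and it counts candidate codes per source code as a rough proxy for replication rather than computing rep_c2c itself; mean_candidates() is a hypothetical helper.

```r
# Made-up mapping: 4-digit old codes, 6-digit new codes.
trans_toy <- data.frame(
  old = c("1111",   "1111",   "1112",   "1113"),
  new = c("211101", "211102", "211103", "211104"),
  stringsAsFactors = FALSE
)

# Truncate the new side to 4 digits and drop duplicated rows.
trans_toy_trunc <- unique(data.frame(
  old = trans_toy$old,
  new = substr(trans_toy$new, 1, 4),
  stringsAsFactors = FALSE
))

# Hypothetical helper: average number of distinct `to` codes per `from` code.
mean_candidates <- function(tab, from, to) {
  mean(tapply(tab[[to]], tab[[from]], function(x) length(unique(x))))
}

back_full  <- mean_candidates(trans_toy,       "old", "new")  # old codes looking up new codes
back_trunc <- mean_candidates(trans_toy_trunc, "old", "new")
fwd_full   <- mean_candidates(trans_toy,       "new", "old")  # new codes looking up old codes
fwd_trunc  <- mean_candidates(trans_toy_trunc, "new", "old")

c(back_full = back_full, back_trunc = back_trunc,
  fwd_full = fwd_full, fwd_trunc = fwd_trunc)
```

In this toy table truncation removes the split of old code 1111 (fewer backward candidates) but merges all new codes into one prefix that points at three old codes (more forward candidates).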
occup_2008_trunc <- occup_2008
occup_2008_trunc$code <- substr(occup_2008_trunc$code, 1, 4)
occup_2010_trunc <- occup_2010
occup_2010_trunc$code <- substr(occup_2010_trunc$code, 1, 4)
trans_trunc <- unique(data.frame(
old = substr(trans$old, 1, 4),
new = substr(trans$new, 1, 4)
))
back_full <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward")
)
back_trunc <- cat2cat(
data = list(old = occup_2008_trunc, new = occup_2010_trunc,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans_trunc, direction = "backward")
)
fwd_full <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "forward")
)
fwd_trunc <- cat2cat(
data = list(old = occup_2008_trunc, new = occup_2010_trunc,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans_trunc, direction = "forward")
)
data.frame(
mapping = c("backward full (4->6)", "backward trunc (4->4)",
"forward full (6->4)", "forward trunc (4->4)"),
mean_rep = c(mean(back_full$old$rep_c2c), mean(back_trunc$old$rep_c2c),
mean(fwd_full$new$rep_c2c), mean(fwd_trunc$new$rep_c2c))
)
#> mapping mean_rep
#> 1 backward full (4->6) 23.479114
#> 2 backward trunc (4->4) 4.697948
#> 3 forward full (6->4) 1.363676
#> 4 forward trunc (4->4) 3.268608
Backward chaining
step1 <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward")
)
step2 <- cat2cat(
data = list(old = occup_2006, new = step1$old,
cat_var_old = "code", cat_var_new = "g_new_c2c",
time_var = "year"),
mappings = list(trans = trans, direction = "backward")
)
harmonised_back <- bind_rows(
step2$old,
step1$old,
step1$new,
dummy_c2c(occup_2012, "code")
)
Validation: weighted counts should match the original counts within each year.
harmonised_back %>%
group_by(year) %>%
summarise(weighted_n = round(sum(wei_freq_c2c)), .groups = "drop") %>%
left_join(count(occup, year), by = "year")
#> # A tibble: 4 × 3
#> year weighted_n n
#> <int> <dbl> <int>
#> 1 2006 16540 16540
#> 2 2008 17223 17223
#> 3 2010 17323 17323
#> 4 2012 18040 18040
Forward chaining
trans_fwd <- rbind(
trans,
data.frame(old = "no_cat",
new = setdiff(c(occup_2010$code, occup_2012$code), trans$new))
)
fwd1 <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans_fwd, direction = "forward")
)
fwd2 <- cat2cat(
data = list(old = fwd1$new, new = occup_2012,
cat_var_old = "g_new_c2c", cat_var_new = "code",
time_var = "year"),
mappings = list(trans = trans_fwd, direction = "forward")
)
harmonised_fwd <- bind_rows(
dummy_c2c(occup_2006, "code"),
fwd1$old,
fwd1$new,
fwd2$new
)
Adding ML to the chain
step1_ml <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward"),
ml = ml_setup
)
step2_ml <- cat2cat(
data = list(old = occup_2006, new = step1_ml$old,
cat_var_old = "code", cat_var_new = "g_new_c2c",
time_var = "year"),
mappings = list(trans = trans, direction = "backward"),
ml = ml_setup
)
Panel data with subject identifiers
If subjects have stable identifiers across waves, id_var
can reduce unnecessary replication by directly matching returning
subjects.
Use this module only when the identifier truly tracks the same subject across adjacent waves and short-run category changes are unlikely to represent genuine transitions rather than coding changes.
If you have a complete panel with every subject observed in both
periods and no missing category values, you may not need probabilistic
harmonisation at all. In that case, the target-period category can often
be joined back to the earlier record by id_var, and the
task is mostly a deterministic join. cat2cat() is more
useful when the panel is incomplete, rotational, or mixed with new
entrants and leavers, so some observations still need the mapping-table
replication path.
panel_old <- occup_panel[occup_panel$quarter == "2009Q4", ]
panel_new <- occup_panel[occup_panel$quarter == "2010Q1", ]
shared_ids <- intersect(panel_old$panel_id, panel_new$panel_id)
length(shared_ids)
#> [1] 450
result_id <- cat2cat(
data = list(
old = panel_old,
new = panel_new,
id_var = "panel_id",
cat_var = "code",
time_var = "quarter"
),
mappings = list(trans = trans, direction = "backward")
)
How id_var works:
- direct match: workers observed in both periods receive rep_c2c = 1 and weight 1,
- replication path: workers observed in only one period go through the standard mapping-table replication.
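The two paths can be sketched in base R on invented data. This only illustrates the split described above and is not how cat2cat() is implemented; the column names code_target, rep, and wei are hypothetical.

```r
# Toy panel waves; IDs and codes are invented for illustration.
wave_old <- data.frame(
  panel_id = c(1, 2, 3, 4),
  code     = c("1111", "1112", "1113", "1114"),
  stringsAsFactors = FALSE
)
wave_new <- data.frame(
  panel_id = c(2, 3, 5),
  code     = c("211101", "211201", "211301"),
  stringsAsFactors = FALSE
)

shared <- intersect(wave_old$panel_id, wave_new$panel_id)

# Direct match: returning subjects take their own code from the other wave.
matched <- merge(
  wave_old[wave_old$panel_id %in% shared, ],
  wave_new[, c("panel_id", "code")],
  by = "panel_id", suffixes = c("", "_target")
)
matched$rep <- 1  # a single row per subject
matched$wei <- 1  # carrying full weight

# Everyone else would still go through the mapping-table replication.
unmatched <- wave_old[!wave_old$panel_id %in% shared, ]
```

Here subjects 2 and 3 are matched directly, while subjects 1 and 4 remain for the replication path.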
table(result_id$old$rep_c2c)
#>
#> 1 2 3 4 5 6 7 8 9 10 11 13 16 18 19 21 22 23 24 25
#> 457 6 12 56 75 54 84 88 81 30 165 26 64 180 190 42 44 46 48 25
#> 33 34 46 70
#> 33 170 276 70
sum(result_id$old$wei_freq_c2c)
#> [1] 600
nrow(panel_old)
#> [1] 600
Compare with and without identifiers:
result_no_id <- cat2cat(
data = list(
old = panel_old,
new = panel_new,
cat_var = "code",
time_var = "quarter"
),
mappings = list(trans = trans, direction = "backward")
)
cat("WITH id_var average replication:", round(mean(result_id$old$rep_c2c), 2), "\n")
#> WITH id_var average replication: 18.49
cat("WITHOUT id_var average replication:", round(mean(result_no_id$old$rep_c2c), 2), "\n")
#> WITHOUT id_var average replication: 23.49
Use id_var when:
- identifiers are reliable,
- the time gap is short relative to true mobility,
- and direct matches are informative about the coding change.
Aggregated data and special cases
When only aggregate counts are available, use
cat2cat_agg() with mapping equations rather than micro-data
replication.
Use this module when you do not have person-level data, or when the classification itself has a hierarchical code structure that can be exploited to build coarser mappings.
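As a rough picture of what proportional redistribution means for aggregates, the sketch below splits one coarse total across two finer categories. The counts are invented, and it assumes the shares come from the period that observes the finer categories; the column name prop_c2c is borrowed from the package output for readability.

```r
# Invented counts: one coarse category in the old period, two finer ones in the new.
old_counts <- c(Automotive = 100)
new_counts <- c(Automotive1 = 60, Automotive2 = 40)

# Assumed rule: shares are taken from the period with the finer categories.
prop_c2c <- new_counts / sum(new_counts)

# The old row is repeated once per finer category, each copy carrying its share.
old_split <- data.frame(
  vertical = names(new_counts),
  counts   = unname(old_counts[["Automotive"]]),
  prop_c2c = unname(prop_c2c),
  stringsAsFactors = FALSE
)
old_split$weighted_counts <- old_split$counts * old_split$prop_c2c
old_split
```

The weighted counts add back up to the original coarse total, which is the property to check after any aggregate mapping.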
agg_old <- verticals[verticals$v_date == "2020-04-01", ]
agg_new <- verticals[verticals$v_date == "2020-05-01", ]
agg <- cat2cat_agg(
data = list(
old = agg_old,
new = agg_new,
cat_var = "vertical",
time_var = "v_date",
freq_var = "counts"
),
Automotive %<% c(Automotive1, Automotive2),
c(Kids1, Kids2) %>% c(Kids),
Home %>% c(Home, Supermarket)
)
Inspect how categories were proportionally redistributed:
agg$old[agg$old$vertical %in% c("Automotive1", "Automotive2"), ]
#> vertical sales counts v_date prop_c2c
#> 4 Automotive1 76.54302 135 2020-04-01 0.6452772
#> 4.1 Automotive2 76.54302 135 2020-04-01 0.3547228
agg$new[agg$new$vertical %in% c("Kids1", "Kids2"), ]
#> vertical sales counts v_date prop_c2c
#> 13 Kids1 105.4317 874 2020-05-01 0.3534726
#> 13.1 Kids2 105.4317 874 2020-05-01 0.6465274
Hierarchical codes can also be used to build coarser mapping tables when an official transition table is unavailable:
trans_2digit <- data.frame(
old = substr(trans$old, 1, 2),
new = substr(trans$new, 1, 2)
)
trans_2digit <- unique(trans_2digit)
cat("2-digit mapping rows:", nrow(trans_2digit),
"vs full mapping rows:", nrow(trans))
#> 2-digit mapping rows: 122 vs full mapping rows: 2666
This works for classifications with stable prefix hierarchies such as ISCO, ICD, NACE, CPC, or HS codes.
Regression on replicated data
The replication is neutral for regressions on non-mapped covariates because per-subject weights sum to one. Standard errors, however, must be corrected because replication inflates the row count.
Use this module when your end goal is estimation rather than descriptive harmonisation. The key issue is not coefficient bias for non-mapped regressors, but valid inference after replication.
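Both points can be reproduced on simulated data with base R alone. The sketch below uses invented variables (x, y, wei) and random replication counts; it is not the vignette's data, only an illustration of why weights that sum to one per subject leave coefficients unchanged while naive standard errors shrink.

```r
set.seed(1)
n <- 50
orig <- data.frame(id = seq_len(n), x = rnorm(n))
orig$y <- 1 + 2 * orig$x + rnorm(n)

# Replicate each row 1 to 4 times with equal weights that sum to one per id.
reps <- sample(1:4, n, replace = TRUE)
replicated <- orig[rep(seq_len(n), times = reps), ]
replicated$wei <- 1 / rep(reps, times = reps)

fit_orig <- lm(y ~ x, data = orig)
fit_rep  <- lm(y ~ x, data = replicated, weights = wei)

# Coefficients agree because the weighted normal equations are unchanged.
all.equal(coef(fit_orig), coef(fit_rep))

# Naive standard errors are smaller only because the row count is inflated.
se_orig <- summary(fit_orig)$coefficients[, "Std. Error"]
se_rep  <- summary(fit_rep)$coefficients[, "Std. Error"]
cbind(se_orig, se_rep)
```

In this toy setting the two sets of standard errors differ only through the residual degrees of freedom, which is why a rescaling step is enough to repair them.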
Neutral impact demonstration
lms_orig <- lm(
I(log(salary)) ~ age + sex + factor(edu) + parttime + exp,
data = occup,
weights = multiplier
)
lms_harmonised <- lm(
I(log(salary)) ~ age + sex + factor(edu) + parttime + exp,
data = cat2cat_data_back,
weights = multiplier * wei_freq_c2c
)
summary_c2c(lms_harmonised, df_old = nrow(occup))
#> Estimate Std. Error t value Pr(>|t|) correct
#> (Intercept) 8.567134022 0.0055759904 1536.43270 0.000000e+00 2.246708
#> age -0.001669601 0.0001429794 -11.67722 1.689342e-31 2.246708
#> sexTRUE 0.254854849 0.0015855706 160.73384 0.000000e+00 2.246708
#> factor(edu)2 -0.123208217 0.0031187751 -39.50532 0.000000e+00 2.246708
#> factor(edu)3 -0.390643025 0.0037961258 -102.90571 0.000000e+00 2.246708
#> factor(edu)4 -0.465471793 0.0022196689 -209.70325 0.000000e+00 2.246708
#> factor(edu)5 -0.443598202 0.0031191398 -142.21812 0.000000e+00 2.246708
#> factor(edu)6 -0.678797186 0.0022404213 -302.97747 0.000000e+00 2.246708
#> factor(edu)7 -0.617843013 0.0192393288 -32.11354 6.112599e-226 2.246708
#> factor(edu)8 -0.717563371 0.0035222572 -203.72259 0.000000e+00 2.246708
#> parttime 1.999007607 0.0037223872 537.02301 0.000000e+00 2.246708
#> exp 0.011337142 0.0001368157 82.86435 0.000000e+00 2.246708
#> std.error_c statistic_c p.value_c reference_dist
#> (Intercept) 0.0125276207 683.859629 0.000000e+00 t
#> age 0.0003212329 -5.197479 2.025825e-07 t
#> sexTRUE 0.0035623137 71.541945 0.000000e+00 t
#> factor(edu)2 0.0070069760 -17.583650 4.650384e-69 t
#> factor(edu)3 0.0085287850 -45.802893 0.000000e+00 t
#> factor(edu)4 0.0049869472 -93.338022 0.000000e+00 t
#> factor(edu)5 0.0070077954 -63.300678 0.000000e+00 t
#> factor(edu)6 0.0050335718 -134.853979 0.000000e+00 t
#> factor(edu)7 0.0432251482 -14.293601 2.792952e-46 t
#> factor(edu)8 0.0079134824 -90.676056 0.000000e+00 t
#> parttime 0.0083631160 239.026649 0.000000e+00 t
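#> exp 0.0003073848 36.882568 6.567139e-295 t
summary_c2c() scales naive standard errors by the replication factor shown in the correct column (2.246708 here) to give std.error_c; statistic_c and p.value_c are then recomputed from the corrected errors.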
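The arithmetic can be checked by hand from the printed table. The sketch below copies two rows of the output above and reproduces the corrected columns; it does not show how summary_c2c() derives the factor itself.

```r
# Values copied from the (Intercept) and age rows of the output above.
estimate  <- c(intercept = 8.567134022,  age = -0.001669601)
std_error <- c(intercept = 0.0055759904, age = 0.0001429794)
correct   <- 2.246708

std_error_c <- std_error * correct     # compare with the std.error_c column
statistic_c <- estimate / std_error_c  # compare with the statistic_c column

round(std_error_c, 7)
round(statistic_c, 3)
```

Small differences in the last digits are expected because the printed factor is rounded.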
Fixed effects regression
harmonised_fe <- cat2cat_data_back %>%
prune_c2c(method = "nonzero") %>%
mutate(orig_obs_id = interaction(year, index_c2c, drop = TRUE, lex.order = TRUE)) %>%
filter(!is.na(g_new_c2c), !is.na(salary), salary > 0)
fe_model_cluster <- feols(
log(salary) ~ age + sex + factor(edu) + parttime + exp | g_new_c2c + year,
data = harmonised_fe,
weights = ~multiplier * wei_freq_c2c,
cluster = ~orig_obs_id
)
summary(fe_model_cluster)
#> OLS estimation, Dep. Var.: log(salary)
#> Observations: 348,744
#> Weights: multiplier * wei_freq_c2c
#> Fixed-effects: g_new_c2c: 1,561, year: 4
#> Standard-errors: Clustered (orig_obs_id)
#> Estimate Std. Error t value Pr(>|t|)
#> age -0.000589 0.000353 -1.67043 0.094839 .
#> sexTRUE 0.127013 0.005018 25.30943 < 2.2e-16 ***
#> factor(edu)2 -0.122142 0.009111 -13.40599 < 2.2e-16 ***
#> factor(edu)3 -0.220058 0.009472 -23.23189 < 2.2e-16 ***
#> factor(edu)4 -0.254005 0.007823 -32.46795 < 2.2e-16 ***
#> factor(edu)5 -0.236455 0.009999 -23.64695 < 2.2e-16 ***
#> factor(edu)6 -0.329733 0.008764 -37.62281 < 2.2e-16 ***
#> factor(edu)7 -0.299026 0.027005 -11.07290 < 2.2e-16 ***
#> factor(edu)8 -0.335429 0.010119 -33.14734 < 2.2e-16 ***
#> parttime 1.871971 0.011668 160.44062 < 2.2e-16 ***
#> exp 0.008152 0.000332 24.51840 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.36761 Adj. R2: 0.713868
#> Within R2: 0.555095
Do not cluster directly on index_c2c after binding
multiple waves, because index_c2c is created separately
inside each cat2cat() call and can repeat across years.
Instead, build a wave-specific original-observation identifier such as
interaction(year, index_c2c) and cluster on that. This
treats all replications of the same source row as one cluster without
incorrectly merging different people from different waves.
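A toy example makes the clash visible. The data frame below is invented; it assumes, as described above, that each cat2cat() call numbers its source rows independently, so the same index value can appear in several waves.

```r
# Two toy waves whose index_c2c both start at 1. The first two 2008 rows are
# replications of the same source row.
stacked <- data.frame(
  year      = c(2008, 2008, 2008, 2010, 2010),
  index_c2c = c(1,    1,    2,    1,    2)
)

# Wave-specific identifier for the original observation.
stacked$orig_obs_id <- interaction(stacked$year, stacked$index_c2c,
                                   drop = TRUE, lex.order = TRUE)

n_clusters_index <- length(unique(stacked$index_c2c))  # 2: different waves are merged
n_clusters_obs   <- nlevels(stacked$orig_obs_id)       # 4: one cluster per source row and wave
c(index_c2c = n_clusters_index, orig_obs_id = n_clusters_obs)
```

Clustering on index_c2c alone would treat row 1 of 2008 and row 1 of 2010 as the same person, while the combined identifier keeps them apart and still groups the two replications of the 2008 row.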
Choosing the right advanced workflow
| Problem | Recommended tool |
|---|---|
| Need feature-informed weights | cat2cat(..., ml = ...) + cat2cat_ml_run() |
| Need 3+ wave harmonisation | iterative cat2cat() chaining |
| Stable subject identifiers across waves | id_var |
| Only aggregated counts available | cat2cat_agg() |
| Regression on replicated data | summary_c2c() or clustered inference |
Next steps
- Get Started for core concepts and the two-period workflow
- Choosing Weights and Validating ML for weight comparisons and robustness checks