Manual mapping of an inconsistently coded categorical variable according to user-provided mappings (equations).
cat2cat_agg(
  data = list(old = NULL, new = NULL, cat_var_old = NULL, cat_var_new = NULL,
    time_var = NULL, freq_var = NULL),
  ...
)
`data`: list with 6 named fields `old`, `new`, `cat_var_old`, `cat_var_new`, `time_var`, `freq_var` (the single `cat_var` field is deprecated).
`...`: mapping equations whose direction is set with any of `>`, `<`, `%>%`, `%<%`.
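The mapping equations are passed unevaluated, so their category names do not need to exist as R objects. The sketch below shows, in base R only, how such an equation could be taken apart into its operator and the category names on each side. It is not the package's implementation: `parse_mapping` and the field names it returns (`operator`, `points_right`, `lhs`, `rhs`) are hypothetical and introduced only for illustration.

```r
# Hypothetical helper: split one mapping equation into its parts
# without evaluating it (the operator does not have to be defined).
parse_mapping <- function(eq) {
  e <- substitute(eq)            # capture the unevaluated call
  op <- as.character(e[[1]])     # e.g. "%<%", "%>%", "<" or ">"
  list(
    operator = op,
    points_right = grepl(">", op, fixed = TRUE),
    lhs = all.vars(e[[2]]),      # category names on the left-hand side
    rhs = all.vars(e[[3]])       # category names on the right-hand side
  )
}

parse_mapping(Automotive %<% c(Automotive1, Automotive2))
parse_mapping(c(Kids1, Kids2) %>% c(Kids))
```

Because the equation is only inspected and never evaluated, this works even when neither `%<%` nor the category symbols are defined in the session.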
A named list with 2 fields, `old` and `new`, each a data.frame with additional columns. The columns are added, rather than attached as separate metadata, because the returned datasets are new ones in which observations may be replicated. For transparency, the probability and the number of replications are stored with each observation in the `data.frame`.
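To make the idea of replicated observations concrete, here is a minimal base R sketch of splitting one aggregated row across two categories of the other period. It assumes the split is weighted by the frequencies observed in the other period, which is an assumption of this sketch rather than a statement about the package internals. All data values are made up, and the column name `n_rep` is hypothetical; only `prop_c2c` appears in this page's example.

```r
# Toy aggregates; all values are made up for illustration.
old <- data.frame(vertical = c("Automotive", "Books"),
                  counts = c(100, 50), stringsAsFactors = FALSE)
new <- data.frame(vertical = c("Automotive1", "Automotive2", "Books"),
                  counts = c(30, 10, 60), stringsAsFactors = FALSE)

# Split the old "Automotive" row across the two new categories,
# weighting by the new-period frequencies (assumption of this sketch).
targets <- c("Automotive1", "Automotive2")
freqs <- new$counts[match(targets, new$vertical)]
props <- freqs / sum(freqs)

row <- old[old$vertical == "Automotive", ]
split_rows <- row[rep(1, length(targets)), ]   # replicate the observation
split_rows$vertical <- targets
split_rows$prop_c2c <- props                   # probability of each replicate
split_rows$n_rep <- length(targets)            # hypothetical column name

# Rows that need no mapping keep a probability of 1.
kept <- old[old$vertical != "Automotive", ]
kept$prop_c2c <- 1
kept$n_rep <- 1

old_mapped <- rbind(split_rows, kept)
rownames(old_mapped) <- NULL
old_mapped
```

Since the probabilities of the replicates of one observation sum to 1, weighting any numeric column by `prop_c2c` before summing leaves the period total unchanged.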
data argument - list with fields
`old`: data.frame, older time point in the panel
`new`: data.frame, more recent time point in the panel
`cat_var`: character, deprecated, name of the categorical variable
`cat_var_old`: character, name of the categorical variable in the old period
`cat_var_new`: character, name of the categorical variable in the new period
`time_var`: character, name of the time variable
`freq_var`: character, name of the frequency variable
All mapping equations have to be valid ones.
data("verticals", package = "cat2cat")
agg_old <- verticals[verticals$v_date == "2020-04-01", ]
agg_new <- verticals[verticals$v_date == "2020-05-01", ]
# cat2cat_agg - can map in both directions at once
# although usually we want to have the old or the new representation
agg <- cat2cat_agg(
  data = list(
    old = agg_old,
    new = agg_new,
    cat_var_old = "vertical",
    cat_var_new = "vertical",
    time_var = "v_date",
    freq_var = "counts"
  ),
  Automotive %<% c(Automotive1, Automotive2),
  c(Kids1, Kids2) %>% c(Kids),
  Home %>% c(Home, Supermarket)
)
## possible processing
library("dplyr")
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
agg %>%
  bind_rows() %>%
  group_by(v_date, vertical) %>%
  summarise(
    sales = sum(sales * prop_c2c),
    counts = sum(counts * prop_c2c),
    v_date = first(v_date)
  )
#> `summarise()` has grouped output by 'v_date'. You can override using the
#> `.groups` argument.
#> # A tibble: 22 × 4
#> # Groups: v_date [2]
#> v_date vertical sales counts
#> <chr> <chr> <dbl> <dbl>
#> 1 2020-04-01 Automotive1 49.4 87.1
#> 2 2020-04-01 Automotive2 27.2 47.9
#> 3 2020-04-01 Books 104. 7489
#> 4 2020-04-01 Clothes 105. 1078
#> 5 2020-04-01 Electronics 87.9 9544
#> 6 2020-04-01 Fashion 94.5 7399
#> 7 2020-04-01 Health 94.4 16102
#> 8 2020-04-01 Home 94.3 2414
#> 9 2020-04-01 Kids1 103. 17686
#> 10 2020-04-01 Kids2 111. 32349
#> # ℹ 12 more rows