Manual mapping for an aggregated panel dataset

Manually maps an inconsistently coded categorical variable using user-provided mapping equations.

cat2cat_agg(
  data = list(old = NULL, new = NULL, cat_var_old = NULL, cat_var_new = NULL, time_var =
    NULL, freq_var = NULL),
  ...
)

Arguments

data: list with named fields `old`, `new`, `cat_var_old`, `cat_var_new`, `time_var`, `freq_var`. The single `cat_var` field is deprecated (see Details).
...: mapping equations; the direction of each is set with one of `>`, `<`, `%>%`, `%<%`.
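
To make the shape of a mapping equation concrete, the sketch below captures one without evaluating it and lists its parts. `describe_equation()` is a hypothetical helper written for this illustration; it is not part of cat2cat and does not claim to mirror how the package processes equations internally.

```r
# Hypothetical helper for illustration only; it is not part of cat2cat.
describe_equation <- function(eq) {
  # A mapping equation is a binary call: <left> <operator> <right>.
  stopifnot(is.call(eq), length(eq) == 3)
  list(
    operator = as.character(eq[[1]]),  # one of ">", "<", "%>%", "%<%"
    left     = all.vars(eq[[2]]),      # bare category names on the left
    right    = all.vars(eq[[3]])       # bare category names on the right
  )
}

# quote() captures the equation without evaluating it, so the bare
# category names do not need to exist as R objects.
eq <- quote(Automotive %<% c(Automotive1, Automotive2))
str(describe_equation(eq))
```

The category names are written unquoted, and several categories on one side are wrapped in `c()`, as in the Examples section.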

Value

Named `list` with two fields, `old` and `new`, each a data.frame. Additional columns are added to each data.frame rather than stored as separate metadata, because the returned datasets are new ones in which observations may be replicated. For transparency, the probability and the number of replications are recorded for each observation in the `data.frame`.
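
Because observations may be replicated, raw sums over a returned data.frame can double count. The base R sketch below shows why the probability column should be used as a weight. The data are invented for illustration, and the sketch assumes that the weights of the replicas of one original row sum to 1; only the column name `prop_c2c` is taken from the Examples section.

```r
# Toy stand-in for one returned data.frame; all values are invented.
# One original row (counts = 120) is assumed to have been replicated
# into two candidate categories with weights that sum to 1.
replicated <- data.frame(
  vertical = c("Automotive1", "Automotive2", "Books"),
  counts   = c(120, 120, 50),
  prop_c2c = c(0.75, 0.25, 1)
)

# The raw sum counts the replicated row twice.
naive_total <- sum(replicated$counts)

# Weighting by prop_c2c recovers the original total (120 + 50).
weighted_total <- sum(replicated$counts * replicated$prop_c2c)

# Weighted counts per category.
weighted_by_category <- tapply(
  replicated$counts * replicated$prop_c2c,
  replicated$vertical,
  sum
)
```

The same weighting appears in the Examples section, where each measure is multiplied by `prop_c2c` before it is summed.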

Details

data argument - list with fields

"old": data.frame - older time point in the panel
"new": data.frame - more recent time point in the panel
"cat_var": character - deprecated - name of the categorical variable
"cat_var_old": character - name of the categorical variable in the old period
"cat_var_new": character - name of the categorical variable in the new period
"time_var": character - name of the time variable
"freq_var": character - name of the frequency variable
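
The field list above can be checked before calling the function. The following is a minimal sketch of such a check; `check_agg_data()` is a hypothetical helper, not a cat2cat function, and it assumes that the time and frequency columns are present in both data.frames, as they are in the Examples section.

```r
# Hypothetical pre-flight check; not a cat2cat function.
check_agg_data <- function(data) {
  required <- c("old", "new", "cat_var_old", "cat_var_new",
                "time_var", "freq_var")
  missing_fields <- setdiff(required, names(data))
  if (length(missing_fields) > 0) {
    stop("missing fields: ", paste(missing_fields, collapse = ", "))
  }
  stopifnot(is.data.frame(data[["old"]]), is.data.frame(data[["new"]]))
  # Assumption: every *_var entry names a column of the matching data.frame.
  shared <- c(data[["time_var"]], data[["freq_var"]])
  stopifnot(
    data[["cat_var_old"]] %in% names(data[["old"]]),
    data[["cat_var_new"]] %in% names(data[["new"]]),
    all(shared %in% names(data[["old"]])),
    all(shared %in% names(data[["new"]]))
  )
  invisible(TRUE)
}

# Toy one-row data.frame with invented values.
toy <- data.frame(vertical = "Books", v_date = "2020-04-01", counts = 10)
check_agg_data(list(
  old = toy, new = toy,
  cat_var_old = "vertical", cat_var_new = "vertical",
  time_var = "v_date", freq_var = "counts"
))
```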

Note

All mapping equations have to be valid ones.

Examples

data("verticals", package = "cat2cat")
agg_old <- verticals[verticals$v_date == "2020-04-01", ]
agg_new <- verticals[verticals$v_date == "2020-05-01", ]

# cat2cat_agg can map in both directions at once,
# although usually only the old or the new representation is needed

agg <- cat2cat_agg(
  data = list(
    old = agg_old,
    new = agg_new,
    cat_var_old = "vertical",
    cat_var_new = "vertical",
    time_var = "v_date",
    freq_var = "counts"
  ),
  Automotive %<% c(Automotive1, Automotive2),
  c(Kids1, Kids2) %>% c(Kids),
  Home %>% c(Home, Supermarket)
)

## possible processing
library("dplyr")
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
agg %>%
  bind_rows() %>%
  group_by(v_date, vertical) %>%
  summarise(
    sales = sum(sales * prop_c2c),
    counts = sum(counts * prop_c2c),
    v_date = first(v_date)
  )
#> `summarise()` has grouped output by 'v_date'. You can override using the
#> `.groups` argument.
#> # A tibble: 22 × 4
#> # Groups:   v_date [2]
#>    v_date     vertical    sales  counts
#>    <chr>      <chr>       <dbl>   <dbl>
#>  1 2020-04-01 Automotive1  49.4    87.1
#>  2 2020-04-01 Automotive2  27.2    47.9
#>  3 2020-04-01 Books       104.   7489  
#>  4 2020-04-01 Clothes     105.   1078  
#>  5 2020-04-01 Electronics  87.9  9544  
#>  6 2020-04-01 Fashion      94.5  7399  
#>  7 2020-04-01 Health       94.4 16102  
#>  8 2020-04-01 Home         94.3  2414  
#>  9 2020-04-01 Kids1       103.  17686  
#> 10 2020-04-01 Kids2       111.  32349  
#> # ℹ 12 more rows