组合方案以按R中的组替换中位数 [英] Combining scenario to Replace Medians by Groups in R

查看:51
本文介绍了组合方案以按R中的组替换中位数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有数据集

mydat <- 
structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("52382MCK", 
"52499MCK"), class = "factor"), item = c(11709L, 11709L, 11709L, 
11709L, 11708L, 11708L, 11708L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11202L, 11203L, 11203L, 11204L, 11204L, 11205L, 11205L
), sales = c(30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 
15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 
20L, 15L, 2L, 10L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), action = c(0L, 
1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
1L, 1L)), row.names = c(NA, -35L), class = "data.frame")
# coerce to data.table
setDT(mydat)

使用此数据集,将执行多项操作。

with this dataset, several operations are performed.

1. selecting scenario by groups.

因此有操作列。它只能有两个值零(0)或一个(1)。

So there is action column. It can have only two values zero(0) or one(1).

方案的场景是第一个动作类别之前的零个动作类别数量和一类动作后为零。

The scenarios are the number of zero categories of action before first category of action and the number of zeros after one category of action.

For example
52382МСК    11709

在这种情况下,我们有1个零类别的操作列。在动作col的第一类别之前,以及在动作col的第一类别之后的两个零。注意:当我们有2个零类别的动作col时,可能是这种情况。

it is scenario when we have 1 zero category of action col. before first category of action col , and two zeros after first category of action col. Note: maybe scenario when we have 2 zero category of action col. before first category of action col , and 1 zero after first category of action col.

mydat1

code    item    sales   action
52382МСК    11709   30  0
52382МСК    11709   10  1
52382МСК    11709   20  0
52382МСК    11709   15  0

要检测这种情况,我使用此脚本/
该脚本非常有效,感谢@Uwe

to detect this scenario i use this script/ This script very well works, thank for @Uwe

library(data.table)
library(magrittr)

max_zeros <- 3
zeros <- sapply(0:max_zeros, stringr::str_dup, string = "0")
names(zeros) <- as.character(nchar(zeros))
sc <- CJ(zeros.before = zeros, zeros.after = zeros)[
  , scenario.name := paste(nchar(zeros.before), nchar(zeros.after), sep = "-")][
    , action.pattern := sprintf("%s1+(?=%s)", zeros.before, zeros.after)][]
# special case: all zero
sc0 <- data.table(
  zeros.before = NA,
  zeros.after = NA, 
  scenario.name = "no1", 
  action.pattern = "^0+$")
sc <- rbind(sc0, sc)

然后

setDT(mydat)
class <- mydat[, .(scenario.name = sc$scenario.name[
  paste(action, collapse = "") %>% 
    stringr::str_count(sc$action.pattern) %>%
    is_greater_than(0) %>% 
    which() %>% 
    max()
  ]),
  by = .(code, item)][]

class
mydat[class, on = .(code, item)]

所以我获得了情景类别的数据。

So i get data with class of scenario.

2.operation it is replace median.

对于每种情况,均按零类别计算中位数。

For each scenario median by zero category is calculated.

我需要按操作列按1前面的零类别来计算中值,即按一列操作列之前的零,按操作列按2个零来计算中位数。
仅按销售列对第一类动作列
进行的中位数替换。
如果中位数大于销售额,则不要替换。

I need to calculate the median by 1 preceding zeros category by action column, i.e. which go before one category of action column, and by 2 zeros by action column that go after the one category. The median replacing performed only for first category of action column by sale column. if median is more than the sales, then do not replace it.

要做到这一点,我使用脚本

To do it i use the script

sales_action <- function(DF, zeros_before, zeros_after) {
  library(data.table)
  library(magrittr)
  action_pattern <- 
    do.call(sprintf, 
            c(fmt = "%s1+(?=%s)", 
              stringr::str_dup("0", c(zeros_before, zeros_after)) %>% as.list()
            ))
  message("Action pattern used: ", action_pattern)
  setDT(DF)[, rn := .I]
  tmp <- DF[, paste(action, collapse = "") %>% 
              stringr::str_locate_all(action_pattern) %>% 
              as.data.table() %>% 
              lapply(function(x) rn[x]),
            by = .(code, item)][
              , end := end + zeros_after]
  DF[tmp, on = .(code, item, rn >= start, rn <= end), 
     med := as.double(median(sales[action == 0])), by = .EACHI][
       , output := as.double(sales)][action == 1, output := pmin(sales, med)][
         , c("rn", "med") := NULL][]
}

然后

sales_action(mydat, 1L, 2L)

所以我得到了结果。

每次我必须手动输入要用中位数替换的情况

Each time i must manually enter the scenario to replacing by median

sales_action(mydat, 1L, 2L)
sales_action(mydat, 3L, 1L)
sales_action(mydat, 2L, 2L)

等。

如何为此,替换中位数会自动针对所有可能的情况执行
,这样我就不会每次都写
sales_action(mydat,.L,.L)

How to do that replacing median was perform for all possible scenarios automatically so that I do not write every time sales_action(mydat, .L, .L)

因此是输出示例

code    i    tem    sales   action  output  pattern
52382MCK    11709   30        0       30    01+00
52382MCK    11709   10        1       10    01+00
52382MCK    11709   20        0       20    01+00
52382MCK    11709   15        0       15    01+00
52382MCK    1170    8         0        8    01+00
52382MCK    1170    10        1        8    01+00
52382MCK    1170    2         0        2    01+00
52382MCK    1170    15        0        15   01+00


推荐答案

如果我理解正确,则OP希望通过将操作期间的 sales 数据与销售的中位数进行比较来分析销售操作的成功

If I understand correctly, the OP wants to analyse the success of sales actions by comparing sales figures during actions with the median sales of the periods immediately before and after the sales action.

存在一些挑战:


  1. 每个代码项目组可能有多个销售操作。

  2. 在一次销售行动前后,可用数据可能比请求的3天少三天。

  1. There might be more than one sales action per code, item group.
  2. The available data might cover less than the requested 3 three days each before and after a sales action.

恕我直言,场景的引入是解决问题2的绕道而行。

IMHO, the introduction of scenarios is a detour to handle issue 2.

Th下面的方法


  • 标识每个代码 item 组,

  • 之前最多添加三个零动作行,在之后最多添加三行>每个销售操作,

  • 计算这些行的中位数销售,并且

  • 更新输出如果销售操作中的销售数字超过了周围的零操作行的中位数。

  • identifies the sales actions within each code, item group,
  • picks up to three zero action rows before and up to three rows after each sales action,
  • computes the median sales of those rows, and
  • updates output in case the sales figure within a sales action exceeds the median of the surrounding zero action rows.

术语类别是由OP创造的,用于区分销售活动的时间段( action == 1L 的连续条纹)和之前和之后的零操作时间段。

The term category has been coined by the OP to distinguish between periods of sales actions (contiguous streaks of action == 1L) and the zero action periods before and after.

library(data.table)
# coerce to data.table and create categories
setDT(mydat)[, cat := rleid(action), by = .(code, item)][]

# extract action categories, identify preceeding & succeeding zero action categories
mycat <- mydat[, .(action = first(action)), by = .(code, item, cat = cat)][
  , `:=`(before = cat - 1L, after = cat + 1L)][action == 1L]

mycat

       code  item cat action before after
1: 52382MCK 11709   2      1      1     3
2: 52382MCK 11708   2      1      1     3
3: 52382MCK 11710   2      1      1     3
4: 52382MCK 11710   4      1      3     5
5: 52382MCK 11710   6      1      5     7
6: 52499MCK 11203   2      1      1     3
7: 52499MCK 11205   1      1      0     2

请注意,组 52382MCK,11710 包含三个单独的销售操作。在和在可能指向不存在的 cat ,但是

Note that group 52382MCK, 11710 includes three separate sales actions. before and after may point to non-existing cat but this will be rectified automatically during the subsequent joins.

# compute median of surrouding zero action categories
action_cat_median <- 
  rbind(
    # get sales from up to 3 zero action rows before action category
    mydat[mycat, on = .(code, item, cat = before), 
          .(sales = tail(sales, 3), i.cat), by =.EACHI],
    # get sales from up to 3 zero action rows after action category
    mydat[mycat, on = .(code, item, cat = after), 
          .(sales = head(sales, 3), i.cat), by =.EACHI]
  )[
    # remove empty groups
    !is.na(sales)][
      # compute median for each action category
      , .(med = as.double(median(sales))), by = .(code, item, cat = i.cat)]

action_cat_median




       code  item cat  med
1: 52382MCK 11709   2 20.0
2: 52382MCK 11708   2  2.5
3: 52382MCK 11710   2 10.0
4: 52382MCK 11710   4 10.0
5: 52382MCK 11710   6 10.0
6: 52499MCK 11203   2  2.0




# prepare result
mydat[, output := as.double(sales)][
  # update join
  action_cat_median, on = .(code, item, cat), output := pmin(sales, med)]

编辑:或者,调用 pmin()可以替换为非装备联接,该联接仅更新销售额超过中位数的行:

Alternatively, the call to pmin() can be replaced by a non-equi join which updates only rows where sales exceeds the median:

# prepare result, alternative approach
mydat[, output := as.double(sales)][
  # non-equi update join
  action_cat_median, on = .(code, item, cat, output > med), output := med]


mydat




        code  item sales action cat output
 1: 52382MCK 11709    30      0   1   30.0
 2: 52382MCK 11709    10      1   2   10.0
 3: 52382MCK 11709    20      0   3   20.0
 4: 52382MCK 11709    15      0   3   15.0
 5: 52382MCK 11708     2      0   1    2.0
 6: 52382MCK 11708    10      1   2    2.5
 7: 52382MCK 11708     3      0   3    3.0
 8: 52382MCK 11710    30      0   1   30.0
 9: 52382MCK 11710    10      0   1   10.0
10: 52382MCK 11710    20      0   1   20.0
11: 52382MCK 11710    15      1   2   10.0
12: 52382MCK 11710     2      0   3    2.0
13: 52382MCK 11710    10      0   3   10.0
14: 52382MCK 11710     3      0   3    3.0
15: 52382MCK 11710    30      0   3   30.0
16: 52382MCK 11710    10      0   3   10.0
17: 52382MCK 11710    20      0   3   20.0
18: 52382MCK 11710    15      1   4   10.0
19: 52382MCK 11710     2      0   5    2.0
20: 52382MCK 11710    10      0   5   10.0
21: 52382MCK 11710     3      0   5    3.0
22: 52382MCK 11710    30      0   5   30.0
23: 52382MCK 11710    10      0   5   10.0
24: 52382MCK 11710    20      0   5   20.0
25: 52382MCK 11710    15      1   6   10.0
26: 52382MCK 11710     2      0   7    2.0
27: 52382MCK 11710    10      0   7   10.0
28: 52382MCK 11710     3      0   7    3.0
29: 52499MCK 11202     2      0   1    2.0
30: 52499MCK 11203     2      0   1    2.0
31: 52499MCK 11203     2      1   2    2.0
32: 52499MCK 11204     2      0   1    2.0
33: 52499MCK 11204     2      0   1    2.0
34: 52499MCK 11205     2      1   1    2.0
35: 52499MCK 11205     2      1   1    2.0
        code  item sales action cat output


以下行已更新:

mydat[output != sales]




       code  item sales action cat output
1: 52382MCK 11708    10      1   2    2.5
2: 52382MCK 11710    15      1   2   10.0
3: 52382MCK 11710    15      1   4   10.0
4: 52382MCK 11710    15      1   6   10.0


这篇关于组合方案以按R中的组替换中位数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆