替换中位数时，Rscript工作不正确 [英] incorrect Rscript work when replacing medians

查看：109 发布时间：2020/5/28 20:33:30 r dplyr data.table plyr

本文介绍了替换中位数时，Rscript工作不正确的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有数据集

mydat=structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), .Label = "52382МСК", class = "factor"), item = c(11709L, 
11709L, 11709L, 11709L, 1170L, 1170L, 1170L, 1170L), sales = c(30L, 
10L, 20L, 15L, 8L, 10L, 2L, 15L), action = c(0L, 1L, 0L, 0L, 
0L, 1L, 0L, 0L)), .Names = c("code", "item", "sales", "action"
), class = "data.frame", row.names = c(NA, -8L))

按代码和项目分为两组

code    item
52382МСК    11709
52382МСК    1170

我也有行动专栏.它只能有两个值零(0)或一(1).我需要按操作列按1前面的零类别来计算中位数，即按操作列的一类之前的零，按操作列按2零来计算中位数的中位数. 如果中位数大于销售额，则不要替换.

如果我按动作列具有三个前面的零类别，即在动作列的一个类别之前具有三个零，并且按动作列具有一个类别的三个零之前，则此解决方案很好. 但是如果我按动作列有1个在零前面的零，即在动作列的一类前面有零，而在动作列的2个零在那一类之后.它无法正常工作

replacements <- 
  data_frame(
    action1      = which(mydat$action == 1L),
    group        = rep(1:length(action1), each = 2, length.out = length(action1)),
    sales1       = mydat$sales[action1],
    sales_before = mydat$sales[action1 -1L],
    sales_after  = mydat$sales[action1 +2L]
  ) %>%
  group_by(group) %>%
  mutate(
    med   = median(c(sales_before, sales_after)),
    output = pmin(sales1, med)
  )

mydat$output <- mydat$sales
mydat$output[replacements$action1] <- replacements$output

我得到输出

   code  item sales action output
1 52382МСК 11709    30      0     30
2 52382МСК 11709    10      1     10
3 52382МСК 11709    20      0     20
4 52382МСК 11709    15      0     15
5 52382МСК  1170     8      0      8
6 52382МСК  1170    10      1     10
7 52382МСК  1170     2      0      2
8 52382МСК  1170    15      0     15

但输出应该是

   code  item sales action output
1 52382МСК 11709    30      0     30
2 52382МСК 11709    10      1     10
3 52382МСК 11709    20      0     20
4 52382МСК 11709    15      0     15
5 52382МСК  1170     8      0      8
6 52382МСК  1170    10      1     **8**
7 52382МСК  1170     2      0      2
8 52382МСК  1170    15      0     15

我如何获得正确的输出?

编辑

   code item sales action
1     a    b     2      0
2     a    b     4      0
3     a    b     3      0
4     a    b    10      1
5     a    b     4      1
6     a    b    10      0
7     a    b     6      0
8     a    b     6      0
9     c    d     2      0
10    c    d     4      0
11    c    d     3      0
12    c    d    10      1
13    c    d    10      0
14    c    d     6      0
15    c    d     6      0

解决方案

该代码有几个严重缺陷:

它完全忽略了按code和item
它仅选择两个值进行中值计算，而不是零操作行的整个范围，而OP要求在每个action == 1之前包括1行，在每个action == 1之后包括2行.

如果我正确理解OP的要求，

OP希望通过计算每个销售操作前后的一段时间内的平均销售额(不包括该操作期间的销售量)并将其与实际销售量进行比较来衡量销售活动的效果
分别用code和item标识的每种产品.
每个销售操作的时间长度可能会有所不同(条纹为action == 1)
以及每次操作前后的天数.
预期产出是零工作日的销售数字.在工作日，该数字将被周围的零工作日的中位数销售代替，但前提是小于实际销售数字.

下面的函数采用三个参数，日期框和之前和之后销售活动的零天数.它会返回一个data.table，并按上述规则定义添加output列.

sales_action <- function(DF, zeros_before, zeros_after) {
  library(data.table)
  library(magrittr)
  action_pattern <- 
    do.call(sprintf, 
            c(fmt = "%s1+(?=%s)", 
              stringr::str_dup("0", c(zeros_before, zeros_after)) %>% as.list()
            ))
  message("Action pattern used: ", action_pattern)
  setDT(DF)[, rn := .I]
  tmp <- DF[, paste(action, collapse = "") %>% 
              stringr::str_locate_all(action_pattern) %>% 
              as.data.table() %>% 
              lapply(function(x) rn[x]),
            by = .(code, item)][
              , end := end + zeros_after]
  DF[tmp, on = .(code, item, rn >= start, rn <= end), 
     med := as.double(median(sales[action == 0])), by = .EACHI][
       , output := as.double(sales)][action == 1, output := pmin(sales, med)][
         , c("rn", "med") := NULL][]
}

对于OP给出的mydat，我们得到

sales_action(mydat, 1L, 2L)

Action pattern used: 01+00
       code  item sales action output
1: 52382MCK 11709    30      0     30
2: 52382MCK 11709    10      1     10
3: 52382MCK 11709    20      0     20
4: 52382MCK 11709    15      0     15
5: 52382MCK  1170     8      0      8
6: 52382MCK  1170    10      1      8
7: 52382MCK  1170     2      0      2
8: 52382MCK  1170    15      0     15

这符合OP的预期结果.

作为第二个测试用例，我修改了OP编辑中的数据，以在其中一个组中包含第二个动作:

sales_action(mydat2, 1L, 2L)

Action pattern used: 01+00
    code item sales action output
 1:    a    b     2      0      2
 2:    a    b     4      0      4
 3:    a    b     3      0      3
 4:    a    b    10      1      3
 5:    a    b     4      1      3
 6:    a    b     2      0      2
 7:    a    b     4      0      4
 8:    a    b     3      0      3
 9:    a    b    10      1      6
10:    a    b     4      1      4
11:    a    b    10      0     10
12:    a    b     6      0      6
13:    a    b     6      0      6
14:    c    d     2      0      2
15:    c    d     4      0      4
16:    c    d     3      0      3
17:    c    d    10      1      6
18:    c    d    10      0     10
19:    c    d     6      0      6
20:    c    d     6      0      6

该样本包括第一个产品的两个操作，两个产品的持续时间均为2天，一个产品的持续时间为1天.

对于第4、5行，采用周围零动作行的中值，即median(c(3, 2, 4)) = 3.

对于第9行，第10行，c(3，10，6)的中位数为6，小于第9行的实际销售额.因此，只有第9行被中位数代替.

对于第17行，c(3，10，6)的中位数为6，代替了output中的实际销售数字.

如果在我们得到通知之前和之后需要3次零行动，那么

sales_action(mydat2, 3L, 3L)

Action pattern used: 0001+(?=000)
    code item sales action output
 1:    a    b     2      0      2
 2:    a    b     4      0      4
 3:    a    b     3      0      3
 4:    a    b    10      1      3
 5:    a    b     4      1      3
 6:    a    b     2      0      2
 7:    a    b     4      0      4
 8:    a    b     3      0      3
 9:    a    b    10      1      5
10:    a    b     4      1      4
11:    a    b    10      0     10
12:    a    b     6      0      6
13:    a    b     6      0      6
14:    c    d     2      0      2
15:    c    d     4      0      4
16:    c    d     3      0      3
17:    c    d    10      1      5
18:    c    d    10      0     10
19:    c    d     6      0      6
20:    c    d     6      0      6

说明

关键是要确定哪些行属于每个工作日条纹前后的时间段.由于action仅由0和1组成，因此我们可以在带有正则表达式的字符串中使用模式匹配.

为此，将action列折叠为一个字符串(对于每个code，item组而言是单独的).然后，使用stringr::str_locate_all()查找action pattern的开始和结束位置. action pattern是一个正则表达式，它查找由所需数量的前导和尾随0包围的1的任何序列.

实际上，正则表达式要复杂一些，因为我们必须使用 lookahead 来捕获重叠的动作模式，例如000111000111000中的000111000.前瞻正则表达式的end位置指向每个序列中的最后一个1而不是最后一个0，因此稍后将调整end.

最后，开始位置和结束位置将转换为DF中的行位置，而不是相对于组的位置，并在tmp中返回.

现在，我们进行非等额联接，以附加的med列汇总和更新DF，该列包含属于每个start，end范围的零操作行的中位销售额. /p>

其余步骤是准备output列并删除帮助器列.

数据

mydat2 <-
structure(list(code = c("a", "a", "a", "a", "a", "a", "a", "a", 
"a", "a", "a", "a", "a", "c", "c", "c", "c", "c", "c", "c"), 
    item = c("b", "b", "b", "b", "b", "b", "b", "b", "b", "b", 
    "b", "b", "b", "d", "d", "d", "d", "d", "d", "d"), sales = c(2L, 
    4L, 3L, 10L, 4L, 2L, 4L, 3L, 10L, 4L, 10L, 6L, 6L, 2L, 4L, 
    3L, 10L, 10L, 6L, 6L), action = c(0L, 0L, 0L, 1L, 1L, 0L, 
    0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), row.names = c(NA, 
-20L), class = "data.frame")

I have dataset

mydat=structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), .Label = "52382МСК", class = "factor"), item = c(11709L, 
11709L, 11709L, 11709L, 1170L, 1170L, 1170L, 1170L), sales = c(30L, 
10L, 20L, 15L, 8L, 10L, 2L, 15L), action = c(0L, 1L, 0L, 0L, 
0L, 1L, 0L, 0L)), .Names = c("code", "item", "sales", "action"
), class = "data.frame", row.names = c(NA, -8L))

it has two groups by code and item

code    item
52382МСК    11709
52382МСК    1170

Also i have action column. It can have only two values zero(0) or one(1). I need to calculate the median by 1 preceding zeros category by action column, i.e. which go before one category of action column, and by 2 zeros by action column that go after the one category. if median is more than the sales, then do not replace it.

This solution good works if i have three preceding zeros category by action column, i.e. which go before one category of action column, and by three zeros by action column that go after the one category. but if i have 1 preceding zeros category by action column, i.e. which go before one category of action column, and by 2 zeros by action column that go after the one category. it doesn't work correct

replacements <- 
  data_frame(
    action1      = which(mydat$action == 1L),
    group        = rep(1:length(action1), each = 2, length.out = length(action1)),
    sales1       = mydat$sales[action1],
    sales_before = mydat$sales[action1 -1L],
    sales_after  = mydat$sales[action1 +2L]
  ) %>%
  group_by(group) %>%
  mutate(
    med   = median(c(sales_before, sales_after)),
    output = pmin(sales1, med)
  )

mydat$output <- mydat$sales
mydat$output[replacements$action1] <- replacements$output

I get output

   code  item sales action output
1 52382МСК 11709    30      0     30
2 52382МСК 11709    10      1     10
3 52382МСК 11709    20      0     20
4 52382МСК 11709    15      0     15
5 52382МСК  1170     8      0      8
6 52382МСК  1170    10      1     10
7 52382МСК  1170     2      0      2
8 52382МСК  1170    15      0     15

but output should be

   code  item sales action output
1 52382МСК 11709    30      0     30
2 52382МСК 11709    10      1     10
3 52382МСК 11709    20      0     20
4 52382МСК 11709    15      0     15
5 52382МСК  1170     8      0      8
6 52382МСК  1170    10      1     **8**
7 52382МСК  1170     2      0      2
8 52382МСК  1170    15      0     15

how can i get correct output?

edit

   code item sales action
1     a    b     2      0
2     a    b     4      0
3     a    b     3      0
4     a    b    10      1
5     a    b     4      1
6     a    b    10      0
7     a    b     6      0
8     a    b     6      0
9     c    d     2      0
10    c    d     4      0
11    c    d     3      0
12    c    d    10      1
13    c    d    10      0
14    c    d     6      0
15    c    d     6      0

解决方案

The code has several severe flaws:

it complete ignores the grouping by code and item
it picks only two values for median calculation instead of the full range of zero action rows while the OP had requested to include 1 row before and 2 rows after each action == 1.

If I understand OP's requirements correctly,

the OP wants to measure the effect of a sales action by calculating the median sales in a period around each sales action (excluding the sales during the action) and comparing it with the actual sales
separately for each product identified by code and item.
The length of each sales action can vary (streaks of action == 1)
as well as the number of days before and after each action.
The expected output is the sales figures on zero action days. On action days, this figure is to be replaced by the median sales of the surrounding zero action days but only if it is less than actual sales figure.

The function below takes three arguments, the dateframe and the number of zero days before and after a sales action. It returns a data.table with the output column appended as defined by the rules above.

sales_action <- function(DF, zeros_before, zeros_after) {
  library(data.table)
  library(magrittr)
  action_pattern <- 
    do.call(sprintf, 
            c(fmt = "%s1+(?=%s)", 
              stringr::str_dup("0", c(zeros_before, zeros_after)) %>% as.list()
            ))
  message("Action pattern used: ", action_pattern)
  setDT(DF)[, rn := .I]
  tmp <- DF[, paste(action, collapse = "") %>% 
              stringr::str_locate_all(action_pattern) %>% 
              as.data.table() %>% 
              lapply(function(x) rn[x]),
            by = .(code, item)][
              , end := end + zeros_after]
  DF[tmp, on = .(code, item, rn >= start, rn <= end), 
     med := as.double(median(sales[action == 0])), by = .EACHI][
       , output := as.double(sales)][action == 1, output := pmin(sales, med)][
         , c("rn", "med") := NULL][]
}

For mydat as given by the OP we get

sales_action(mydat, 1L, 2L)

Action pattern used: 01+00
       code  item sales action output
1: 52382MCK 11709    30      0     30
2: 52382MCK 11709    10      1     10
3: 52382MCK 11709    20      0     20
4: 52382MCK 11709    15      0     15
5: 52382MCK  1170     8      0      8
6: 52382MCK  1170    10      1      8
7: 52382MCK  1170     2      0      2
8: 52382MCK  1170    15      0     15

This is in line with OP's expected result.

As a second test case, I have modified the data from OP's edit to include a second action in a one of the groups:

sales_action(mydat2, 1L, 2L)

Action pattern used: 01+00
    code item sales action output
 1:    a    b     2      0      2
 2:    a    b     4      0      4
 3:    a    b     3      0      3
 4:    a    b    10      1      3
 5:    a    b     4      1      3
 6:    a    b     2      0      2
 7:    a    b     4      0      4
 8:    a    b     3      0      3
 9:    a    b    10      1      6
10:    a    b     4      1      4
11:    a    b    10      0     10
12:    a    b     6      0      6
13:    a    b     6      0      6
14:    c    d     2      0      2
15:    c    d     4      0      4
16:    c    d     3      0      3
17:    c    d    10      1      6
18:    c    d    10      0     10
19:    c    d     6      0      6
20:    c    d     6      0      6

The sample includes two actions for the first product, both with a duration of 2 days and one action of 1 day duration for the second product.

For rows 4, 5 the median of the surrounding zero action rows, i.e, median(c(3, 2, 4)) = 3, was taken.

For rows 9, 10, the median of c(3, 10, 6) is 6 which is less than the actual sales in row 9. So, only row 9 was replaced by the median value.

For row 17 the median of c(3, 10, 6) is 6 which replace the actual sales figure in output.

If called for 3 zero action days before and after we get

sales_action(mydat2, 3L, 3L)

Action pattern used: 0001+(?=000)
    code item sales action output
 1:    a    b     2      0      2
 2:    a    b     4      0      4
 3:    a    b     3      0      3
 4:    a    b    10      1      3
 5:    a    b     4      1      3
 6:    a    b     2      0      2
 7:    a    b     4      0      4
 8:    a    b     3      0      3
 9:    a    b    10      1      5
10:    a    b     4      1      4
11:    a    b    10      0     10
12:    a    b     6      0      6
13:    a    b     6      0      6
14:    c    d     2      0      2
15:    c    d     4      0      4
16:    c    d     3      0      3
17:    c    d    10      1      5
18:    c    d    10      0     10
19:    c    d     6      0      6
20:    c    d     6      0      6

Explanation

The key point is to identify which rows belong to the period around each streak of action days. As action consists only of 0 and 1 we can use pattern matching in character strings with an regular expression.

For this, the action column is collapsed into a character string (separately for each code, item group). Then, stringr::str_locate_all() is used to find the start and end positions of the action pattern. action pattern is a regular expression that is looking for any sequence of 1s surrounded by the required number of leading and trailing 0s, resp.

In fact, the regular expression is somewhat more complicated as we have to use lookahead in order to capture overlapping action patterns such as 000111000 in 000111000111000. The end position of the lookahead regex points to the last 1 in each sequence instead of the last 0, so end will be adjusted later on.

Finally, the start and end positions are converted into row locations in DF rather than locations relativ to the group and are returned in tmp.

Now, we do a non-equi join which aggregates and updates DF with an additional med column which contains the median sales of the zero action rows which belong to each start, end range.

The remaining steps are to prepare the output column and to remove the helper columns.

Data

mydat2 <-
structure(list(code = c("a", "a", "a", "a", "a", "a", "a", "a", 
"a", "a", "a", "a", "a", "c", "c", "c", "c", "c", "c", "c"), 
    item = c("b", "b", "b", "b", "b", "b", "b", "b", "b", "b", 
    "b", "b", "b", "d", "d", "d", "d", "d", "d", "d"), sales = c(2L, 
    4L, 3L, 10L, 4L, 2L, 4L, 3L, 10L, 4L, 10L, 6L, 6L, 2L, 4L, 
    3L, 10L, 10L, 6L, 6L), action = c(0L, 0L, 0L, 1L, 1L, 0L, 
    0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), row.names = c(NA, 
-20L), class = "data.frame")

这篇关于替换中位数时，Rscript工作不正确的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

替换中位数时，Rscript工作不正确 [英] incorrect Rscript work when replacing medians

问题描述

编辑

说明

数据

edit

Explanation

Data

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

替换中位数时，Rscript工作不正确 [英] incorrect Rscript work when replacing medians

问题描述

编辑

说明

数据

edit

Explanation

Data

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭