替换中位数时,Rscript工作不正确 [英] incorrect Rscript work when replacing medians
问题描述
我有数据集
mydat=structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "52382МСК", class = "factor"), item = c(11709L,
11709L, 11709L, 11709L, 1170L, 1170L, 1170L, 1170L), sales = c(30L,
10L, 20L, 15L, 8L, 10L, 2L, 15L), action = c(0L, 1L, 0L, 0L,
0L, 1L, 0L, 0L)), .Names = c("code", "item", "sales", "action"
), class = "data.frame", row.names = c(NA, -8L))
按代码和项目分为两组
code item
52382МСК 11709
52382МСК 1170
我也有行动专栏.它只能有两个值零(0)或一(1).我需要按操作列按1前面的零类别来计算中位数,即按操作列的一类之前的零,按操作列按2零来计算中位数的中位数. 如果中位数大于销售额,则不要替换.
如果我按动作列具有三个前面的零类别,即在动作列的一个类别之前具有三个零,并且按动作列具有一个类别的三个零之前,则此解决方案很好. 但是如果我按动作列有1个在零前面的零,即在动作列的一类前面有零,而在动作列的2个零在那一类之后.它无法正常工作
replacements <-
data_frame(
action1 = which(mydat$action == 1L),
group = rep(1:length(action1), each = 2, length.out = length(action1)),
sales1 = mydat$sales[action1],
sales_before = mydat$sales[action1 -1L],
sales_after = mydat$sales[action1 +2L]
) %>%
group_by(group) %>%
mutate(
med = median(c(sales_before, sales_after)),
output = pmin(sales1, med)
)
mydat$output <- mydat$sales
mydat$output[replacements$action1] <- replacements$output
我得到输出
code item sales action output
1 52382МСК 11709 30 0 30
2 52382МСК 11709 10 1 10
3 52382МСК 11709 20 0 20
4 52382МСК 11709 15 0 15
5 52382МСК 1170 8 0 8
6 52382МСК 1170 10 1 10
7 52382МСК 1170 2 0 2
8 52382МСК 1170 15 0 15
但输出应该是
code item sales action output
1 52382МСК 11709 30 0 30
2 52382МСК 11709 10 1 10
3 52382МСК 11709 20 0 20
4 52382МСК 11709 15 0 15
5 52382МСК 1170 8 0 8
6 52382МСК 1170 10 1 **8**
7 52382МСК 1170 2 0 2
8 52382МСК 1170 15 0 15
我如何获得正确的输出?
编辑
code item sales action
1 a b 2 0
2 a b 4 0
3 a b 3 0
4 a b 10 1
5 a b 4 1
6 a b 10 0
7 a b 6 0
8 a b 6 0
9 c d 2 0
10 c d 4 0
11 c d 3 0
12 c d 10 1
13 c d 10 0
14 c d 6 0
15 c d 6 0
该代码有几个严重缺陷:
- 它完全忽略了按
code
和item
进行的分组
- 它仅选择两个值进行中值计算,而不是零操作行的整个范围,而OP要求在每个
action == 1
之前包括1行,在每个action == 1
之后包括2行.
如果我正确理解OP的要求,
- OP希望通过计算每个销售操作前后的一段时间内的平均销售额(不包括该操作期间的销售量)并将其与实际销售量进行比较来衡量销售活动的效果
- 分别用
code
和item
标识的每种产品. - 每个销售操作的时间长度可能会有所不同(条纹为
action == 1
) - 以及每次操作前后的天数.
- 预期产出是零工作日的销售数字.在工作日,该数字将被周围的零工作日的中位数销售代替,但前提是小于实际销售数字.
下面的函数采用三个参数,日期框和 之前和之后销售活动的零天数.它会返回一个data.table,并按上述规则定义添加output
列.
sales_action <- function(DF, zeros_before, zeros_after) {
library(data.table)
library(magrittr)
action_pattern <-
do.call(sprintf,
c(fmt = "%s1+(?=%s)",
stringr::str_dup("0", c(zeros_before, zeros_after)) %>% as.list()
))
message("Action pattern used: ", action_pattern)
setDT(DF)[, rn := .I]
tmp <- DF[, paste(action, collapse = "") %>%
stringr::str_locate_all(action_pattern) %>%
as.data.table() %>%
lapply(function(x) rn[x]),
by = .(code, item)][
, end := end + zeros_after]
DF[tmp, on = .(code, item, rn >= start, rn <= end),
med := as.double(median(sales[action == 0])), by = .EACHI][
, output := as.double(sales)][action == 1, output := pmin(sales, med)][
, c("rn", "med") := NULL][]
}
对于OP给出的mydat
,我们得到
sales_action(mydat, 1L, 2L)
Action pattern used: 01+00 code item sales action output 1: 52382MCK 11709 30 0 30 2: 52382MCK 11709 10 1 10 3: 52382MCK 11709 20 0 20 4: 52382MCK 11709 15 0 15 5: 52382MCK 1170 8 0 8 6: 52382MCK 1170 10 1 8 7: 52382MCK 1170 2 0 2 8: 52382MCK 1170 15 0 15
这符合OP的预期结果.
作为第二个测试用例,我修改了OP编辑中的数据,以在其中一个组中包含第二个动作:
sales_action(mydat2, 1L, 2L)
Action pattern used: 01+00 code item sales action output 1: a b 2 0 2 2: a b 4 0 4 3: a b 3 0 3 4: a b 10 1 3 5: a b 4 1 3 6: a b 2 0 2 7: a b 4 0 4 8: a b 3 0 3 9: a b 10 1 6 10: a b 4 1 4 11: a b 10 0 10 12: a b 6 0 6 13: a b 6 0 6 14: c d 2 0 2 15: c d 4 0 4 16: c d 3 0 3 17: c d 10 1 6 18: c d 10 0 10 19: c d 6 0 6 20: c d 6 0 6
该样本包括第一个产品的两个操作,两个产品的持续时间均为2天,一个产品的持续时间为1天.
对于第4、5行,采用周围零动作行的中值,即median(c(3, 2, 4))
= 3.
对于第9行,第10行,c(3,10,6)的中位数为6,小于第9行的实际销售额.因此,只有第9行被中位数代替.
对于第17行,c(3,10,6)的中位数为6,代替了output
中的实际销售数字.
如果在我们得到通知之前和之后需要3次零行动,那么
sales_action(mydat2, 3L, 3L)
Action pattern used: 0001+(?=000) code item sales action output 1: a b 2 0 2 2: a b 4 0 4 3: a b 3 0 3 4: a b 10 1 3 5: a b 4 1 3 6: a b 2 0 2 7: a b 4 0 4 8: a b 3 0 3 9: a b 10 1 5 10: a b 4 1 4 11: a b 10 0 10 12: a b 6 0 6 13: a b 6 0 6 14: c d 2 0 2 15: c d 4 0 4 16: c d 3 0 3 17: c d 10 1 5 18: c d 10 0 10 19: c d 6 0 6 20: c d 6 0 6
说明
关键是要确定哪些行属于每个工作日条纹前后的时间段.由于action
仅由0
和1
组成,因此我们可以在带有正则表达式的字符串中使用模式匹配.
为此,将action
列折叠为一个字符串(对于每个code
,item
组而言是单独的).然后,使用stringr::str_locate_all()
查找action pattern
的开始和结束位置. action pattern
是一个正则表达式,它查找由所需数量的前导和尾随0
包围的1
的任何序列.
实际上,正则表达式要复杂一些,因为我们必须使用 lookahead 来捕获重叠的动作模式,例如000111000111000
中的000111000
.前瞻正则表达式的end
位置指向每个序列中的最后一个1
而不是最后一个0
,因此稍后将调整end
.
最后,开始位置和结束位置将转换为DF
中的行位置,而不是相对于组的位置,并在tmp
中返回.
现在,我们进行非等额联接,以附加的med
列汇总和更新DF
,该列包含属于每个start
,end
范围的零操作行的中位销售额. /p>
其余步骤是准备output
列并删除帮助器列.
数据
mydat2 <-
structure(list(code = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "a", "a", "a", "c", "c", "c", "c", "c", "c", "c"),
item = c("b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
"b", "b", "b", "d", "d", "d", "d", "d", "d", "d"), sales = c(2L,
4L, 3L, 10L, 4L, 2L, 4L, 3L, 10L, 4L, 10L, 6L, 6L, 2L, 4L,
3L, 10L, 10L, 6L, 6L), action = c(0L, 0L, 0L, 1L, 1L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), row.names = c(NA,
-20L), class = "data.frame")
I have dataset
mydat=structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "52382МСК", class = "factor"), item = c(11709L,
11709L, 11709L, 11709L, 1170L, 1170L, 1170L, 1170L), sales = c(30L,
10L, 20L, 15L, 8L, 10L, 2L, 15L), action = c(0L, 1L, 0L, 0L,
0L, 1L, 0L, 0L)), .Names = c("code", "item", "sales", "action"
), class = "data.frame", row.names = c(NA, -8L))
it has two groups by code and item
code item
52382МСК 11709
52382МСК 1170
Also i have action column. It can have only two values zero(0) or one(1). I need to calculate the median by 1 preceding zeros category by action column, i.e. which go before one category of action column, and by 2 zeros by action column that go after the one category. if median is more than the sales, then do not replace it.
This solution good works if i have three preceding zeros category by action column, i.e. which go before one category of action column, and by three zeros by action column that go after the one category. but if i have 1 preceding zeros category by action column, i.e. which go before one category of action column, and by 2 zeros by action column that go after the one category. it doesn't work correct
replacements <-
data_frame(
action1 = which(mydat$action == 1L),
group = rep(1:length(action1), each = 2, length.out = length(action1)),
sales1 = mydat$sales[action1],
sales_before = mydat$sales[action1 -1L],
sales_after = mydat$sales[action1 +2L]
) %>%
group_by(group) %>%
mutate(
med = median(c(sales_before, sales_after)),
output = pmin(sales1, med)
)
mydat$output <- mydat$sales
mydat$output[replacements$action1] <- replacements$output
I get output
code item sales action output
1 52382МСК 11709 30 0 30
2 52382МСК 11709 10 1 10
3 52382МСК 11709 20 0 20
4 52382МСК 11709 15 0 15
5 52382МСК 1170 8 0 8
6 52382МСК 1170 10 1 10
7 52382МСК 1170 2 0 2
8 52382МСК 1170 15 0 15
but output should be
code item sales action output
1 52382МСК 11709 30 0 30
2 52382МСК 11709 10 1 10
3 52382МСК 11709 20 0 20
4 52382МСК 11709 15 0 15
5 52382МСК 1170 8 0 8
6 52382МСК 1170 10 1 **8**
7 52382МСК 1170 2 0 2
8 52382МСК 1170 15 0 15
how can i get correct output?
edit
code item sales action
1 a b 2 0
2 a b 4 0
3 a b 3 0
4 a b 10 1
5 a b 4 1
6 a b 10 0
7 a b 6 0
8 a b 6 0
9 c d 2 0
10 c d 4 0
11 c d 3 0
12 c d 10 1
13 c d 10 0
14 c d 6 0
15 c d 6 0
The code has several severe flaws:
- it complete ignores the grouping by
code
anditem
- it picks only two values for median calculation instead of the full range of zero action rows while the OP had requested to include 1 row before and 2 rows after each
action == 1
.
If I understand OP's requirements correctly,
- the OP wants to measure the effect of a sales action by calculating the median sales in a period around each sales action (excluding the sales during the action) and comparing it with the actual sales
- separately for each product identified by
code
anditem
. - The length of each sales action can vary (streaks of
action == 1
) - as well as the number of days before and after each action.
- The expected output is the sales figures on zero action days. On action days, this figure is to be replaced by the median sales of the surrounding zero action days but only if it is less than actual sales figure.
The function below takes three arguments, the dateframe and the number of zero days before and after a sales action. It returns a data.table with the output
column appended as defined by the rules above.
sales_action <- function(DF, zeros_before, zeros_after) {
library(data.table)
library(magrittr)
action_pattern <-
do.call(sprintf,
c(fmt = "%s1+(?=%s)",
stringr::str_dup("0", c(zeros_before, zeros_after)) %>% as.list()
))
message("Action pattern used: ", action_pattern)
setDT(DF)[, rn := .I]
tmp <- DF[, paste(action, collapse = "") %>%
stringr::str_locate_all(action_pattern) %>%
as.data.table() %>%
lapply(function(x) rn[x]),
by = .(code, item)][
, end := end + zeros_after]
DF[tmp, on = .(code, item, rn >= start, rn <= end),
med := as.double(median(sales[action == 0])), by = .EACHI][
, output := as.double(sales)][action == 1, output := pmin(sales, med)][
, c("rn", "med") := NULL][]
}
For mydat
as given by the OP we get
sales_action(mydat, 1L, 2L)
Action pattern used: 01+00 code item sales action output 1: 52382MCK 11709 30 0 30 2: 52382MCK 11709 10 1 10 3: 52382MCK 11709 20 0 20 4: 52382MCK 11709 15 0 15 5: 52382MCK 1170 8 0 8 6: 52382MCK 1170 10 1 8 7: 52382MCK 1170 2 0 2 8: 52382MCK 1170 15 0 15
This is in line with OP's expected result.
As a second test case, I have modified the data from OP's edit to include a second action in a one of the groups:
sales_action(mydat2, 1L, 2L)
Action pattern used: 01+00 code item sales action output 1: a b 2 0 2 2: a b 4 0 4 3: a b 3 0 3 4: a b 10 1 3 5: a b 4 1 3 6: a b 2 0 2 7: a b 4 0 4 8: a b 3 0 3 9: a b 10 1 6 10: a b 4 1 4 11: a b 10 0 10 12: a b 6 0 6 13: a b 6 0 6 14: c d 2 0 2 15: c d 4 0 4 16: c d 3 0 3 17: c d 10 1 6 18: c d 10 0 10 19: c d 6 0 6 20: c d 6 0 6
The sample includes two actions for the first product, both with a duration of 2 days and one action of 1 day duration for the second product.
For rows 4, 5 the median of the surrounding zero action rows, i.e, median(c(3, 2, 4))
= 3, was taken.
For rows 9, 10, the median of c(3, 10, 6) is 6 which is less than the actual sales in row 9. So, only row 9 was replaced by the median value.
For row 17 the median of c(3, 10, 6) is 6 which replace the actual sales figure in output
.
If called for 3 zero action days before and after we get
sales_action(mydat2, 3L, 3L)
Action pattern used: 0001+(?=000) code item sales action output 1: a b 2 0 2 2: a b 4 0 4 3: a b 3 0 3 4: a b 10 1 3 5: a b 4 1 3 6: a b 2 0 2 7: a b 4 0 4 8: a b 3 0 3 9: a b 10 1 5 10: a b 4 1 4 11: a b 10 0 10 12: a b 6 0 6 13: a b 6 0 6 14: c d 2 0 2 15: c d 4 0 4 16: c d 3 0 3 17: c d 10 1 5 18: c d 10 0 10 19: c d 6 0 6 20: c d 6 0 6
Explanation
The key point is to identify which rows belong to the period around each streak of action days. As action
consists only of 0
and 1
we can use pattern matching in character strings with an regular expression.
For this, the action
column is collapsed into a character string (separately for each code
, item
group). Then, stringr::str_locate_all()
is used to find the start and end positions of the action pattern
. action pattern
is a regular expression that is looking for any sequence of 1
s surrounded by the required number of leading and trailing 0
s, resp.
In fact, the regular expression is somewhat more complicated as we have to use lookahead in order to capture overlapping action patterns such as 000111000
in 000111000111000
. The end
position of the lookahead regex points to the last 1
in each sequence instead of the last 0
, so end
will be adjusted later on.
Finally, the start and end positions are converted into row locations in DF
rather than locations relativ to the group and are returned in tmp
.
Now, we do a non-equi join which aggregates and updates DF
with an additional med
column which contains the median sales of the zero action rows which belong to each start
, end
range.
The remaining steps are to prepare the output
column and to remove the helper columns.
Data
mydat2 <-
structure(list(code = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "a", "a", "a", "c", "c", "c", "c", "c", "c", "c"),
item = c("b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
"b", "b", "b", "d", "d", "d", "d", "d", "d", "d"), sales = c(2L,
4L, 3L, 10L, 4L, 2L, 4L, 3L, 10L, 4L, 10L, 6L, 6L, 2L, 4L,
3L, 10L, 10L, 6L, 6L), action = c(0L, 0L, 0L, 1L, 1L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), row.names = c(NA,
-20L), class = "data.frame")
这篇关于替换中位数时,Rscript工作不正确的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!