组合方案以按R中的组替换中位数 [英] Combining scenario to Replace Medians by Groups in R
问题描述
我有数据集
mydat <-
structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("52382MCK",
"52499MCK"), class = "factor"), item = c(11709L, 11709L, 11709L,
11709L, 11708L, 11708L, 11708L, 11710L, 11710L, 11710L, 11710L,
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L,
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L,
11710L, 11202L, 11203L, 11203L, 11204L, 11204L, 11205L, 11205L
), sales = c(30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L,
15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L,
20L, 15L, 2L, 10L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), action = c(0L,
1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
1L, 1L)), row.names = c(NA, -35L), class = "data.frame")
# coerce to data.table
setDT(mydat)
使用此数据集,将执行多项操作。
with this dataset, several operations are performed.
1. selecting scenario by groups.
因此有操作列。它只能有两个值零(0)或一个(1)。
So there is action column. It can have only two values zero(0) or one(1).
方案的场景是第一个动作类别之前的零个动作类别数量和一类动作后为零。
The scenarios are the number of zero categories of action before first category of action and the number of zeros after one category of action.
For example
52382МСК 11709
在这种情况下,我们有1个零类别的操作列。在动作col的第一类别之前,以及在动作col的第一类别之后的两个零。注意:当我们有2个零类别的动作col时,可能是这种情况。
it is scenario when we have 1 zero category of action col. before first category of action col , and two zeros after first category of action col. Note: maybe scenario when we have 2 zero category of action col. before first category of action col , and 1 zero after first category of action col.
mydat1
code item sales action
52382МСК 11709 30 0
52382МСК 11709 10 1
52382МСК 11709 20 0
52382МСК 11709 15 0
要检测这种情况,我使用此脚本/
该脚本非常有效,感谢@Uwe
to detect this scenario i use this script/ This script very well works, thank for @Uwe
library(data.table)
library(magrittr)
max_zeros <- 3
zeros <- sapply(0:max_zeros, stringr::str_dup, string = "0")
names(zeros) <- as.character(nchar(zeros))
sc <- CJ(zeros.before = zeros, zeros.after = zeros)[
, scenario.name := paste(nchar(zeros.before), nchar(zeros.after), sep = "-")][
, action.pattern := sprintf("%s1+(?=%s)", zeros.before, zeros.after)][]
# special case: all zero
sc0 <- data.table(
zeros.before = NA,
zeros.after = NA,
scenario.name = "no1",
action.pattern = "^0+$")
sc <- rbind(sc0, sc)
然后
setDT(mydat)
class <- mydat[, .(scenario.name = sc$scenario.name[
paste(action, collapse = "") %>%
stringr::str_count(sc$action.pattern) %>%
is_greater_than(0) %>%
which() %>%
max()
]),
by = .(code, item)][]
class
mydat[class, on = .(code, item)]
所以我获得了情景类别的数据。
So i get data with class of scenario.
2.operation it is replace median.
对于每种情况,均按零类别计算中位数。
For each scenario median by zero category is calculated.
我需要按操作列按1前面的零类别来计算中值,即按一列操作列之前的零,按操作列按2个零来计算中位数。
仅按销售列对第一类动作列
进行的中位数替换。
如果中位数大于销售额,则不要替换。
I need to calculate the median by 1 preceding zeros category by action column, i.e. which go before one category of action column, and by 2 zeros by action column that go after the one category. The median replacing performed only for first category of action column by sale column. if median is more than the sales, then do not replace it.
要做到这一点,我使用脚本
To do it i use the script
sales_action <- function(DF, zeros_before, zeros_after) {
library(data.table)
library(magrittr)
action_pattern <-
do.call(sprintf,
c(fmt = "%s1+(?=%s)",
stringr::str_dup("0", c(zeros_before, zeros_after)) %>% as.list()
))
message("Action pattern used: ", action_pattern)
setDT(DF)[, rn := .I]
tmp <- DF[, paste(action, collapse = "") %>%
stringr::str_locate_all(action_pattern) %>%
as.data.table() %>%
lapply(function(x) rn[x]),
by = .(code, item)][
, end := end + zeros_after]
DF[tmp, on = .(code, item, rn >= start, rn <= end),
med := as.double(median(sales[action == 0])), by = .EACHI][
, output := as.double(sales)][action == 1, output := pmin(sales, med)][
, c("rn", "med") := NULL][]
}
然后
sales_action(mydat, 1L, 2L)
所以我得到了结果。
每次我必须手动输入要用中位数替换的情况
Each time i must manually enter the scenario to replacing by median
sales_action(mydat, 1L, 2L)
sales_action(mydat, 3L, 1L)
sales_action(mydat, 2L, 2L)
等。
如何为此,替换中位数会自动针对所有可能的情况执行
,这样我就不会每次都写
sales_action(mydat,.L,.L)
How to do that replacing median was perform for all possible scenarios automatically so that I do not write every time sales_action(mydat, .L, .L)
因此是输出示例
code i tem sales action output pattern
52382MCK 11709 30 0 30 01+00
52382MCK 11709 10 1 10 01+00
52382MCK 11709 20 0 20 01+00
52382MCK 11709 15 0 15 01+00
52382MCK 1170 8 0 8 01+00
52382MCK 1170 10 1 8 01+00
52382MCK 1170 2 0 2 01+00
52382MCK 1170 15 0 15 01+00
推荐答案
如果我理解正确,则OP希望通过将操作期间的 sales
数据与销售的中位数进行比较来分析销售操作的成功
If I understand correctly, the OP wants to analyse the success of sales actions by comparing sales
figures during actions with the median sales of the periods immediately before and after the sales action.
存在一些挑战:
- 每个
代码
,项目
组可能有多个销售操作。 - 在一次销售行动前后,可用数据可能比请求的3天少三天。
- There might be more than one sales action per
code
,item
group. - The available data might cover less than the requested 3 three days each before and after a sales action.
恕我直言,场景的引入是解决问题2的绕道而行。
IMHO, the introduction of scenarios is a detour to handle issue 2.
Th下面的方法
- 标识每个
代码
,item
组, - 在之前最多添加三个零动作行,在之后最多添加三行>每个销售操作,
- 计算这些行的中位数销售,并且
- 更新
输出
如果销售操作中的销售数字超过了周围的零操作行的中位数。
- identifies the sales actions within each
code
,item
group, - picks up to three zero action rows before and up to three rows after each sales action,
- computes the median sales of those rows, and
- updates
output
in case the sales figure within a sales action exceeds the median of the surrounding zero action rows.
术语类别是由OP创造的,用于区分销售活动的时间段( action == 1L
的连续条纹)和之前和之后的零操作时间段。
The term category has been coined by the OP to distinguish between periods of sales actions (contiguous streaks of action == 1L
) and the zero action periods before and after.
library(data.table)
# coerce to data.table and create categories
setDT(mydat)[, cat := rleid(action), by = .(code, item)][]
# extract action categories, identify preceeding & succeeding zero action categories
mycat <- mydat[, .(action = first(action)), by = .(code, item, cat = cat)][
, `:=`(before = cat - 1L, after = cat + 1L)][action == 1L]
mycat
code item cat action before after
1: 52382MCK 11709 2 1 1 3
2: 52382MCK 11708 2 1 1 3
3: 52382MCK 11710 2 1 1 3
4: 52382MCK 11710 4 1 3 5
5: 52382MCK 11710 6 1 5 7
6: 52499MCK 11203 2 1 1 3
7: 52499MCK 11205 1 1 0 2
请注意,组 52382MCK,11710
包含三个单独的销售操作。在和在
可能指向不存在的
cat
,但是
Note that group 52382MCK, 11710
includes three separate sales actions. before
and after
may point to non-existing cat
but this will be rectified automatically during the subsequent joins.
# compute median of surrouding zero action categories
action_cat_median <-
rbind(
# get sales from up to 3 zero action rows before action category
mydat[mycat, on = .(code, item, cat = before),
.(sales = tail(sales, 3), i.cat), by =.EACHI],
# get sales from up to 3 zero action rows after action category
mydat[mycat, on = .(code, item, cat = after),
.(sales = head(sales, 3), i.cat), by =.EACHI]
)[
# remove empty groups
!is.na(sales)][
# compute median for each action category
, .(med = as.double(median(sales))), by = .(code, item, cat = i.cat)]
action_cat_median
code item cat med
1: 52382MCK 11709 2 20.0
2: 52382MCK 11708 2 2.5
3: 52382MCK 11710 2 10.0
4: 52382MCK 11710 4 10.0
5: 52382MCK 11710 6 10.0
6: 52499MCK 11203 2 2.0
# prepare result
mydat[, output := as.double(sales)][
# update join
action_cat_median, on = .(code, item, cat), output := pmin(sales, med)]
编辑:或者,调用 pmin()
可以替换为非装备联接,该联接仅更新销售额超过中位数的行:
Alternatively, the call to pmin()
can be replaced by a non-equi join which updates only rows where sales exceeds the median:
# prepare result, alternative approach
mydat[, output := as.double(sales)][
# non-equi update join
action_cat_median, on = .(code, item, cat, output > med), output := med]
mydat
code item sales action cat output
1: 52382MCK 11709 30 0 1 30.0
2: 52382MCK 11709 10 1 2 10.0
3: 52382MCK 11709 20 0 3 20.0
4: 52382MCK 11709 15 0 3 15.0
5: 52382MCK 11708 2 0 1 2.0
6: 52382MCK 11708 10 1 2 2.5
7: 52382MCK 11708 3 0 3 3.0
8: 52382MCK 11710 30 0 1 30.0
9: 52382MCK 11710 10 0 1 10.0
10: 52382MCK 11710 20 0 1 20.0
11: 52382MCK 11710 15 1 2 10.0
12: 52382MCK 11710 2 0 3 2.0
13: 52382MCK 11710 10 0 3 10.0
14: 52382MCK 11710 3 0 3 3.0
15: 52382MCK 11710 30 0 3 30.0
16: 52382MCK 11710 10 0 3 10.0
17: 52382MCK 11710 20 0 3 20.0
18: 52382MCK 11710 15 1 4 10.0
19: 52382MCK 11710 2 0 5 2.0
20: 52382MCK 11710 10 0 5 10.0
21: 52382MCK 11710 3 0 5 3.0
22: 52382MCK 11710 30 0 5 30.0
23: 52382MCK 11710 10 0 5 10.0
24: 52382MCK 11710 20 0 5 20.0
25: 52382MCK 11710 15 1 6 10.0
26: 52382MCK 11710 2 0 7 2.0
27: 52382MCK 11710 10 0 7 10.0
28: 52382MCK 11710 3 0 7 3.0
29: 52499MCK 11202 2 0 1 2.0
30: 52499MCK 11203 2 0 1 2.0
31: 52499MCK 11203 2 1 2 2.0
32: 52499MCK 11204 2 0 1 2.0
33: 52499MCK 11204 2 0 1 2.0
34: 52499MCK 11205 2 1 1 2.0
35: 52499MCK 11205 2 1 1 2.0
code item sales action cat output
以下行已更新:
mydat[output != sales]
code item sales action cat output
1: 52382MCK 11708 10 1 2 2.5
2: 52382MCK 11710 15 1 2 10.0
3: 52382MCK 11710 15 1 4 10.0
4: 52382MCK 11710 15 1 6 10.0
这篇关于组合方案以按R中的组替换中位数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!