计算90%百分位数并用R中的中位数按组替换 [英] calculation of 90 percentile and replacement of it by median by groups in R
问题描述
这是数据的一部分.
mydat=structure(list(code = c(123L, 123L, 123L, 123L, 123L, 123L, 123L,
123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L,
123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L,
123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L,
123L, 123L, 123L, 123L, 123L, 123L, 123L, 222L, 222L, 222L, 222L,
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L,
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L,
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L,
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L),
item = c(234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L,
234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L,
234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L,
234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L,
234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 333L,
333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L,
333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L,
333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L,
333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L,
333L, 333L, 333L, 333L, 333L, 333L), return = c(25L, 25L,
21L, 37L, 23L, 27L, 19L, 7L, 16L, 12L, 33L, 24L, 6L, 14L,
4L, 25L, 90L, 27L, 3L, 16L, 7L, 1L, 13L, 11L, 36L, 5L, 6L,
14L, 11L, 41L, 11L, 6L, 4L, 11L, 3L, 6L, 21L, 41L, 28L, 30L,
92L, 4L, 1L, 83L, 3L, 16L, 4L, 25L, 25L, 21L, 37L, 23L, 27L,
19L, 7L, 16L, 12L, 33L, 24L, 6L, 14L, 4L, 25L, 90L, 27L,
3L, 16L, 7L, 1L, 13L, 11L, 36L, 5L, 6L, 14L, 11L, 41L, 11L,
6L, 4L, 11L, 3L, 6L, 21L, 41L, 28L, 30L, 92L, 4L, 1L, 83L,
3L, 16L, 4L), action = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("code",
"item", "return", "action"), class = "data.frame", row.names = c(NA,
-94L))
我有2个小组vars代码和项目.这是两组:
I have 2 group vars code+item. Here two groups:
123 234
222 333
我也有行动专栏.它只能有两个值(类别)零(0)或一(1).
Also i have action column. It can have only two values(category) zero(0) or one(1).
我需要通过返回列的零操作类别来计算90 percentile
,该类别在一个操作类别之前.
然后,我需要通过返回列的零操作类别来计算median
,该类别排在一个操作类别之前. (一之后我们就不会碰零)
i need calculate 90 percentile
by zero category of action of return column, which go before one category of action.
Then i need calculate the median
by zero category of action of return column, which go before one category of action. (After one we don't touch zeros)
然后我必须找到上面计算出的90%以上的值,然后必须用计算出的中位数来代替这些值.
Then i have to find values that more 90 percentile which was calculated above, then such values must be replce by the median which was calculated.
在执行一类动作之后,再次将零类别用于返回列.为此,我也必须找到上面计算得出的90%以上的值,然后必须用上面计算得出的中位数(当计算14个零时)替换该值.
After one category of action go zero category again for return column. For it i too have to find value that more 90 percentile which was calculeted above, then such value must be replaced by the median which was calculeted above(when calculate for 14 zeros).
请注意,calculation
是由一类动作之前的14个零完成的
但对所有zero category
动作均按中值显示replacing
并为每个组执行code+item
Note the calculation
is done by 14 zeros preceding the one category of action
but replacing
by median is done for all zero category
of action
and performing for each groups code+item
结果可以在输出列中.
在此处更加清晰地显示所需的输出.
to be more clear here desired output.
对于123+234
组
90 perc = 39,5
中位数= 12
for 123+234
group
the 90 perc=39,5
median=12
对于222+333
90 perc = 39,5
中位数= 12
for 222+333
the 90 perc=39,5
median=12
code item return action output
1 123 234 25 0 25
2 123 234 25 0 25
3 123 234 21 0 21
4 123 234 37 0 16
5 123 234 23 0 23
6 123 234 27 0 27
7 123 234 19 0 19
8 123 234 7 0 7
9 123 234 16 0 16
10 123 234 12 0 12
11 123 234 33 0 33
12 123 234 24 0 24
13 123 234 6 0 6
14 123 234 14 0 14
15 123 234 4 0 4
16 123 234 25 0 25
17 123 234 90 0 **12**
18 123 234 27 0 27
19 123 234 3 0 3
20 123 234 16 0 16
21 123 234 7 0 7
22 123 234 1 0 1
23 123 234 13 0 13
24 123 234 11 0 11
25 123 234 36 0 36
26 123 234 5 0 5
27 123 234 6 0 6
28 123 234 14 0 14
29 123 234 11 0 11
30 123 234 41 0 16
31 123 234 11 1 Na
32 123 234 6 1 Na
33 123 234 4 1 Na
34 123 234 11 1 Na
35 123 234 3 0 3
36 123 234 6 0 6
37 123 234 21 0 21
38 123 234 41 0 **12**
39 123 234 28 0 28
40 123 234 30 0 30
41 123 234 92 0 **12**
42 123 234 4 0 4
43 123 234 1 0 1
44 123 234 83 0 **12**
45 123 234 3 0 3
46 123 234 16 0 16
47 123 234 4 0 4
48 222 333 25 0 25
49 222 333 25 0 25
50 222 333 21 0 21
51 222 333 37 0 16
52 222 333 23 0 23
53 222 333 27 0 27
54 222 333 19 0 19
55 222 333 7 0 7
56 222 333 16 0 16
57 222 333 12 0 12
58 222 333 33 0 33
59 222 333 24 0 24
60 222 333 6 0 6
61 222 333 14 0 14
62 222 333 4 0 4
63 222 333 25 0 25
64 222 333 90 0 **12**
65 222 333 27 0 27
66 222 333 3 0 3
67 222 333 16 0 16
68 222 333 7 0 7
69 222 333 1 0 1
70 222 333 13 0 13
71 222 333 11 0 11
72 222 333 36 0 36
73 222 333 5 0 5
74 222 333 6 0 6
75 222 333 14 0 14
76 222 333 11 0 11
77 222 333 41 0 16
78 222 333 11 1 Na
79 222 333 6 1 Na
80 222 333 4 1 Na
81 222 333 11 1 Na
82 222 333 3 0 3
83 222 333 6 0 6
84 222 333 21 0 21
85 222 333 41 0 **12**
86 222 333 28 0 28
87 222 333 30 0 30
88 222 333 92 0 **12**
89 222 333 4 0 4
90 222 333 1 0 1
91 222 333 83 0 **12**
92 222 333 3 0 3
93 222 333 16 0 16
94 222 333 4 0 4
**我标记了其中值用中位数代替的行.
** i marked rows where value was replced by median.
推荐答案
如果我理解正确,则OP希望
If I understand correctly, the OP wants to
- 分别计算最后14个零动作行"(即带有
action == 0
的行)的90%分位数和中位数,先于使用action == 1
的第一行,分别为每个, item
组. (这意味着每个组中只有action == 1
个值的条纹. - 将所有具有
action == 0
的行的return
值复制到output
. - 如果
output
值比每个组中第一action == 1
行之前的 all 所有零操作行中位数的90%分位数大,则将其替换.
- compute the 90% quantile and the median for the last 14 "zero action rows" (i.e., rows with
action == 0
) before the first row withaction == 1
, separately for eachcode
,item
group. (This implies that there is only one streak ofaction == 1
values per group. - Copy the
return
values to theoutput
for all rows withaction == 0
. - Replace the
output
value if it is larger than the 90% quantile by the median in all zero action rows before the firstaction == 1
row in each group.
可以通过通过非等额联接更新并进行一些准备工作来解决
This can be solved by updating in a non-equi join with some preparations
library(data.table)
# mark the zero acton rows before the the action period
setDT(mydat)[, zero_before := cummax(action), by = .(code, item)]
# compute median and 90% quantile for that last 14 rows before each action period
agg <- mydat[zero_before == 0,
quantile(tail(return, 14L), c(0.5, 0.9)) %>%
as.list() %>%
set_names(c("med", "q90")) %>%
c(.(zero_before = 0)), by = .(code, item)]
agg
code item med q90 zero_before
1: 123 234 12 39.5 0
2: 222 333 12 39.5 0
# append output column
mydat[action == 0, output := as.double(return)][
# replace output values greater q90 in an update non-equi join
agg, on = .(code, item, action, return > q90), output := as.double(med)][
# remove helper column
, zero_before := NULL]
由于mydat
有94行,我们将仅显示更新的行:
As mydat
has 94 rows, we will show only the updated rows:
mydat[return != output]
code item return action output
1: 123 234 90 0 12
2: 123 234 41 0 12
3: 123 234 41 0 12
4: 123 234 92 0 12
5: 123 234 83 0 12
6: 222 333 90 0 12
7: 222 333 41 0 12
8: 222 333 41 0 12
9: 222 333 92 0 12
10: 222 333 83 0 12
这篇关于计算90%百分位数并用R中的中位数按组替换的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!