Why does nested ifelse create incorrect results in dplyr 0.5.0 mutate?
Problem description
Consider the following data frame:
(tmp_df <-
structure(list(class = c(0L, 0L, 1L, 1L, 2L, 2L), logi = c(TRUE,
FALSE, TRUE, FALSE, TRUE, FALSE), val = c(1, 1, 1, 1, 1, 1),
taken = c(1.00684931506849, 0.993197278911565, 1.025, 0.975609756097561,
1.00826446280992, 0.991803278688525)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("class",
"logi", "val", "taken")))
which creates:
Source: local data frame [6 x 4]
  class  logi   val     taken
  <int> <lgl> <dbl>     <dbl>
1     0  TRUE     1 1.0068493
2     0 FALSE     1 0.9931973
3     1  TRUE     1 1.0250000
4     1 FALSE     1 0.9756098
5     2  TRUE     1 1.0082645
6     2 FALSE     1 0.9918033
I wish to group by class. If a group contains two members, then subtract 1 from val if logi == FALSE; otherwise, subtract the minimum value of taken in that group from val. If a group does not contain two members, then subtract zero from val.
Using the dplyr package, this can be expressed as:
tmp_df %>%
  group_by(class) %>%
  mutate(taken_2 = ifelse(n() != 2, 0,
                          ifelse(logi, min(taken), 1)),
         not_taken = val - taken_2)
However, this produces an incorrect result, whereby the second ifelse always resolves to the first condition:
Source: local data frame [6 x 6]
Groups: class [3]
  class  logi   val     taken   taken_2   not_taken
  <int> <lgl> <dbl>     <dbl>     <dbl>       <dbl>
1     0  TRUE     1 1.0068493 0.9931973 0.006802721
2     0 FALSE     1 0.9931973 0.9931973 0.006802721
3     1  TRUE     1 1.0250000 0.9756098 0.024390244
4     1 FALSE     1 0.9756098 0.9756098 0.024390244
5     2  TRUE     1 1.0082645 0.9918033 0.008196721
6     2 FALSE     1 0.9918033 0.9918033 0.008196721
The correct result is produced if we drop the first ifelse statement:
tmp_df %>%
  group_by(class) %>%
  mutate(taken_2 = ifelse(logi, min(taken), 1),
         not_taken = val - taken_2)
producing:
Source: local data frame [6 x 6]
Groups: class [3]
  class  logi   val     taken   taken_2   not_taken
  <int> <lgl> <dbl>     <dbl>     <dbl>       <dbl>
1     0  TRUE     1 1.0068493 0.9931973 0.006802721
2     0 FALSE     1 0.9931973 1.0000000 0.000000000 # correct!
3     1  TRUE     1 1.0250000 0.9756098 0.024390244
4     1 FALSE     1 0.9756098 1.0000000 0.000000000 # correct!
5     2  TRUE     1 1.0082645 0.9918033 0.008196721
6     2 FALSE     1 0.9918033 1.0000000 0.000000000 # correct!
The problem seems to be isolated to mutate combined with the nested ifelse, as other code fragments that do similar things work as expected:
tmp_df %>%
  group_by(class) %>%
  mutate(taken_2 = ifelse(n() != 3, 0,
                          ifelse(logi, min(taken), 1)),
         not_taken = val - taken_2)

tmp_df_2 <-
  tmp_df %>%
  filter(row_number() <= 2)

(tmp_df_2$taken_2 <-
  ifelse(c(0, 0), 0,
         ifelse(tmp_df_2$logi, min(tmp_df_2$taken), 1)))

## but the following does not work (shows the problem is not caused by grouping)
# tmp_df_2 %>%
#   mutate(taken_2 = ifelse(n() != 2, 0,
#                           ifelse(logi, min(taken), 1)),
#          not_taken = val - taken_2)
Why is this happening, and how can I obtain the expected behaviour? A workaround is to split the nested ifelse logic into multiple in-line mutates:
tmp_df %>%
  group_by(class) %>%
  mutate(taken_2 = ifelse(n() != 2, 0, 1),
         taken_3 = taken_2 * ifelse(logi, min(taken), 1),
         not_taken = val - taken_3)
Someone else has identified a similar problem with nested ifelse, but I don't know whether it has the same root cause: "ifelse using dplyr results in NAs for some records".
You are a victim of ifelse vector recycling. The key is this line:
mutate(taken_2 = ifelse(n() != 2, 0,
                        ifelse(logi, min(taken), 1))
Because n() != 2 is length 1 (for each group), the outer ifelse returns a length-1 result: it only considers the value computed for the first logi, and mutate then repeats/recycles this value across the whole group.
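The recycling can be reproduced outside dplyr. The following is a minimal base R sketch, not code from the original post: it uses the two rows of the class == 0 group, and the variable n is an assumed stand-in for what n() returns inside that group.

```r
# Base R only; values are the class == 0 group from the question.
logi  <- c(TRUE, FALSE)
taken <- c(1.00684931506849, 0.993197278911565)
n     <- length(logi)  # assumed stand-in for dplyr's n() within one group

# The inner ifelse() is fine on its own: one value per row.
inner <- ifelse(logi, min(taken), 1)

# Length-1 test: ifelse() shapes its result like the test,
# so only the first element of `inner` is kept.
scalar_test <- ifelse(n != 2, 0, inner)

# Test expanded to one element per row: every row is evaluated.
vector_test <- ifelse(rep(n != 2, n), 0, inner)

print(length(scalar_test))
print(scalar_test)
print(vector_test)
```

The length-1 value in scalar_test is what ends up repeated down the group in the incorrect output, while vector_test keeps one result per row.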
You should use if for the length-1 condition and if_else for the vectorised one:
mutate(taken_2 = if (n() != 2) 0 else if_else(logi, min(taken), 1))
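As a hedged end-to-end check, the sketch below applies that line to the data from the question. It rebuilds tmp_df as a plain data.frame rather than the original tbl_df (an assumption made to keep the example self-contained), and the result name fixed is illustrative.

```r
library(dplyr)

# Same values as tmp_df in the question, as a plain data.frame.
tmp_df <- data.frame(
  class = c(0L, 0L, 1L, 1L, 2L, 2L),
  logi  = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  val   = c(1, 1, 1, 1, 1, 1),
  taken = c(1.00684931506849, 0.993197278911565, 1.025,
            0.975609756097561, 1.00826446280992, 0.991803278688525)
)

fixed <- tmp_df %>%
  group_by(class) %>%
  mutate(
    # Scalar `if` for the per-group condition, vectorised if_else() per row.
    taken_2   = if (n() != 2) 0 else if_else(logi, min(taken), 1),
    not_taken = val - taken_2
  ) %>%
  ungroup()

print(fixed)
```

Rows with logi == FALSE should now get taken_2 equal to 1 and not_taken equal to 0, matching the output marked "# correct!" in the question.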
I would recommend never using ifelse. Take it from someone who almost caused a multi-million dollar error due to this exact bug.