Why does nested ifelse create incorrect results in dplyr 0.5.0 mutate?


Problem description

Consider the following data frame:

(tmp_df <-
structure(list(class = c(0L, 0L, 1L, 1L, 2L, 2L), logi = c(TRUE, 
FALSE, TRUE, FALSE, TRUE, FALSE), val = c(1, 1, 1, 1, 1, 1), 
    taken = c(1.00684931506849, 0.993197278911565, 1.025, 0.975609756097561, 
    1.00826446280992, 0.991803278688525)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("class", 
"logi", "val", "taken")))

which creates:

Source: local data frame [6 x 4]

  class  logi   val     taken
  <int> <lgl> <dbl>     <dbl>
1     0  TRUE     1 1.0068493
2     0 FALSE     1 0.9931973
3     1  TRUE     1 1.0250000
4     1 FALSE     1 0.9756098
5     2  TRUE     1 1.0082645
6     2 FALSE     1 0.9918033

I want to group by class. If a group contains exactly two members, subtract 1 from val where logi == FALSE, and otherwise subtract the group's minimum value of taken from val. If a group does not contain two members, subtract zero from val.

With the dplyr package, this can be expressed as:

tmp_df %>%
    group_by(class) %>%
    mutate(taken_2 = ifelse(n() != 2, 0, 
                              ifelse(logi, min(taken), 1)),
           not_taken = val - taken_2)

However, this produces an incorrect result, whereby the second ifelse always resolves to its first branch:

Source: local data frame [6 x 6]
Groups: class [3]

  class  logi   val     taken   taken_2   not_taken
  <int> <lgl> <dbl>     <dbl>     <dbl>       <dbl>
1     0  TRUE     1 1.0068493 0.9931973 0.006802721
2     0 FALSE     1 0.9931973 0.9931973 0.006802721
3     1  TRUE     1 1.0250000 0.9756098 0.024390244
4     1 FALSE     1 0.9756098 0.9756098 0.024390244
5     2  TRUE     1 1.0082645 0.9918033 0.008196721
6     2 FALSE     1 0.9918033 0.9918033 0.008196721

The correct result is produced if the first ifelse statement is removed:

tmp_df %>%
    group_by(class) %>%
    mutate(taken_2 = ifelse(logi, min(taken), 1),
           not_taken = val - taken_2)

producing:

Source: local data frame [6 x 6]
Groups: class [3]

  class  logi   val     taken   taken_2   not_taken
  <int> <lgl> <dbl>     <dbl>     <dbl>       <dbl>
1     0  TRUE     1 1.0068493 0.9931973 0.006802721
2     0 FALSE     1 0.9931973 1.0000000 0.000000000 # correct!
3     1  TRUE     1 1.0250000 0.9756098 0.024390244
4     1 FALSE     1 0.9756098 1.0000000 0.000000000 # correct!
5     2  TRUE     1 1.0082645 0.9918033 0.008196721
6     2 FALSE     1 0.9918033 1.0000000 0.000000000 # correct!

The problem seems to be isolated to mutate combined with the nested ifelse, because other code fragments that do similar things work as expected:

tmp_df %>%
    group_by(class) %>%
    mutate(taken_2 = ifelse(n() != 3, 0, 
                            ifelse(logi, min(taken), 1)),
           not_taken = val - taken_2)

tmp_df_2 <-
    tmp_df %>%
    filter(row_number() <= 2)

(tmp_df_2$taken_2 <-
    ifelse(c(0, 0), 0, 
           ifelse(tmp_df_2$logi, min(tmp_df_2$taken), 1)))

## but the following does not work (checks problem is not to do with grouping)
# tmp_df_2 %>%
#     mutate(taken_2 = ifelse(n() != 2, 0, 
#                             ifelse(logi, min(taken), 1)),
#            not_taken = val - taken_2)

Why is this happening, and how can I obtain the expected behaviour? A workaround is to split the nested ifelse logic into multiple in-line mutates:

tmp_df %>%
    group_by(class) %>%
    mutate(taken_2 = ifelse(n() != 2, 0, 1),
           taken_3 = taken_2 * ifelse(logi, min(taken), 1),
           not_taken = val - taken_3)

Someone else has identified a similar problem with nested ifelse, but I don't know whether it has the same root cause: "ifelse using dplyr results in NAs for some records".

Solution

You are a victim of ifelse vector recycling. The key is this line:

mutate(taken_2 = ifelse(n() != 2, 0, 
                          ifelse(logi, min(taken), 1)))

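Because n() != 2 has length 1 (for each group), the outer ifelse returns a length-1 result. It uses only the first element of the inner ifelse, i.e. the value for the first logi, and mutate then recycles that single value across the whole group.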
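The recycling can be reproduced in base R without dplyr. This is a minimal sketch for a single group of two rows; `n_rows`, `inner` and `outer` are illustrative names only, with `n_rows` standing in for what n() would return inside the group.

```r
# ifelse() returns a vector with the same length as `test`.
logi  <- c(TRUE, FALSE)
taken <- c(1.0068493, 0.9931973)

# The inner call has a length-2 test, so it yields one value per row.
inner <- ifelse(logi, min(taken), 1)

# Stand-in for n() within one group (illustrative, not a dplyr call).
n_rows <- length(logi)

# The outer test is a single FALSE, so only the first element of `inner` is kept.
outer <- ifelse(n_rows != 2, 0, inner)

length(inner)  # 2
length(outer)  # 1
```

When mutate receives that length-1 value, it repeats it for every row of the group, which is why both rows end up with min(taken).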

You should use if and if_else:

mutate(taken_2 = if (n() != 2) 0 else if_else(logi, min(taken), 1))
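For reference, here is how that line might fit into the full pipeline from the question. It is a sketch that assumes a dplyr version providing if_else is installed; the data frame is rebuilt with data.frame() for brevity, and `result` is just an illustrative name.

```r
library(dplyr)

tmp_df <- data.frame(
  class = c(0L, 0L, 1L, 1L, 2L, 2L),
  logi  = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  val   = 1,
  taken = c(1.00684931506849, 0.993197278911565, 1.025,
            0.975609756097561, 1.00826446280992, 0.991803278688525)
)

result <- tmp_df %>%
  group_by(class) %>%
  mutate(
    # `if` decides once per group (n() is a scalar);
    # if_else() then works row by row within the group.
    taken_2   = if (n() != 2) 0 else if_else(logi, min(taken), 1),
    not_taken = val - taken_2
  ) %>%
  ungroup()

result
```

The scalar decision (group size) goes through `if`, and only the row-wise decision goes through the vectorised if_else, so the two lengths never get mixed.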

I would recommend never using ifelse. Take it from someone who almost caused a multi-million dollar error due to this exact bug.
