Why is dplyr removing values not met by condition?


Question

I'm using dplyr to replace the value with NA if a condition is met, but it's putting NA in places where it shouldn't be.

dput:

df <- structure(list(id = c("USC00231275", "USC00231275", "USC00231275", 
"USC00231275", "USC00231275", "USC00231275", "USC00231275", "USC00231275", 
"USC00231275", "USC00231275"), element = c("TMAX", "TMIN", "TMAX", 
"TMIN", "TMAX", "TMIN", "TMAX", "TMIN", "TMAX", "TMIN"), year = c(1937, 
1937, 1937, 1937, 1937, 1937, 1937, 1937, 1937, 1937), month = c(5, 
5, 5, 5, 5, 5, 5, 5, 5, 5), day = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 
5), date = structure(c(-11933, -11933, -11932, -11932, -11931, 
-11931, -11930, -11930, -11929, -11929), class = "Date"), value = c(0, 
53.96, 68, 44.96, 62.06, 53.96, 73.04, 53.96, 69.08, 50)), .Names = c("id", 
"element", "year", "month", "day", "date", "value"), row.names = c(NA, 
10L), class = "data.frame")

data.frame (Note: the condition is only met on rows 1 and 2)

            id element year month day       date value
1  USC00231275    TMAX 1937     5   1 1937-05-01  0.00
2  USC00231275    TMIN 1937     5   1 1937-05-01 53.96
3  USC00231275    TMAX 1937     5   2 1937-05-02 68.00
4  USC00231275    TMIN 1937     5   2 1937-05-02 44.96
5  USC00231275    TMAX 1937     5   3 1937-05-03 62.06
6  USC00231275    TMIN 1937     5   3 1937-05-03 53.96
7  USC00231275    TMAX 1937     5   4 1937-05-04 73.04
8  USC00231275    TMIN 1937     5   4 1937-05-04 53.96
9  USC00231275    TMAX 1937     5   5 1937-05-05 69.08
10 USC00231275    TMIN 1937     5   5 1937-05-05 50.00

dplyr

df %>%
  group_by(date) %>%
  mutate(
    value = if(value[element == 'TMIN'] >= value[element == 'TMAX'])
      as.numeric(NA) else value
  )

            id element  year month   day       date value
         (chr)   (chr) (dbl) (dbl) (dbl)     (date) (dbl)
1  USC00231275    TMAX  1937     5     1 1937-05-01    NA
2  USC00231275    TMIN  1937     5     1 1937-05-01    NA
3  USC00231275    TMAX  1937     5     2 1937-05-02 68.00
4  USC00231275    TMIN  1937     5     2 1937-05-02 44.96
5  USC00231275    TMAX  1937     5     3 1937-05-03    NA
6  USC00231275    TMIN  1937     5     3 1937-05-03    NA
7  USC00231275    TMAX  1937     5     4 1937-05-04 73.04
8  USC00231275    TMIN  1937     5     4 1937-05-04 53.96
9  USC00231275    TMAX  1937     5     5 1937-05-05 69.08
10 USC00231275    TMIN  1937     5     5 1937-05-05 50.00

Notice that the only rows that should change are 1 and 2, but dplyr changed rows 5 and 6 even though the conditions were not met.

Recommended answer

The code below should do what you are trying to do:

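library(dplyr)

df %>%
  group_by(date) %>%
  # The comparison gives one value per date; element == 'TMIN' makes it row-specific
  mutate(new_value = ifelse(
    (value[element == 'TMIN'] >= value[element == 'TMAX']) & element == 'TMIN',
    NA, value
  )) %>%
  ungroup()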



As for whether this is a bug: I don't think it is. Looking at just the data for the one date where TMIN >= TMAX (1937-05-01), you get the following:

df %>%
  filter(date == '1937-05-01') %>%
  mutate(res = (value[element == 'TMIN'] >= value[element == 'TMAX'])) %>%
  mutate(new_value = ifelse( (res & element=='TMIN'), NA, value))

           id element year month day       date value  res new_value
1 USC00231275    TMAX 1937     5   1 1937-05-01  0.00 TRUE         0
2 USC00231275    TMIN 1937     5   1 1937-05-01 53.96 TRUE        NA

The expression value[element == 'TMIN'] >= value[element == 'TMAX'] yields a single value for the whole group, so it is TRUE on every row here, as can be seen in the res column. The code below breaks this down a bit to clarify.

### Just looking at one date
> df2 <- df %>% filter(date == '1937-05-01')
> df2
           id element year month day       date value
1 USC00231275    TMAX 1937     5   1 1937-05-01  0.00
2 USC00231275    TMIN 1937     5   1 1937-05-01 53.96

### These are the two values being compared (TMIN first, then TMAX).
### The comparison gives a single result that is recycled for every
### element in the group, so it will always be TRUE or always FALSE.
> c(df2$value[df2$element == 'TMIN'], df2$value[df2$element == 'TMAX'])
[1] 53.96  0.00

Since there is only one comparison for the entire group, every row in that group sees the same result: always TRUE or always FALSE.
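To make the recycling concrete, here is a minimal base R sketch using the two values for 1937-05-01 from the question. The names flag, whole_group and row_level are made up for illustration and do not appear in the original code.

```r
# Values for a single date group, copied from the question's data
element <- c("TMAX", "TMIN")
value   <- c(0, 53.96)

# One comparison for the whole group: a single logical value
flag <- value[element == "TMIN"] >= value[element == "TMAX"]

# if() acts on that single value, so it either keeps the whole vector
# or returns one NA (which mutate() would recycle to every row of the group)
whole_group <- if (flag) NA_real_ else value

# ifelse() works element by element; the length-1 flag is recycled
# against the row-level test element == "TMIN"
row_level <- ifelse(flag & element == "TMIN", NA, value)

print(flag)
print(whole_group)
print(row_level)
```

With if(), the decision is all-or-nothing for the group; with ifelse() plus a row-level test, only the rows that match the test are changed.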

The code that gives the correct result works around this by combining the group-level comparison with the row-level test element == 'TMIN' inside ifelse(), which is evaluated element by element.

One possible final solution could be:

df %>%
  group_by(date) %>%
  mutate(value = ifelse(
    (value[element == 'TMIN'] >= value[element == 'TMAX']) & element == 'TMIN',
    NA, value
  )) %>%
  ungroup()
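As a quick sanity check, the same pipeline can be run on a trimmed copy of the question's data (two dates only). This is a sketch that assumes dplyr is installed; small, out, both and bad are hypothetical names used only for illustration. The second pipeline is a variation, not part of the answer above: it blanks every row of an offending date, which is what the question describes as the expected result.

```r
library(dplyr)

# Trimmed copy of the question's data: two dates, one with TMIN >= TMAX
small <- data.frame(
  date    = as.Date(c("1937-05-01", "1937-05-01", "1937-05-02", "1937-05-02")),
  element = c("TMAX", "TMIN", "TMAX", "TMIN"),
  value   = c(0, 53.96, 68, 44.96)
)

# The answer's approach: only the TMIN row of an offending date becomes NA
out <- small %>%
  group_by(date) %>%
  mutate(value = ifelse(
    (value[element == "TMIN"] >= value[element == "TMAX"]) & element == "TMIN",
    NA, value
  )) %>%
  ungroup()

# Variation (an assumption about intent): blank both rows of an offending date.
# The group-level flag is stored in a helper column first, then used row by row.
both <- small %>%
  group_by(date) %>%
  mutate(bad = value[element == "TMIN"] >= value[element == "TMAX"]) %>%
  mutate(value = ifelse(bad, NA_real_, value)) %>%
  ungroup()

print(out)
print(both)
```

In out, only the TMIN row for 1937-05-01 is NA; in both, the TMAX and TMIN rows for that date are NA, and 1937-05-02 is untouched in either case.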
