Why is dplyr removing values not met by condition?
Problem description
I'm using dplyr to replace value with NA if a condition is met, but it's putting NA in places where it shouldn't be.
dput:
df <- structure(list(id = c("USC00231275", "USC00231275", "USC00231275",
"USC00231275", "USC00231275", "USC00231275", "USC00231275", "USC00231275",
"USC00231275", "USC00231275"), element = c("TMAX", "TMIN", "TMAX",
"TMIN", "TMAX", "TMIN", "TMAX", "TMIN", "TMAX", "TMIN"), year = c(1937,
1937, 1937, 1937, 1937, 1937, 1937, 1937, 1937, 1937), month = c(5,
5, 5, 5, 5, 5, 5, 5, 5, 5), day = c(1, 1, 2, 2, 3, 3, 4, 4, 5,
5), date = structure(c(-11933, -11933, -11932, -11932, -11931,
-11931, -11930, -11930, -11929, -11929), class = "Date"), value = c(0,
53.96, 68, 44.96, 62.06, 53.96, 73.04, 53.96, 69.08, 50)), .Names = c("id",
"element", "year", "month", "day", "date", "value"), row.names = c(NA,
10L), class = "data.frame")
data.frame (note: the condition is met only on rows 1 and 2)
id element year month day date value
1 USC00231275 TMAX 1937 5 1 1937-05-01 0.00
2 USC00231275 TMIN 1937 5 1 1937-05-01 53.96
3 USC00231275 TMAX 1937 5 2 1937-05-02 68.00
4 USC00231275 TMIN 1937 5 2 1937-05-02 44.96
5 USC00231275 TMAX 1937 5 3 1937-05-03 62.06
6 USC00231275 TMIN 1937 5 3 1937-05-03 53.96
7 USC00231275 TMAX 1937 5 4 1937-05-04 73.04
8 USC00231275 TMIN 1937 5 4 1937-05-04 53.96
9 USC00231275 TMAX 1937 5 5 1937-05-05 69.08
10 USC00231275 TMIN 1937 5 5 1937-05-05 50.00
dplyr
df %>%
  group_by(date) %>%
  mutate(
    value = if (value[element == 'TMIN'] >= value[element == 'TMAX'])
      as.numeric(NA) else value
  )
id element year month day date value
(chr) (chr) (dbl) (dbl) (dbl) (date) (dbl)
1 USC00231275 TMAX 1937 5 1 1937-05-01 NA
2 USC00231275 TMIN 1937 5 1 1937-05-01 NA
3 USC00231275 TMAX 1937 5 2 1937-05-02 68.00
4 USC00231275 TMIN 1937 5 2 1937-05-02 44.96
5 USC00231275 TMAX 1937 5 3 1937-05-03 NA
6 USC00231275 TMIN 1937 5 3 1937-05-03 NA
7 USC00231275 TMAX 1937 5 4 1937-05-04 73.04
8 USC00231275 TMIN 1937 5 4 1937-05-04 53.96
9 USC00231275 TMAX 1937 5 5 1937-05-05 69.08
10 USC00231275 TMIN 1937 5 5 1937-05-05 50.00
Notice that the only rows that should change are 1 and 2, but dplyr also changed rows 5 and 6 even though the condition was not met there.
Recommended answer
The code below should do what you are trying to do:
df %>%
  group_by(date) %>%
  mutate(new_value = ifelse(
    (value[element == 'TMIN'] >= value[element == 'TMAX']) & element == 'TMIN',
    NA, value
  )) %>%
  ungroup()
As for whether this is a bug, I don't think it is. Looking at just the data for the one date where TMIN >= TMAX, you have the following:
df %>%
filter(date == '1937-05-01') %>%
mutate(res = (value[element == 'TMIN'] >= value[element == 'TMAX'])) %>%
mutate(new_value = ifelse( (res & element=='TMIN'), NA, value))
id element year month day date value res new_value
1 USC00231275 TMAX 1937 5 1 1937-05-01 0.00 TRUE 0
2 USC00231275 TMIN 1937 5 1 1937-05-01 53.96 TRUE NA
The construct value[element == 'TMIN'] >= value[element == 'TMAX'] is evaluated once for the group, so it is TRUE for every row of this date, as can be seen in the res column. The code below breaks this down a bit to hopefully clarify.
### Just looking at one date
> df2 <- df %>% filter(date == '1937-05-01')
> df2
id element year month day date value
1 USC00231275 TMAX 1937 5 1 1937-05-01 0.00
2 USC00231275 TMIN 1937 5 1 1937-05-01 53.96
### These are the two values being compared (TMIN first, then TMAX).
### The single result is recycled for every row in the group,
### so it will be TRUE for all of them or FALSE for all of them.
> c(df2$value[df2$element == 'TMIN'], df2$value[df2$element == 'TMAX'])
[1] 53.96 0.00
Since there is one comparison for the entire group, every row in the group sees the same result: always TRUE or always FALSE.
The code that gives the correct result gets around this by combining the group-level comparison with a per-row test (element == 'TMIN').
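To make the recycling concrete, here is a minimal base R sketch using only the two values for 1937-05-01 from the question. The variable names cond and row_test are just illustrative, not part of the original code.

```r
# Values for a single date, copied from the question's data
value   <- c(0, 53.96)
element <- c("TMAX", "TMIN")

# One comparison for the whole group: a length-1 logical
cond <- value[element == "TMIN"] >= value[element == "TMAX"]
cond          # TRUE
length(cond)  # 1

# Combining it with a per-row test recycles cond across the rows,
# giving one TRUE/FALSE per row instead of one per group
row_test <- cond & element == "TMIN"
row_test      # FALSE TRUE

# ifelse() now has a test as long as the group, so only the TMIN row is replaced
ifelse(row_test, NA, value)  # 0 NA
```

Note that ifelse() returns a result with the length of its test, which is why the per-row part of the condition matters here.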
One possible final solution could be:
df %>%
  group_by(date) %>%
  mutate(value = ifelse(
    (value[element == 'TMIN'] >= value[element == 'TMAX']) & element == 'TMIN',
    NA, value
  )) %>%
  ungroup()
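The solution above replaces only the TMIN row. If the intent is the one stated in the question (both rows of an offending date become NA), a hedged variant might look like the sketch below. It uses a trimmed copy of the question's data, and bad_day is a hypothetical helper column introduced only for illustration. It also assumes every date has exactly one TMIN row and one TMAX row; with missing or duplicated rows the comparison would no longer have length 1 and the code would likely fail.

```r
library(dplyr)

# Trimmed copy of the question's data (first two dates only)
df_small <- data.frame(
  element = c("TMAX", "TMIN", "TMAX", "TMIN"),
  date    = as.Date(c("1937-05-01", "1937-05-01", "1937-05-02", "1937-05-02")),
  value   = c(0, 53.96, 68, 44.96)
)

result <- df_small %>%
  group_by(date) %>%
  mutate(
    # bad_day (hypothetical name): one flag per date, recycled onto each row
    bad_day = value[element == "TMIN"] >= value[element == "TMAX"],
    # if_else() keeps the column numeric by using a typed NA
    value   = if_else(bad_day, NA_real_, value)
  ) %>%
  ungroup()

result$value  # NA NA 68.00 44.96
```

Storing the flag as a column first makes the per-group result visible, which can help when checking why a row was or was not replaced.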