Date columns with NAs in R - unexpected behaviour with mutate
Question
I'm trying to follow this process with a dataset. Here is a test dataframe:
id <- c("Johnboy","Johnboy","Johnboy")
orderno <- c(2,2,1)
validorder <- c(0,1,1)
ordertype <- c(95,94,95)
orderdate <- as.Date(c("2019-06-17","2019-03-26","2018-08-23"))
df <- data.frame(id, orderno, validorder, ordertype, orderdate)
Then I do the following:
## compute order date for order types
df <- df %>%
  mutate(orderdate_dried = if_else(validorder == 1 & ordertype == 95,
                                   orderdate, as.Date(NA)),
         orderdate_fresh = if_else(validorder == 1 & ordertype == 94,
                                   orderdate, as.Date(NA)))
## take minimum order date by type by order number
df <- df %>%
  group_by(id, orderno) %>%
  mutate(orderdate_dried = min(orderdate_dried, na.rm = TRUE),
         orderdate_fresh = min(orderdate_fresh, na.rm = TRUE)) %>%
  ungroup()
## aggregate order date for each type over individual
df <- df %>%
  group_by(id) %>%
  mutate(max_orderdate_dried = max(orderdate_dried, na.rm = TRUE),
         max_orderdate_fresh = max(orderdate_fresh, na.rm = TRUE)) %>%
  ungroup()
But all the maximum dates at the end of this process are NA! I don't understand how. Further, if I test the original orderdate_dried for NAs:
is.na(df$orderdate_dried)
I get NAs for each row! How is this happening?!
Recommended answer
Very interesting question, and the answer is hidden in the question itself. For clarity, instead of updating the same df every time, I will use df1, df2, etc.
Starting with the data first.
id <- c("Johnboy","Johnboy","Johnboy")
orderno <- c(2,2,1)
validorder <- c(0,1,1)
ordertype <- c(95,94,95)
orderdate <- as.Date(c("2019-06-17","2019-03-26","2018-08-23"))
df <- data.frame(id, orderno, validorder, ordertype, orderdate)
library(dplyr)
Step 1 -
df1 <- df %>%
  mutate(orderdate_dried = if_else(validorder == 1 & ordertype == 95,
                                   orderdate, as.Date(NA)),
         orderdate_fresh = if_else(validorder == 1 & ordertype == 94,
                                   orderdate, as.Date(NA)))
df1
#       id orderno validorder ordertype  orderdate orderdate_dried orderdate_fresh
#1 Johnboy       2          0        95 2019-06-17            <NA>            <NA>
#2 Johnboy       2          1        94 2019-03-26            <NA>      2019-03-26
#3 Johnboy       1          1        95 2018-08-23      2018-08-23            <NA>
Everything is as expected here.
Step 2 -
df2 <- df1 %>%
  group_by(id, orderno) %>%
  mutate(orderdate_dried = min(orderdate_dried, na.rm = TRUE),
         orderdate_fresh = min(orderdate_fresh, na.rm = TRUE)) %>%
  ungroup()
df2
# A tibble: 3 x 7
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA
Everything seems as expected here as well: we get NA when there is no other date in the group.
Step 3 -
df3 <- df2 %>%
  group_by(id) %>%
  mutate(max_orderdate_dried = max(orderdate_dried, na.rm = TRUE),
         max_orderdate_fresh = max(orderdate_fresh, na.rm = TRUE)) %>%
  ungroup()
df3
# A tibble: 3 x 9
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh max_orderdate_dried max_orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>          <date>              <date>
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26      NA                  NA
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26      NA                  NA
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA              NA                  NA
Everything seems to be wrong here. These are basically the same steps that you performed, and this is the same output that you are getting, so we haven't done anything different up to this point.
One thing we have missed, though, is that step 2 produced a warning message.
Warning messages:
1: In min.default(c(NA_real_, NA_real_), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In min.default(NA_real_, na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
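Warnings like this are easy to overlook inside a longer pipeline. As a minimal sketch (base R only; the variable names x and msg are made up for illustration), tryCatch can capture the warning text so the all-NA case is noticed rather than silently producing Inf:

```r
# An all-NA Date vector, like orderdate_dried within the orderno == 2 group
x <- as.Date(c(NA, NA))

# Capture the warning message instead of letting it scroll past.
# If no warning is raised, msg stays NA_character_.
msg <- tryCatch(
  {
    min(x, na.rm = TRUE)
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)

msg  # the text of the warning raised by min()
```

The exact wording of the message depends on the R session's language settings, so it is safer to test whether a warning occurred than to match its text.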
Because a group had no non-NA value, min() returned Inf, even though the output of df2 shows NA (an explanation of why it shows NA when the value is Inf is added at the end of the answer). So even if you test the column with is.na, it returns FALSE for every row.
is.na(df2$orderdate_dried)
#[1] FALSE FALSE FALSE
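To see what is actually stored, it can help to look underneath the Date class. The following is a small sketch in base R (the names x and m are illustrative only) showing that the result of min() on an all-NA Date vector is an infinite value carrying the Date class, not a missing value:

```r
# An all-NA Date vector, as in the group with no valid dried order
x <- as.Date(c(NA, NA))

# suppressWarnings() hides the "no non-missing arguments" warning
m <- suppressWarnings(min(x, na.rm = TRUE))

class(m)        # "Date"
is.na(m)        # FALSE: the value is not missing
is.infinite(m)  # TRUE: it is an infinite Date
unclass(m)      # Inf: the underlying number

# is.finite() is FALSE for both NA and Inf, TRUE for ordinary dates
is.finite(as.Date(NA))            # FALSE
is.finite(as.Date("2018-08-23"))  # TRUE
```

This is why is.finite is a more reliable filter here than is.na: it excludes both genuine NAs and the infinite values left behind by min().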
Hence, max with na.rm also fails.
max(df2$orderdate_dried, na.rm = TRUE)
#[1] NA
Hence, you get all NAs in step 3.
Solution
The solution is to keep only the finite values with is.finite before taking the maximum:
df3 <- df2 %>%
  group_by(id) %>%
  mutate(max_orderdate_dried = max(orderdate_dried[is.finite(orderdate_dried)], na.rm = TRUE),
         max_orderdate_fresh = max(orderdate_fresh[is.finite(orderdate_fresh)], na.rm = TRUE)) %>%
  ungroup()
df3
# A tibble: 3 x 9
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh max_orderdate_dried max_orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>          <date>              <date>
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26      2018-08-23          2019-03-26
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26      2018-08-23          2019-03-26
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA              2018-08-23          2019-03-26
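An alternative worth considering is to avoid creating the infinite dates in step 2 in the first place. The sketch below uses two hypothetical helpers, safe_min and safe_max (these names are not part of base R or dplyr), which return a genuine NA Date when a group has no usable values. It uses base R only and a small stand-in vector rather than the full data frame:

```r
# Hypothetical helpers: return a true NA Date when nothing finite remains,
# instead of the Inf / -Inf that min() / max() would produce.
safe_min <- function(x) {
  ok <- x[is.finite(x)]
  if (length(ok) == 0) as.Date(NA) else min(ok)
}
safe_max <- function(x) {
  ok <- x[is.finite(x)]
  if (length(ok) == 0) as.Date(NA) else max(ok)
}

# Stand-in for orderdate_dried and orderno from df1
dried   <- as.Date(c(NA, NA, "2018-08-23"))
orderno <- c(2, 2, 1)

# Minimum per order number (the equivalent of step 2)
per_order <- lapply(split(dried, orderno), safe_min)
is.na(per_order[["2"]])  # TRUE: a real NA this time, no warning

# Maximum over the individual (the equivalent of step 3)
overall <- safe_max(c(per_order[["1"]], per_order[["2"]]))
overall
```

In the dplyr pipeline, this would amount to calling safe_min(orderdate_dried) in place of min(orderdate_dried, na.rm = TRUE) in step 2, so that is.na behaves as expected on the resulting column.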
Why does it show the value as NA when the value is Inf?
In step 2, what we are basically doing is
min(NA, na.rm = TRUE)
#[1] Inf
Warning message:
In min(NA, na.rm = TRUE) : no non-missing arguments to min; returning Inf
This returns Inf, along with the warning we saw earlier.
However, a column can hold values of only one class. Inf is numeric:
class(Inf)
#[1] "numeric"
But the orderdate_dried column of df1 holds data of class Date:
class(df1$orderdate_dried)
#[1] "Date"
So Inf is coerced to class "Date", which gives:
as.Date(min(NA, na.rm = TRUE))
#[1] NA
Again, this is displayed as NA, but it is not a real NA, and is.na returns FALSE for it:
is.na(as.Date(min(NA, na.rm = TRUE)))
#[1] FALSE
Hence, step 3 doesn't work as expected.
I hope this answer is clear and not too confusing.