Date columns with NAs in R - unexpected behaviour with mutate


Problem description

I'm trying to follow this process with a dataset. Here is a test data frame:

id <- c("Johnboy","Johnboy","Johnboy")
orderno <- c(2,2,1)
validorder <- c(0,1,1)
ordertype <- c(95,94,95)
orderdate <- as.Date(c("2019-06-17","2019-03-26","2018-08-23"))

df <- data.frame(id, orderno, validorder, ordertype, orderdate)

Then I do the following:

library(dplyr)

## compute order date for order types
df <- df %>%
  mutate(orderdate_dried = if_else(validorder == 1 &
                                  ordertype == 95,
                                  orderdate, as.Date(NA)),
         orderdate_fresh = if_else(validorder == 1 &
                                  ordertype == 94,
                                  orderdate, as.Date(NA)))

## take minimum order date by type by order number
df <- df %>%
  group_by(id, orderno) %>%
  mutate(orderdate_dried = min(orderdate_dried, na.rm = TRUE),
         orderdate_fresh = min(orderdate_fresh, na.rm = TRUE)) %>%
  ungroup()

## aggregate order date for each type over individual
df <- df %>%
  group_by(id) %>%
  mutate(max_orderdate_dried = max(orderdate_dried, na.rm=TRUE),
         max_orderdate_fresh = max(orderdate_fresh, na.rm=TRUE)) %>%
  ungroup()

But all the maximum dates at the end of this process are NA, and I don't understand how. Further, if I test the original orderdate_dried for NAs:

is.na(df$orderdate_dried)

I get NAs for each row! How is this happening?!

Recommended answer

Very interesting question, and the answer is hidden in the question itself. For clarity, instead of updating the same df every time, I will use df1, df2, etc.

First, start with the data.

id <- c("Johnboy","Johnboy","Johnboy")
orderno <- c(2,2,1)
validorder <- c(0,1,1)
ordertype <- c(95,94,95)
orderdate <- as.Date(c("2019-06-17","2019-03-26","2018-08-23"))

df <- data.frame(id, orderno, validorder, ordertype, orderdate)

library(dplyr)

Step 1:

df1 <- df %>%
        mutate(orderdate_dried = if_else(validorder == 1 &
                                         ordertype == 95,
                                        orderdate, as.Date(NA)),
               orderdate_fresh = if_else(validorder == 1 &
                                         ordertype == 94,
                                         orderdate, as.Date(NA)))

df1
#       id orderno validorder ordertype  orderdate orderdate_dried orderdate_fresh
#1 Johnboy       2          0        95 2019-06-17            <NA>            <NA>
#2 Johnboy       2          1        94 2019-03-26            <NA>      2019-03-26
#3 Johnboy       1          1        95 2018-08-23      2018-08-23            <NA>

Everything is as expected here.

Step 2:

df2 <- df1 %>%
        group_by(id, orderno) %>%
        mutate(orderdate_dried = min(orderdate_dried, na.rm = TRUE),
                orderdate_fresh = min(orderdate_fresh, na.rm = TRUE)) %>%
        ungroup()

df2
# A tibble: 3 x 7
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>         
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26     
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26     
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA           

Everything seems as expected here as well: a group that has no non-missing date shows NA.

Step 3:

df3 <- df2 %>%
        group_by(id) %>%
        mutate(max_orderdate_dried = max(orderdate_dried, na.rm=TRUE),
               max_orderdate_fresh = max(orderdate_fresh, na.rm=TRUE)) %>%
        ungroup()

df3
# A tibble: 3 x 9
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh max_orderdate_dried max_orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>          <date>              <date>             
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26      NA                  NA                 
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26      NA                  NA                 
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA              NA                  NA    

Everything seems to be wrong here. These are basically the same steps that you performed and this is the same output that you get, so we haven't done anything differently so far.

One thing we missed, though, is that step 2 produced a warning message:


Warning messages:
1: In min.default(c(NA_real_, NA_real_), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In min.default(NA_real_, na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
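Because the warning is the only visible sign of the problem, it can help to catch it explicitly. The following is a minimal sketch in base R, not part of the original answer: it wraps the same kind of min() call in tryCatch and falls back to a real NA date when the warning fires. The fallback choice is an assumption made for illustration.

```r
# An all-missing Date vector, like the (id, orderno) group in step 2
# that has no valid "dried" order.
all_missing <- as.Date(c(NA, NA))

# Catch the "no non-missing arguments" warning instead of letting
# min() silently hand back Inf.
result <- tryCatch(
  min(all_missing, na.rm = TRUE),
  warning = function(w) {
    message("min() warned: ", conditionMessage(w))
    as.Date(NA)  # assumed fallback: a genuine missing date
  }
)

is.na(result)
```

With the fallback in place, is.na reports the value as missing, which is what the later steps rely on.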

Because one group had no non-NA value, min returned Inf for it, even though the printed output of df2 shows NA (the reason it prints as NA when the value is Inf is explained at the end of this answer). So a test with is.na does not flag these values:

is.na(df2$orderdate_dried)
#[1] FALSE FALSE FALSE
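One way to look behind the printed NA is to strip the Date class and inspect the underlying number. This is a small base R sketch using a standalone vector rather than df2; the variable names are illustrative only.

```r
# Reproduce what step 2 computes for a group with no valid date.
all_missing <- as.Date(c(NA, NA))
hidden_inf <- suppressWarnings(min(all_missing, na.rm = TRUE))

class(hidden_inf)      # still a Date
unclass(hidden_inf)    # the stored number behind the Date
is.na(hidden_inf)      # the check that failed in the question
is.finite(hidden_inf)  # the check used in the solution
```

Comparing is.na with is.finite on the same value shows why only the latter separates real dates from the leftover Inf.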

Hence, max with na.rm = TRUE fails as well:

max(df2$orderdate_dried, na.rm = TRUE)
#[1] NA

Hence, you get all NAs in step 3.

Solution

The solution is to keep only the finite values, using is.finite:

df3 <- df2 %>%
        group_by(id) %>%
        mutate(max_orderdate_dried = max(orderdate_dried[is.finite(orderdate_dried)], na.rm=TRUE),
               max_orderdate_fresh = max(orderdate_fresh[is.finite(orderdate_fresh)], na.rm=TRUE)) %>%
        ungroup()


df3
# A tibble: 3 x 9
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh max_orderdate_dried max_orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>          <date>              <date>             
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26      2018-08-23          2019-03-26         
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26      2018-08-23          2019-03-26         
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA              2018-08-23          2019-03-26   
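An alternative to filtering in step 3 is to avoid creating Inf in step 2 at all. The sketch below assumes a small helper, safe_min_date, which is a hypothetical name and not part of dplyr or base R: it returns a genuine NA date when every value in the group is missing.

```r
# Hypothetical helper (not from dplyr or base R): minimum of a Date
# vector that yields a true NA when every value is missing.
safe_min_date <- function(x) {
  if (all(is.na(x))) {
    return(as.Date(NA))
  }
  min(x, na.rm = TRUE)
}

# Two groups modelled on the example: one with no valid date, one with one.
dried <- as.Date(c(NA, NA))
fresh <- as.Date(c(NA, "2019-03-26"))

is.na(safe_min_date(dried))
safe_min_date(fresh)
```

Inside the grouped mutate of step 2, this would be used as orderdate_dried = safe_min_date(orderdate_dried), after which max with na.rm = TRUE in step 3 sees ordinary NAs and needs no is.finite filter.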






Why does the value show as NA when it is actually Inf?

In step 2, what we are basically doing is:

min(NA, na.rm = TRUE)
#[1] Inf




Warning message:
In min(NA, na.rm = TRUE) :
  no non-missing arguments to min; returning Inf

This returns Inf, together with the warning we saw earlier.

However, a column can hold values of only one class. Inf is numeric:

class(Inf)
#[1] "numeric"

but the orderdate_dried column of df1 holds data of class "Date":

class(df1$orderdate_dried)
#[1] "Date"

so Inf is coerced to class "Date", and the result prints as NA:

# origin is given explicitly because older R versions require it for numeric input
as.Date(min(NA, na.rm = TRUE), origin = "1970-01-01")
#[1] NA

Again, this prints NA, but it is not a real NA, so is.na returns FALSE for it:

is.na(as.Date(min(NA, na.rm = TRUE), origin = "1970-01-01"))
#[1] FALSE
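If a column already contains such values, they can be turned into real NAs after the fact. This is a sketch under the assumption that an infinite date should be treated as missing; the vector x stands in for a column like orderdate_dried.

```r
# A Date vector holding one real date and one hidden Inf.
hidden_inf <- suppressWarnings(min(as.Date(c(NA, NA)), na.rm = TRUE))
x <- c(as.Date("2018-08-23"), hidden_inf)

is.na(x)  # before the repair

# Replace infinite entries with a genuine NA.
x[is.infinite(x)] <- NA

is.na(x)                 # after the repair
max(x, na.rm = TRUE)     # na.rm can now drop the missing entry
```

After the replacement, is.na and na.rm behave as the question expected them to.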

Hence, step 3 doesn't work as expected.

I hope this answer is clear and not too confusing.
