Automatically detect date columns when reading a file into a data.frame


Question

When reading a file, the read.table function uses type.convert to distinguish between logical, integer, numeric, complex, or factor columns and store them accordingly.
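To make the starting point concrete, here is a small sketch of what type.convert does with a few character vectors (the sample values are made up for illustration). Date-like strings are not detected; they simply stay character:

```r
# type.convert picks the narrowest type that fits every value.
# as.is = TRUE keeps strings as character instead of making factors.
ints  <- type.convert(c("10", "20", "30"), as.is = TRUE)
nums  <- type.convert(c("1.5", "2"), as.is = TRUE)
lgls  <- type.convert(c("TRUE", "FALSE"), as.is = TRUE)
dates <- type.convert(c("2013/01/01", "2013/02/01"), as.is = TRUE)

# Dates are not part of the conversion, so the last one stays character.
sapply(list(ints = ints, nums = nums, lgls = lgls, dates = dates), class)
```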

I'd like to add dates to the mix, so that columns containing dates can automatically be recognized and parsed into Date objects. Only a few date formats should be recognized, e.g.

date.formats <- c("%m/%d/%Y", "%Y/%m/%d")

Here is an example:

fh <- textConnection(
 "num  char date-format1  date-format2  not-all-dates  not-same-formats
   10     a     1/1/2013    2013/01/01     2013/01/01          1/1/2013
   20     b     2/1/2013    2013/02/01              a        2013/02/01
   30     c     3/1/2013            NA              b          3/1/2013"
)

dat <- my.read.table(fh, header = TRUE, stringsAsFactors = FALSE,
                     date.formats = date.formats)
sapply(dat, class)

would give:

num              => numeric
char             => character
date-format1     => Date
date-format2     => Date
not-all-dates    => character
not-same-formats => character   # not a typo: date format must be consistent

Before I go and implement it from scratch, is something like this already available in a package? Or maybe someone already gave it a crack (or will) and is willing to share his code here? Thank you.

Recommended answer

Here is one I threw together quickly. It does not handle the last column properly because the as.Date function is not strict enough about formats (for example, as.Date("1/1/2013", "%Y/%m/%d") parses without complaint).
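The leniency is easy to reproduce. In the sketch below, the same string is parsed with both candidate formats; neither call returns NA, so a plain NA check cannot tell which format the input was actually written in (the variable names are only for illustration):

```r
x <- "1/1/2013"

# The intended format: month/day/year.
right <- as.Date(x, "%m/%d/%Y")

# The wrong format: as.Date still returns a non-NA Date here, because
# the fields accept fewer digits than the format suggests and any
# trailing characters are ignored.
wrong <- as.Date(x, "%Y/%m/%d")

is.na(right)
is.na(wrong)
print(right)
print(wrong)
```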

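my.read.table <- function(..., date.formats = c("%m/%d/%Y", "%Y/%m/%d")) {
   dat <- read.table(...)
   for (col.idx in seq_len(ncol(dat))) {
      x <- dat[, col.idx]
      # Only character or factor columns are candidates for dates
      if (!(is.character(x) || is.factor(x))) next
      if (all(is.na(x))) next
      for (f in date.formats) {
         d <- as.Date(as.character(x), f)
         # Try the next format if any non-missing value failed to parse
         if (any(is.na(d[!is.na(x)]))) next
         dat[, col.idx] <- d
         break  # keep the first format that parses the whole column
      }
   }
   dat
}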

dat <- my.read.table(fh, header = TRUE, stringsAsFactors = FALSE)
as.data.frame(sapply(dat, class))

#                  sapply(dat, class)
# num                         integer
# char                      character
# date.format1                   Date
# date.format2                   Date
# not.all.dates             character
# not.same.formats               Date

If you know a way to parse dates that is more strict around formats than as.Date (see the example above), please let me know.

Edit: To make the date parsing super strict, I can add

if (!identical(x, format(d, f))) next

For it to work, I will need all my input dates to have leading zeroes where needed, i.e. 01/01/2013 and not 1/1/2013. I can live with that if that's the standard way.
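As a rough sketch of how that round-trip check could be packaged, the helper below (detect_dates is a hypothetical name, not part of any package) returns a Date vector only when formatting the parsed dates reproduces the input exactly, and NULL otherwise. It assumes zero-padded input, as noted above:

```r
# Hypothetical helper: strict date detection for one column.
# Returns a Date vector if one format round-trips exactly, else NULL.
detect_dates <- function(x, date.formats = c("%m/%d/%Y", "%Y/%m/%d")) {
  x <- as.character(x)              # also covers factor columns
  if (all(is.na(x))) return(NULL)   # nothing to infer from
  for (f in date.formats) {
    d <- as.Date(x, format = f)
    # Strict check: formatting the result must give back the input.
    # NA entries in x stay NA on both sides, so they still match.
    if (identical(format(d, f), x)) return(d)
  }
  NULL
}

ok    <- detect_dates(c("01/01/2013", "02/01/2013"))
mixed <- detect_dates(c("1/1/2013", "2013/02/01"))
some  <- detect_dates(c("2013/01/01", "a"))
na_ok <- detect_dates(c("2013/01/01", NA))
```

Inside my.read.table, the inner loop over formats could then be replaced by a single call to this helper, assigning the result to the column when it is not NULL.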
