How to most efficiently restructure a character string for fasttime in data.table


Problem description


I have a data.table with characters in two columns like so:

01/01/2014 | 00:30
02/01/2014 | 01:00
03/01/2014 | 01:30 etc

The length of this data set varies but is easily over 300,000 rows each time the script is run. Eventually I know this script will need to deal with a data set of 30,000,000 rows plus.

I currently paste them in the following form:

DT[, DateTime := paste(Date, Time)]

Which leads to:

01/01/2014 00:30
02/01/2014 01:00
03/01/2014 01:30 etc

I then use as.POSIXct to convert that into a POSIX date:

DT[, DateTime:= as.POSIXct(x = DateTime, format = "%d/%m/%Y %H:%M")]

This works fine, converting the characters correctly, largely I believe because I set the format argument to match the structure of the character string it is fed.

However, I'd like to use the fasttime package, but there is an inherent problem in that it does not support a format argument to input. Therefore, when I run:

DT[, DateTime := fastPOSIXct(x = DateTime)]

fasttime has to interpret my data in its own fixed order (per its documentation, the "order of interpretation is fixed: year, month, day, hour, minute, second"), so the output would come out like:

2006/07/07 00:30
2007/07/07 01:00
2008/07/07 01:30 etc

Therefore, it seems I must either use as.POSIXct or find a way to manipulate the string into the right order.

What would be the most efficient way to allow me to use fasttime? How should I reorder the character string to match? Would you expect that it would be worth reordering the character strings in order to use fasttime, or would the added requirement to correct the strings make fasttime savings negligible?

Solution

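Use sub to reorder your string first, and yes, I think that's going to be much faster than using base as.POSIXct. Since the input is day/month/year (as in your "%d/%m/%Y" format), the capture groups are emitted in the order 3, 2, 1 to give year-month-day:

DT[, DateTime := fastPOSIXct(sub('(\\d*)/(\\d*)/(\\d*) (.*)', '\\3-\\2-\\1 \\4', DateTime))]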
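Whether the reordering pays off is easy to check on your own machine. The following is a rough timing sketch, not a definitive benchmark: the synthetic data, the row count `n`, and the variable names are assumptions made for illustration. It assumes the strings are day/month/year (as in the question's "%d/%m/%Y" format), so the capture groups are emitted as 3-2-1. Both parsers are pinned to UTC, since fastPOSIXct treats its input as GMT while as.POSIXct would otherwise use the local time zone.

```r
library(data.table)
library(fasttime)

# Synthetic data in the question's layout (dd/mm/yyyy and HH:MM); n is arbitrary.
n <- 3e5
days <- as.Date("2014-01-01") + sample(0:364, n, replace = TRUE)
DT <- data.table(
  Date = format(days, "%d/%m/%Y"),
  Time = sprintf("%02d:%02d",
                 sample(0:23, n, replace = TRUE),
                 sample(c(0L, 30L), n, replace = TRUE))
)
DT[, DateTime := paste(Date, Time)]

# Baseline: base R parsing with an explicit format, interpreted as UTC.
t_base <- system.time(
  base_res <- as.POSIXct(DT$DateTime, format = "%d/%m/%Y %H:%M", tz = "UTC")
)

# Reorder to "yyyy-mm-dd HH:MM" with sub(), then parse with fastPOSIXct.
t_fast <- system.time(
  fast_res <- fastPOSIXct(
    sub("(\\d*)/(\\d*)/(\\d*) (.*)", "\\3-\\2-\\1 \\4", DT$DateTime),
    tz = "UTC"
  )
)

# Elapsed seconds for each route.
print(c(base = t_base[["elapsed"]], fast = t_fast[["elapsed"]]))

# Both routes should parse to the same instants.
print(all.equal(as.numeric(base_res), as.numeric(fast_res)))
```

The timings will vary with hardware and data size, so it is worth rerunning this with `n` closer to the 30,000,000 rows you expect.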

You might also be able to speed this up more using substr instead of regular expressions, but it'll be much messier.
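For what it's worth, here is a minimal sketch of the substr route. It assumes every Date is exactly "dd/mm/yyyy" (10 characters, zero-padded) and every Time is "HH:MM"; the helper name `reorder_datetime` is made up for illustration. It builds the string straight from the two columns, so the intermediate paste step is not needed. I haven't timed it against the sub version, so treat any speed gain as something to measure rather than assume.

```r
# Hypothetical helper: build "yyyy-mm-dd HH:MM" from fixed-width
# "dd/mm/yyyy" dates and "HH:MM" times using character positions only.
reorder_datetime <- function(date, time) {
  paste0(
    substr(date, 7L, 10L), "-",  # year
    substr(date, 4L, 5L), "-",   # month
    substr(date, 1L, 2L), " ",   # day
    time
  )
}

reorder_datetime(c("01/01/2014", "02/01/2014"), c("00:30", "01:00"))
# "2014-01-01 00:30" "2014-01-02 01:00"

# With data.table and fasttime loaded, this could replace paste() + sub():
# DT[, DateTime := fastPOSIXct(reorder_datetime(Date, Time))]
```

Dates that are not zero-padded (for example "1/1/2014") would break the fixed positions, which is where the regular expression is more forgiving.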
