How to most efficiently restructure a character string for fasttime in data.table
Problem description
I have a data.table with characters in two columns like so:
01/01/2014 | 00:30
02/01/2014 | 01:00
03/01/2014 | 01:30
etc.
The length of this data set varies but is easily over 300,000 rows each time the script is run. Eventually I know this script will need to deal with a data set of 30,000,000 rows plus.
I currently paste them together in the following form:
DT[, DateTime := paste(Date, Time)]
Which leads to:
01/01/2014 00:30
02/01/2014 01:00
03/01/2014 01:30
etc.
I then use as.POSIXct to convert that into a POSIX date:
DT[, DateTime := as.POSIXct(x = DateTime, format = "%d/%m/%Y %H:%M")]
This works fine, converting the characters correctly, largely I believe because I set the format argument to match the structure of the character string it is fed.
However, I'd like to use the fasttime package, but there is an inherent problem in that fastPOSIXct does not accept a format argument. Therefore, when I run:
DT[, DateTime := fastPOSIXct(x = DateTime)]
fasttime has to interpret my data in its own fixed order (the documentation says the "order of interpretation is fixed: year, month, day, hour, minute, second"), so the output would come out like:
2006/07/07 00:30
2007/07/07 01:00
2008/07/07 01:30
etc.
Therefore, it seems I must either use as.POSIXct or find a way to manipulate the string into the right order.
What would be the most efficient way to allow me to use fasttime? How should I reorder the character string to match? Would you expect it to be worth reordering the character strings in order to use fasttime, or would the added requirement to correct the strings make the fasttime savings negligible?
Solution
Use sub to reorder your string first, and yes, I think that's going to be much faster than using base as.POSIXct. Your input is day/month/year, so the replacement puts the year first, then the month (group 2), then the day (group 1):
DT[, DateTime := fastPOSIXct(sub('(\\d*)/(\\d*)/(\\d*) (.*)', '\\3-\\2-\\1 \\4', DateTime))]
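As a quick sanity check of the reordering, here is a minimal sketch that applies the sub() pattern to a few sample strings and compares the result with base as.POSIXct. It assumes the input is day/month/year, as the question's format string "%d/%m/%Y %H:%M" indicates, so the replacement puts the month (group 2) before the day (group 1). The vector names x, iso, ref and chk are illustrative only, and tz = "UTC" is used throughout because fastPOSIXct reads its input as GMT.

```r
# Sample strings in the question's day/month/year layout (illustrative data).
x <- c("01/01/2014 00:30", "02/01/2014 01:00", "03/01/2014 01:30")

# Reorder to "year-month-day time", the fixed order fastPOSIXct expects.
iso <- sub("(\\d*)/(\\d*)/(\\d*) (.*)", "\\3-\\2-\\1 \\4", x)

# Reference conversion with base R, parsing the original strings directly.
ref <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")

# The reordered strings parse to the same instants without a format argument.
chk <- as.POSIXct(iso, tz = "UTC")
stopifnot(identical(as.numeric(ref), as.numeric(chk)))

# If fasttime is installed, the reordered strings can go straight to fastPOSIXct.
if (requireNamespace("fasttime", quietly = TRUE)) {
  fast <- fasttime::fastPOSIXct(iso, tz = "UTC")
  stopifnot(all(abs(as.numeric(fast) - as.numeric(ref)) < 1))
}
```

If the real data were month/day/year instead, groups 1 and 2 in the replacement would need to be swapped; checking a handful of rows against as.POSIXct like this is a cheap way to confirm which order applies.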
You might also be able to speed this up more using substr instead of regular expressions, but it'll be much messier.
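The substr alternative could look roughly like the sketch below. It is only an illustration, and it assumes every string is fixed width and zero padded ("dd/mm/yyyy HH:MM"), which the sample data suggests but the question does not guarantee. The names x, iso_substr and iso_regex are hypothetical.

```r
# Sample strings in the question's day/month/year layout (illustrative data).
x <- c("01/01/2014 00:30", "02/01/2014 01:00", "03/01/2014 01:30")

# Fixed-width slices: day = chars 1-2, month = 4-5, year = 7-10, time = 12-16.
iso_substr <- paste0(
  substr(x, 7, 10), "-",   # year
  substr(x, 4, 5),  "-",   # month
  substr(x, 1, 2),  " ",   # day
  substr(x, 12, 16)        # HH:MM
)

# The regular-expression version of the same reordering, for comparison.
iso_regex <- sub("(\\d*)/(\\d*)/(\\d*) (.*)", "\\3-\\2-\\1 \\4", x)

# Both approaches should produce identical strings on well-formed input.
stopifnot(identical(iso_substr, iso_regex))
```

Whether this is actually faster than sub on 30,000,000 rows is worth timing on the real data (for example with system.time) rather than assuming. Unlike the regular expression, the fixed positions will silently produce wrong strings if any row is not zero padded.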