Speedup conversion of 2 million rows of date strings to POSIXct
Question
I have a csv which includes about 2 million rows of date strings in the format:
2012/11/13 21:10:00
Let's call this column csv$Date.and.Time.
I want to convert these dates (and their accompanying data) to xts as quickly as possible.
I have written a script which performs the conversion just fine (see below), but it's terribly slow and I'd like to speed it up as much as possible.
Here is my current methodology. Does anyone have any suggestions on how to make this faster?
dt <- as.POSIXct(csv$Date.and.Time,tz="UTC")
idx <- format(dt,tz=z,usetz=TRUE)
So the script converts these date strings to POSIXct. It then does a timezone conversion using format (z is a variable representing the TZ to which I am converting). I then do a regular xts call to make this an xts series with the rest of the data in the csv.
This works 100%. It's just very, very slow. I've tried running it in parallel, which doesn't help; if anything, it makes things worse. What do I mean by 'slow'?
user system elapsed
155.246 16.430 171.650
That's on a 3 GHz, 16 GB RAM 2012 MacBook Pro. I can get about half that on a similar processor with 32 GB RAM on a Win7 machine.
I'm sure someone has a better idea, and I'm open to suggestions via Rcpp etc. However, ideally the solution works with the csv rather than some other method, like setting up a database. Having said that, I'm up for doing this via whatever method is going to give the fastest conversion.
I'd be super appreciative of any help at all. Thanks in advance.
Recommended answer
You want the small and simple fasttime package by Simon, which does this in the fastest possible way: by not calling time parsing functions but just using C-level string functions.
It does not support as many formats as strptime. In fact, it doesn't even have a format string. But well-formed ISO format variants, that is yyyy-mm-dd hh:mm:ss.fff, will work, and your / separator may just work too.
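As a rough illustration of this suggestion, here is a minimal sketch using fastPOSIXct from the fasttime package. It assumes fasttime is installed. The two sample strings stand in for csv$Date.and.Time, and the target zone "America/New_York" is only an example value for z; neither comes from the original post. Since the answer only says the / separator may work, the sketch swaps it for - first to be safe. Note that fastPOSIXct always interprets its input as UTC, which matches the tz="UTC" in the question, and that its tz argument only sets the time zone attached to the result.

```r
# Sketch only: assumes the fasttime package is installed.
library(fasttime)

# Hypothetical stand-in for csv$Date.and.Time
date_strings <- c("2012/11/13 21:10:00", "2012/11/13 21:15:00")

# Precaution (assumption): replace "/" with "-" so the input is ISO-style text
iso_strings <- chartr("/", "-", date_strings)

# Example target time zone; not taken from the original post
z <- "America/New_York"

# Input is always read as UTC; tz only sets the zone attached to the result
dt <- fastPOSIXct(iso_strings, tz = z)

# Cross-check against the base R parser with an explicit format string
ref <- as.POSIXct(date_strings, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")
stopifnot(isTRUE(all.equal(as.numeric(dt), as.numeric(ref))))
```

Because dt is already a POSIXct vector carrying the target zone, it could probably be passed straight to xts as the order.by argument, without going through format, which returns character strings rather than date-times.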