带有日期时间索引的大 pandas read_csv的速度提高 [英] Speed-improvement on large pandas read_csv with datetime index
问题描述
我有很多看起来像这样的文件:
I have enormous files that look like this:
05/31/2012,15:30:00.029,1306.25,1,E,0,,1306.25
05/31/2012,15:30:00.029,1306.25,1,E,0,,1306.25
05/31/2012,15:30:00.029,1306.25,8,E,0,,1306.25
05/31/2012,15:30:00.029,1306.25,8,E,0,,1306.25
我可以使用以下内容轻松阅读它们:
I can easily read them using the following:
pd.read_csv(gzip.open("myfile.gz"), header=None,names=
["date","time","price","size","type","zero","empty","last"], parse_dates=[[0,1]])
有没有办法将这样的日期有效地解析为熊猫时间戳?如果没有,是否有编写可传递给date_parser =的cython函数的指南?
Is there any way to efficiently parse dates like this into pandas timestamps? If not, is there any guide for writing a cython function that can passed to date_parser= ?
我尝试编写自己的解析器函数,但是对于我正在处理的项目,它仍然花费太长时间.
I tried writing my own parser function and it still takes too long for the project I am working on.
推荐答案
以前的 Michael WS解决方案的改进:
- 转换为
pandas.Timestamp
最好在Cython代码之外执行 -
atoi
和处理native-c字符串的速度比python funcs快一点 -
datetime
-lib调用的数量从2个减少为1个(有时为+1个) - 微秒也会被处理
- conversion to
pandas.Timestamp
is better to perform outside the Cython code atoi
and processing native-c strings is a little-bit faster than python funcs- the number of
datetime
-lib calls is reduced to one from 2 (+1 occasional for date) - microseconds are also processed
NB!此代码中的日期顺序为日/月/年.
NB! The date order in this code is day/month/year.
所有代码似乎比原始convert_date_cython
快大约10倍.但是,如果在read_csv
之后调用它,则在SSD硬盘驱动器上,由于读取开销,总时间仅占百分之几.我猜想在普通硬盘上的差异会更小.
All in all the code seems to be approximately 10 times faster than the original convert_date_cython
. However if this is called after read_csv
then on SSD hard drive the difference is total time is only few percents due to the reading overhead. I would guess that on regular HDD the difference would be even smaller.
cimport numpy as np
import datetime
import numpy as np
import pandas as pd
from libc.stdlib cimport atoi, malloc, free
from libc.string cimport strcpy
### Modified code from Michael WS:
### https://stackoverflow.com/a/15812787/2447082
def convert_date_fast(np.ndarray date_vec, np.ndarray time_vec):
cdef int i, d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms
cdef int N = len(date_vec)
cdef np.ndarray out_ar = np.empty(N, dtype=np.object)
cdef bytes prev_date = <bytes> 'xx/xx/xxxx'
cdef char *date_str = <char *> malloc(20)
cdef char *time_str = <char *> malloc(20)
for i in range(N):
if date_vec[i] != prev_date:
prev_date = date_vec[i]
strcpy(date_str, prev_date) ### xx/xx/xxxx
date_str[2] = 0
date_str[5] = 0
d_year = atoi(date_str+6)
d_month = atoi(date_str+3)
d_day = atoi(date_str)
strcpy(time_str, time_vec[i]) ### xx:xx:xx:xxxxxx
time_str[2] = 0
time_str[5] = 0
time_str[8] = 0
t_hour = atoi(time_str)
t_min = atoi(time_str+3)
t_sec = atoi(time_str+6)
t_ms = atoi(time_str+9)
out_ar[i] = datetime.datetime(d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms)
free(date_str)
free(time_str)
return pd.to_datetime(out_ar)
这篇关于带有日期时间索引的大 pandas read_csv的速度提高的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!