带有日期时间索引的大 pandas read_csv的速度提高 [英] Speed-improvement on large pandas read_csv with datetime index

查看:105
本文介绍了带有日期时间索引的大 pandas read_csv的速度提高的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有很多看起来像这样的文件:

I have enormous files that look like this:

05/31/2012,15:30:00.029,130​​6.25,1,E,0,,1306.25

05/31/2012,15:30:00.029,1306.25,1,E,0,,1306.25

05/31/2012,15:30:00.029,130​​6.25,8,E,0,,1306.25

05/31/2012,15:30:00.029,1306.25,8,E,0,,1306.25

我可以使用以下内容轻松阅读它们:

I can easily read them using the following:

  pd.read_csv(gzip.open("myfile.gz"), header=None,names=
  ["date","time","price","size","type","zero","empty","last"], parse_dates=[[0,1]])

有没有办法将这样的日期有效地解析为熊猫时间戳?如果没有,是否有编写可传递给date_parser =的cython函数的指南?

Is there any way to efficiently parse dates like this into pandas timestamps? If not, is there any guide for writing a cython function that can passed to date_parser= ?

我尝试编写自己的解析器函数,但是对于我正在处理的项目,它仍然花费太长时间.

I tried writing my own parser function and it still takes too long for the project I am working on.

推荐答案

以前的 Michael WS解决方案的改进:

  • 转换为pandas.Timestamp最好在Cython代码之外执行
  • atoi和处理native-c字符串的速度比python funcs快一点
  • datetime -lib调用的数量从2个减少为1个(有时为+1个)
  • 微秒也会被处理
  • conversion to pandas.Timestamp is better to perform outside the Cython code
  • atoi and processing native-c strings is a little-bit faster than python funcs
  • the number of datetime-lib calls is reduced to one from 2 (+1 occasional for date)
  • microseconds are also processed

NB!此代码中的日期顺序为日/月/年.

NB! The date order in this code is day/month/year.

所有代码似乎比原始convert_date_cython快大约10倍.但是,如果在read_csv之后调用它,则在SSD硬盘驱动器上,由于读取开销,总时间仅占百分之几.我猜想在普通硬盘上的差异会更小.

All in all the code seems to be approximately 10 times faster than the original convert_date_cython. However if this is called after read_csv then on SSD hard drive the difference is total time is only few percents due to the reading overhead. I would guess that on regular HDD the difference would be even smaller.

cimport numpy as np
import datetime
import numpy as np
import pandas as pd
from libc.stdlib cimport atoi, malloc, free 
from libc.string cimport strcpy

### Modified code from Michael WS:
### https://stackoverflow.com/a/15812787/2447082

def convert_date_fast(np.ndarray date_vec, np.ndarray time_vec):
    cdef int i, d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms
    cdef int N = len(date_vec)
    cdef np.ndarray out_ar = np.empty(N, dtype=np.object)  
    cdef bytes prev_date = <bytes> 'xx/xx/xxxx'
    cdef char *date_str = <char *> malloc(20)
    cdef char *time_str = <char *> malloc(20)

    for i in range(N):
        if date_vec[i] != prev_date:
            prev_date = date_vec[i] 
            strcpy(date_str, prev_date) ### xx/xx/xxxx
            date_str[2] = 0 
            date_str[5] = 0 
            d_year = atoi(date_str+6)
            d_month = atoi(date_str+3)
            d_day = atoi(date_str)

        strcpy(time_str, time_vec[i])   ### xx:xx:xx:xxxxxx
        time_str[2] = 0
        time_str[5] = 0
        time_str[8] = 0
        t_hour = atoi(time_str)
        t_min = atoi(time_str+3)
        t_sec = atoi(time_str+6)
        t_ms = atoi(time_str+9)

        out_ar[i] = datetime.datetime(d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms)
    free(date_str)
    free(time_str)
    return pd.to_datetime(out_ar)

这篇关于带有日期时间索引的大 pandas read_csv的速度提高的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆