Is there a way to improve speed of parsing date for large file?

Question

I am reading a big CSV file which has about 1B rows. I ran into an issue with parsing the dates: Python is slow at it.

A single line in the file looks like the following: '20170427,20:52:01.510,ABC,USD/MXN,1,OFFER,19.04274,9000000,9@15@8653948257753368229,0.0\n'

If I only loop through the data, it takes about 1 minute:

import datetime

t0 = datetime.datetime.now()
i = 0
with open(r"QuoteData.txt") as file:
    for line in file:
        i += 1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)

129908976
0:01:09.871744

But if I also parse the datetime, it takes 8 minutes:

import datetime

t0 = datetime.datetime.now()
i = 0
with open(r"D:\FxQuotes\ticks.log.20170427.txt") as file:
    for line in file:
        strings = line.split(",")

        datetime.datetime(
            int(strings[0][0:4]),        # %Y
            int(strings[0][4:6]),        # %m
            int(strings[0][6:8]),        # %d
            int(strings[1][0:2]),        # %H
            int(strings[1][3:5]),        # %M
            int(strings[1][6:8]),        # %S
            int(strings[1][9:]) * 1000,  # %f: milliseconds -> microseconds
        )

        i += 1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)

129908976
0:08:13.687000

The split() takes about 1 minute, and the date parsing takes about 6 minutes. Is there anything I could do to improve this?
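One observation worth adding here (my own illustration, not from the original post): in a one-day tick log like `ticks.log.20170427.txt`, the date field never changes, so the three `int()` calls on `strings[0]` are redundant on every row. A minimal sketch that caches the parsed date per distinct date string (the names `parse_tick_time` and `date_cache` are hypothetical):

```python
import datetime

# Cache of date string -> (year, month, day) tuple; in a one-day log
# this holds a single entry, so the date is parsed only once.
date_cache = {}

def parse_tick_time(line):
    date_s, time_s = line.split(",", 2)[:2]
    ymd = date_cache.get(date_s)
    if ymd is None:
        ymd = (int(date_s[0:4]), int(date_s[4:6]), int(date_s[6:8]))
        date_cache[date_s] = ymd
    return datetime.datetime(
        ymd[0], ymd[1], ymd[2],
        int(time_s[0:2]),            # %H
        int(time_s[3:5]),            # %M
        int(time_s[6:8]),            # %S
        int(time_s[9:]) * 1000,      # milliseconds -> microseconds
    )

line = '20170427,20:52:01.510,ABC,USD/MXN,1,OFFER,19.04274,9000000,9@15@8653948257753368229,0.0\n'
print(parse_tick_time(line))  # 2017-04-27 20:52:01.510000
```

This shaves work from the hot loop but does not address the main cost, which the accepted answer below tackles with a C parser.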

Answer

@TemporalWolf had the excellent suggestion of using ciso8601. I'd never heard of it, so I figured I'd give it a try.

First, I benchmarked my laptop with your sample line. I made a CSV file with 10 million rows of that exact line and it took about 6 seconds to read everything. Using your date parsing code brought that up to 48 seconds which made sense because you also reported it taking 8x longer. Then I scaled the file down to 1 million rows and I could read it in 0.6 seconds and parse dates in 4.8 seconds so everything looked right.

Then I switched over to ciso8601 and, almost like magic, the time for 1 million rows went from 4.8 seconds to about 1.9 seconds:

import datetime
import ciso8601

t0 = datetime.datetime.now()
i = 0
with open('input.csv') as file:
    for line in file:
        strings = line.split(",")
        d = ciso8601.parse_datetime('%sT%s' % (strings[0], strings[1]))
        i+=1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)

Note that your data is already almost in ISO 8601 format; I just had to join the date and time with a "T" in the middle.
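For reference (my addition, not part of the original answer), the same "T"-joined string also round-trips through the stdlib's `strptime`. That is far slower than ciso8601, but it is a handy correctness check, and it shows `%f` correctly scaling the 3-digit millisecond field to microseconds:

```python
import datetime

line = '20170427,20:52:01.510,ABC,USD/MXN,1,OFFER,19.04274,9000000,9@15@8653948257753368229,0.0\n'
strings = line.split(",")
joined = '%sT%s' % (strings[0], strings[1])  # '20170427T20:52:01.510'

# %f accepts 1-6 fractional digits, so '510' becomes 510000 microseconds.
d = datetime.datetime.strptime(joined, '%Y%m%dT%H:%M:%S.%f')
print(d)  # 2017-04-27 20:52:01.510000
```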
