Fast conversion of timestamps for duration calculation


Problem Description

We've got a log analyzer which parses logs on the order of 100 GB (my test file is ~20 million lines, 1.8 GB). It's taking longer than we'd like (upwards of half a day), so I ran it against cProfile, and >75% of the time is being taken by strptime:

       1    0.253    0.253  560.629  560.629 <string>:1(<module>)
20000423  202.508    0.000  352.246    0.000 _strptime.py:299(_strptime)

to calculate the durations between log entries, currently as:

ltime = datetime.strptime(split_line[time_col].strip(), "%Y-%m-%d %H:%M:%S")
lduration = (ltime - otime).total_seconds()

where otime is the timestamp from the previous line.
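For context, the surrounding hot loop presumably looks roughly like the sketch below; `time_col` and the pipe-delimited field layout are assumptions inferred from the sample log lines, not code from the original post:

```python
from datetime import datetime

# Hypothetical reconstruction of the per-line duration loop.
# Assumption: the timestamp sits in the fourth pipe-delimited field.
lines = [
    "0000 | 774 | 475      | 2017-03-29 00:06:47 | M      |        63",
    "0001 | 774 | 475      | 2017-03-29 01:09:03 | M      |        63",
]
time_col = 3
otime = None
durations = []
for line in lines:
    split_line = line.split("|")
    ltime = datetime.strptime(split_line[time_col].strip(), "%Y-%m-%d %H:%M:%S")
    if otime is not None:
        durations.append((ltime - otime).total_seconds())
    otime = ltime
```

With the two sample entries above, `durations` comes out to `[3736.0]` (62 minutes 16 seconds). The strptime call in the body is the hotspot the profiler output points at.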

The log files are formatted like this:

0000 | 774 | 475      | 2017-03-29 00:06:47 | M      |        63
0001 | 774 | 475      | 2017-03-29 01:09:03 | M      |        63
0000 | 774 | 475      | 2017-03-29 01:19:50 | M      |        63
0001 | 774 | 475      | 2017-03-29 09:42:57 | M      |        63
0000 | 775 | 475      | 2017-03-29 10:24:34 | M      |        63
0001 | 775 | 475      | 2017-03-29 10:33:46 | M      |        63    

It takes almost 10 minutes to run it against the test file.

Replacing strptime() with this (from this question):

def to_datetime(d):
    # parse "YYYY-MM-DD HH:MM:SS" by fixed-offset slicing
    return datetime.datetime(int(d[:4]),
                             int(d[5:7]),
                             int(d[8:10]),
                             int(d[11:13]),
                             int(d[14:16]),
                             int(d[17:19]))

brings that down to just over 3 minutes.
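A quick, self-contained sanity check (my addition) that the slicing parser agrees with strptime on a sample timestamp:

```python
import datetime

def to_datetime(d):
    # parse "YYYY-MM-DD HH:MM:SS" by fixed-offset slicing
    return datetime.datetime(int(d[:4]), int(d[5:7]), int(d[8:10]),
                             int(d[11:13]), int(d[14:16]), int(d[17:19]))

s = "2017-03-29 10:24:34"
assert to_datetime(s) == datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
```

The speedup comes from skipping strptime's per-call format-string interpretation and locale machinery; the slicing version only does six int() conversions and one constructor call.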

cProfile again reports:

       1    0.265    0.265  194.538  194.538 <string>:1(<module>)
20000423   62.688    0.000   62.688    0.000 analyzer.py:88(to_datetime)

This conversion still accounts for about a third of the analyzer's total runtime. Inlining it reduces the conversion's footprint by about 20%, but roughly 25% of the time spent processing these lines still goes to converting the timestamp to datetime format (with total_seconds() consuming another ~5% on top of that).

I may end up just writing a custom timestamp-to-seconds conversion to bypass datetime entirely, unless someone has another bright idea?

Recommended Answer

So I kept looking, and I've found a module that does a fantastic job:

Introducing ciso8601:

from ciso8601 import parse_datetime
...
ltime = parse_datetime(sline[time_col].strip())

Which, via cProfile:

       1    0.254    0.254  123.795  123.795 <string>:1(<module>)
20000423    4.188    0.000    4.188    0.000 {ciso8601.parse_datetime}

which is ~84x faster than the naive approach via datetime.strptime()... which is not surprising, given they wrote a C module to do it.
