Inferring date format versus passing a parser

Question

Pandas internals question: I've been surprised to find a few times that explicitly passing a callable to date_parser within pandas.read_csv results in much slower read time than simply using infer_datetime_format=True.

Why is this? Will timing differences between these two options be date-format-specific, or what other factors will influence their relative timing?

In the below case, infer_datetime_format=True takes one-tenth the time of passing a date parser with a specified format. I would have naively assumed the latter would be faster because it's explicit.

The docs do note:

[if True,] pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

but there's not much detail given and I was unable to work my way fully through the source.

Setup:

from io import StringIO

import numpy as np
import pandas as pd

np.random.seed(444)
dates = pd.date_range('1980', '2018')
df = pd.DataFrame(np.random.randint(0, 100, (len(dates), 2)),
                  index=dates).add_prefix('col').reset_index()

# Something reproducible to be read back in
buf = StringIO()
df.to_string(buf=buf, index=False)

def read_test(**kwargs):
    # Not ideal for .seek() to eat up runtime, but alleviate
    # this with more loops than needed in timing below
    buf.seek(0)
    return pd.read_csv(buf, sep=r'\s+', parse_dates=['index'], **kwargs)

# dateutil.parser.parser called in this case, according to docs
%timeit -r 7 -n 100 read_test()
18.1 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit -r 7 -n 100 read_test(infer_datetime_format=True)
19.8 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Doesn't change with native Python datetime.strptime either
%timeit -r 7 -n 100 read_test(date_parser=lambda dt: pd.datetime.strptime(dt, '%Y-%m-%d'))
187 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm interested in knowing a bit about what is going on internally with infer to give it this advantage. My old understanding was that there was already some type of inference going on in the first place because dateutil.parser.parser is used if neither is passed.

Update: did some digging on this but haven't been able to answer the question.

read_csv() calls a helper function which in turn calls pd.core.tools.datetimes.to_datetime(). That function (accessible as just pd.to_datetime()) has both an infer_datetime_format and a format argument.

However, in this case, the relative timings are very different and don't reflect the above:

s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000']*1000)

%timeit pd.to_datetime(s,infer_datetime_format=True)
19.8 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.to_datetime(s,infer_datetime_format=False)
1.01 s ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This was taking the longest with i/o functions,
# now it's behaving "as expected"
%timeit pd.to_datetime(s,format='%m/%d/%Y')
19 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Recommended answer

You've identified the two important functions: read_csv prepares a function to parse the date columns using _make_date_converter, and this is always going to make a call to to_datetime (pandas' primary string-to-date conversion tool).

The answers by @WillAyd and @bmbigbang both seem correct to me in that they identify the cause of the slowness as repeated calls of the lambda function.

Since you ask for more details about pandas' source code, I'll try and examine each read_test call in more detail below to find out where we end up in to_datetime and ultimately why the timings are as you observed for your data.

The plain read_test() call is very fast because, without any hints about a possible date format, pandas is going to try and parse the list-like column of strings as though they're approximately in ISO8601 format (which is a very common case).

Stepping into to_datetime, we quickly reach this code branch:

if result is None and (format is None or infer_datetime_format):
    result = tslib.array_to_datetime(...)

From here on, it's compiled Cython code all the way.

array_to_datetime iterates through the column of strings to convert each one to datetime format. For each row, we hit _string_to_dts at this line; then we go to another short snippet of inlined code (_cstring_to_dts) which means parse_iso_8601_datetime is called to do the actual parsing of the string to a datetime object.

This function is more than capable of parsing dates in the YYYY-MM-DD format and so there is just some housekeeping to finish the job (the C struct filled by parse_iso_8601_datetime becomes a proper datetime object, some bounds are checked).

dateutil.parser.parser is not called at all.
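As a rough illustration of the behaviour described above (it exercises only the public pd.to_datetime, not the internal Cython functions), ISO8601-style strings convert without any format hint or parser. The sample values are made up.

```python
import pandas as pd

# Made-up sample data in YYYY-MM-DD (ISO8601-style) form
iso_strings = pd.Series(['1980-01-01', '1980-01-02', '1980-01-03'])

# No format string and no parser supplied: pandas handles the
# ISO8601-style strings on its own
parsed = pd.to_datetime(iso_strings)

# The result is a datetime64 column rather than a column of strings
print(parsed.dtype)
print(parsed.iloc[2])
```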

Let's see why read_test(infer_datetime_format=True) is almost as fast as read_test().

Asking pandas to infer the datetime format (and passing no format argument) means we land here in to_datetime:

if infer_datetime_format and format is None:
    format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)

This calls _guess_datetime_format_for_array, which takes the first non-null value in the column and gives it to _guess_datetime_format. This tries to build a datetime format string to use for future parsing. (My answer here has more detail about the formats it is able to recognise.)
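The idea of guessing a format from the first non-null value can be sketched in plain Python. This is a deliberately simplified stand-in, not pandas' implementation: guess_format and CANDIDATE_FORMATS are hypothetical names, and the short list of candidate formats is an assumption made for the example.

```python
from datetime import datetime

# Hypothetical candidate list for illustration only; pandas' own
# guesser is more involved than trying a fixed list of formats.
CANDIDATE_FORMATS = ['%Y-%m-%d', '%m/%d/%Y', '%d.%m.%Y']

def guess_format(values):
    """Return the first candidate format that parses the first non-null value."""
    first = next((v for v in values if v not in (None, '')), None)
    if first is None:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(first, fmt)
        except ValueError:
            continue  # this candidate does not match, try the next one
        return fmt
    return None  # nothing matched, so no format could be guessed

print(guess_format([None, '3/11/2000', '3/12/2000']))  # %m/%d/%Y
print(guess_format(['1980-01-01']))                    # %Y-%m-%d
```

Only one value is inspected, which is why guessing is cheap compared with parsing the whole column.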

Fortunately, the YYYY-MM-DD format is one that can be recognised by this function. Even more fortunately, this particular format has a fast-path through the pandas code!

You can see pandas discards the inferred format again (setting format back to None) here:

if format is not None:
    # There is a special fast-path for iso8601 formatted
    # datetime strings, so in those cases don't use the inferred
    # format because this path makes process slower in this
    # special case
    format_is_iso8601 = _format_is_iso(format)
    if format_is_iso8601:
        require_iso8601 = not infer_datetime_format
        format = None

This allows the code to take the same path as above to the parse_iso_8601_datetime function.

In the read_test(date_parser=...) call, we've provided a function to parse the date with, so pandas executes this code block.

However, this raises an exception internally:

strptime() argument 1 must be str, not numpy.ndarray

This exception is immediately caught, and pandas falls back to using try_parse_dates before calling to_datetime.

try_parse_dates means that instead of being called on an array, the lambda function is called repeatedly for each value of the array in this loop:

for i from 0 <= i < n:
    if values[i] == '':
        result[i] = np.nan
    else:
        result[i] = parse_date(values[i]) # parse_date is the lambda function

Despite being compiled code, we pay the penalty of having function calls to Python code. This makes it very slow in comparison to the other approaches above.
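The "try the whole column, then fall back to one call per value" pattern can be sketched with the standard library alone. This is a simplified model of the behaviour described above, not pandas' code: convert and parse_one are hypothetical names, and a plain list stands in for the NumPy array.

```python
from datetime import datetime

def parse_one(text):
    # Scalar-only parser, like the lambda in the question
    return datetime.strptime(text, '%Y-%m-%d')

def convert(values, parser):
    """Try the parser on the whole column first, then fall back to a loop."""
    try:
        # strptime rejects a non-string argument with a TypeError
        return parser(values), 'whole-column'
    except Exception:
        # Fallback: one Python-level function call per value
        return [parser(v) for v in values], 'per-element'

column = ['1980-01-01', '1980-01-02', '1980-01-03']
result, path = convert(column, parse_one)
print(path)       # per-element
print(result[0])  # 1980-01-01 00:00:00
```

With a column of roughly 14,000 dates, as in the question, the fallback means roughly 14,000 separate Python calls.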

Back in to_datetime, we now have an object array filled with datetime objects. Again we hit array_to_datetime, but this time pandas sees a date object and uses another function (pydate_to_dt64) to make it into a datetime64 object.

The cause of the slowdown is really due to the repeated calls to the lambda function.
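If the per-value calls are the bottleneck, one possible workaround is a parser that accepts the whole array at once, so the first attempt does not raise. The sketch below only checks that such a callable works on an array of strings. parse_column is a hypothetical name, and using it as date_parser assumes a pandas version that still accepts that argument; it has not been timed here.

```python
import numpy as np
import pandas as pd

def parse_column(col):
    # Accepts a whole array of strings in one call, unlike
    # datetime.strptime, which only accepts a single string
    return pd.to_datetime(col, format='%Y-%m-%d')

# Made-up sample standing in for the column pandas would pass in
strings = np.array(['1980-01-01', '1980-01-02'], dtype=object)
parsed = parse_column(strings)

print(len(parsed))  # 2
print(parsed[0])
```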

The Series s has date strings in the MM/DD/YYYY format.

This is not an ISO8601 format. pd.to_datetime(s, infer_datetime_format=False) tries to parse the string using parse_iso_8601_datetime but this fails with a ValueError. The error is handled here: pandas is going to use parse_datetime_string instead. This means that dateutil.parser.parse is used to convert the string to datetime. This is why it is slow in this case: repeated use of a Python function in a loop.

There's not much difference between pd.to_datetime(s, format='%m/%d/%Y') and pd.to_datetime(s, infer_datetime_format=True) in terms of speed. The latter uses _guess_datetime_format_for_array again to infer the MM/DD/YYYY format. Both then hit array_strptime here:

if format is not None:
    ...
    if result is None:
        try:
            result = array_strptime(arg, format, exact=exact, errors=errors)

array_strptime is a fast Cython function for parsing an array of strings to datetime structs given a specific format.
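As a small check on the Series from the question, passing an explicit format to pd.to_datetime should give the same values as calling datetime.strptime once per string in a Python loop. The difference is in where the loop runs, not in the result. The variable names vectorised and looped are only for this example.

```python
from datetime import datetime

import pandas as pd

s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'] * 1000)

# One call on the whole Series with an explicit format
vectorised = pd.to_datetime(s, format='%m/%d/%Y')

# Equivalent result built with one Python-level call per string
looped = [datetime.strptime(x, '%m/%d/%Y') for x in s]

# Compare the two element by element
same = all(v == l for v, l in zip(vectorised, looped))
print(same)  # True
```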
