Inferring date format versus passing a parser
Question
Pandas internals question: I've been surprised to find a few times that explicitly passing a callable to date_parser within pandas.read_csv results in much slower read time than simply using infer_datetime_format=True.
Why is this? Will timing differences between these two options be date-format-specific, or what other factors will influence their relative timing?
In the below case, infer_datetime_format=True takes one-tenth the time of passing a date parser with a specified format. I would have naively assumed the latter would be faster because it's explicit.
The docs do note:
[if True,] pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.
but there's not much detail given and I was unable to work my way fully through the source.
Setup:
from io import StringIO
import numpy as np
import pandas as pd
np.random.seed(444)
dates = pd.date_range('1980', '2018')
df = pd.DataFrame(np.random.randint(0, 100, (len(dates), 2)),
                  index=dates).add_prefix('col').reset_index()
# Something reproducible to be read back in
buf = StringIO()
df.to_string(buf=buf, index=False)
def read_test(**kwargs):
    # Not ideal for .seek() to eat up runtime, but alleviate
    # this with more loops than needed in timing below
    buf.seek(0)
    # Raw string so that \s is passed to the regex engine unchanged
    return pd.read_csv(buf, sep=r'\s+', parse_dates=['index'], **kwargs)
# dateutil.parser.parser called in this case, according to docs
%timeit -r 7 -n 100 read_test()
18.1 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -r 7 -n 100 read_test(infer_datetime_format=True)
19.8 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Doesn't change with native Python datetime.strptime either
%timeit -r 7 -n 100 read_test(date_parser=lambda dt: pd.datetime.strptime(dt, '%Y-%m-%d'))
187 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I'm interested in knowing a bit about what is going on internally with infer to give it this advantage. My old understanding was that there was already some type of inference going on in the first place because dateutil.parser.parser is used if neither is passed.
Update: did some digging on this but haven't been able to answer the question.
read_csv() calls a helper function which in turn calls pd.core.tools.datetimes.to_datetime(). That function (accessible as just pd.to_datetime()) has both an infer_datetime_format and a format argument.
However, in this case, the relative timings are very different and don't reflect the above:
s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000']*1000)
%timeit pd.to_datetime(s,infer_datetime_format=True)
19.8 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.to_datetime(s,infer_datetime_format=False)
1.01 s ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This was taking the longest with i/o functions,
# now it's behaving "as expected"
%timeit pd.to_datetime(s,format='%m/%d/%Y')
19 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Recommended answer
You've identified the two important functions: read_csv prepares a function to parse the date columns using _make_date_converter, and this is always going to make a call to to_datetime (pandas' primary string-to-date conversion tool).
The answers by @WillAyd and @bmbigbang both seem correct to me in that they identify the cause of the slowness as repeated calls of the lambda function.
Since you ask for more details about pandas' source code, I'll try to examine each read_test call in more detail below to find out where we end up in to_datetime and ultimately why the timings are as you observed for your data.
The plain read_test() call is very fast because, without any hints about a possible date format, pandas is going to try to parse the list-like column of strings as though they're approximately in ISO8601 format (which is a very common case).
Stepping into to_datetime, we quickly reach this code branch:
if result is None and (format is None or infer_datetime_format):
    result = tslib.array_to_datetime(...)
From here on, it's compiled Cython code all the way.
array_to_datetime iterates through the column of strings to convert each one to datetime format. For each row, we hit _string_to_dts at this line; then we go to another short snippet of inlined code (_cstring_to_dts) which means parse_iso_8601_datetime is called to do the actual parsing of the string to a datetime object.
This function is more than capable of parsing dates in the YYYY-MM-DD format and so there is just some housekeeping to finish the job (the C struct filled by parse_iso_8601_datetime becomes a proper datetime object, some bounds are checked).
dateutil.parser.parser is not called at all.
Let's see why read_test(infer_datetime_format=True) is almost as fast as read_test().
Asking pandas to infer the datetime format (and passing no format argument) means we land here in to_datetime:
if infer_datetime_format and format is None:
    format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)
This calls _guess_datetime_format_for_array, which takes the first non-null value in the column and gives it to _guess_datetime_format. This tries to build a datetime format string to use for future parsing. (My answer here has more detail about the formats it is able to recognise.)
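The idea of guessing from the first non-null value can be illustrated with a minimal standard-library sketch. The guess_format helper and its CANDIDATE_FORMATS list are hypothetical and are not pandas' implementation (which, as far as I understand, tokenises the string rather than trying a fixed list); the point is only that a single value decides the format for the whole column.

```python
from datetime import datetime

# Hypothetical candidate list, for illustration only.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def guess_format(values):
    """Guess a strptime format by looking at the first non-null value only."""
    first = next((v for v in values if v), None)
    if first is None:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(first, fmt)
        except ValueError:
            continue  # this candidate does not match, try the next one
        return fmt
    return None  # nothing matched, so the caller needs a slower generic parser


column = [None, "3/11/2000", "3/12/2000", "3/13/2000"]
fmt = guess_format(column)
# The guessed format is then reused for every remaining value.
parsed = [datetime.strptime(v, fmt) for v in column if v]
```

Because only one value is inspected, a column with mixed formats would be guessed wrongly by a sketch like this.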
Fortunately, the YYYY-MM-DD format is one that can be recognised by this function. Even more fortunately, this particular format has a fast-path through the pandas code!
You can see pandas discard the inferred format again (setting format back to None) here:
if format is not None:
    # There is a special fast-path for iso8601 formatted
    # datetime strings, so in those cases don't use the inferred
    # format because this path makes process slower in this
    # special case
    format_is_iso8601 = _format_is_iso(format)
    if format_is_iso8601:
        require_iso8601 = not infer_datetime_format
        format = None
This allows the code to take the same path as above to the parse_iso_8601_datetime function.
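The dispatch described above can be sketched roughly as follows. Both functions are hypothetical simplifications: the real _format_is_iso accepts more separators and partial formats than the three layouts assumed here, and the returned strings merely name the pandas routines mentioned in this answer.

```python
def format_is_iso(fmt):
    """Simplified stand-in for pandas' _format_is_iso (assumption: only
    these three layouts count as ISO8601 in this sketch)."""
    return fmt in ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S")


def choose_parser(fmt):
    """Return the name of the routine that would parse the column, plus the
    format that routine would receive."""
    if fmt is not None and format_is_iso(fmt):
        fmt = None  # drop the format so the generic ISO8601 parser is used
    if fmt is None:
        return "array_to_datetime", None
    return "array_strptime", fmt


iso_choice = choose_parser("%Y-%m-%d")
other_choice = choose_parser("%m/%d/%Y")
```

Under these assumptions, an inferred ISO format ends up on the same route as passing no format at all, which is consistent with the near-identical timings in the question.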
With read_test(date_parser=...), we've provided a function to parse the dates with, so pandas executes this code block.
However, this raises an exception internally:
strptime() argument 1 must be str, not numpy.ndarray
This exception is immediately caught, and pandas falls back to using try_parse_dates before calling to_datetime.
Using try_parse_dates means that instead of being called on an array, the lambda function is called repeatedly for each value of the array in this loop:
for i from 0 <= i < n:
    if values[i] == '':
        result[i] = np.nan
    else:
        result[i] = parse_date(values[i])  # parse_date is the lambda function
Despite this loop being compiled code, we pay the penalty of a call into Python code for every value. This makes it very slow in comparison to the other approaches above.
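The "try the whole array, then fall back to one call per value" behaviour can be reproduced with the standard library alone. The convert_dates helper below is a hypothetical sketch, not pandas' converter, and it uses None for empty strings where the loop above uses np.nan.

```python
from datetime import datetime


def convert_dates(values, parser):
    """Try the parser on the whole sequence first; if that raises, fall back
    to calling it once per element."""
    try:
        return parser(values), "whole array"
    except Exception:
        # One Python-level function call per value: this is the slow part.
        result = [None if v == "" else parser(v) for v in values]
        return result, "element by element"


def parse(text):
    return datetime.strptime(text, "%Y-%m-%d")


# strptime rejects a list with a TypeError, so the fallback branch runs.
dates, how = convert_dates(["1980-01-01", "", "1980-01-03"], parse)
```

A parser that accepts a whole sequence would be called once and skip the per-element loop entirely.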
Back in to_datetime, we now have an object array filled with datetime objects. Again we hit array_to_datetime, but this time pandas sees a date object and uses another function (pydate_to_dt64) to make it into a datetime64 object.
The cause of the slowdown is really due to the repeated calls to the lambda function.
In your update, the Series s has date strings in the MM/DD/YYYY format.
This is not an ISO8601 format. pd.to_datetime(s, infer_datetime_format=False) tries to parse the string using parse_iso_8601_datetime but this fails with a ValueError. The error is handled here: pandas is going to use parse_datetime_string instead. This means that dateutil.parser.parse is used to convert the string to datetime. This is why it is slow in this case: repeated use of a Python function in a loop.
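The "try ISO8601 first, fall back to something slower" pattern can be sketched as below. The parse_one helper is hypothetical: datetime.fromisoformat stands in for parse_iso_8601_datetime, and a short list of strptime formats stands in for the far more flexible dateutil.parser.parse.

```python
from datetime import datetime


def parse_one(text, fallback_formats=("%m/%d/%Y",)):
    """Parse one string, reporting which route succeeded."""
    try:
        return datetime.fromisoformat(text), "iso"
    except ValueError:
        pass  # not ISO8601, so try the slower fallback route
    for fmt in fallback_formats:
        try:
            return datetime.strptime(text, fmt), "fallback"
        except ValueError:
            continue
    raise ValueError(f"could not parse {text!r}")


iso_result = parse_one("2000-03-11")
fallback_result = parse_one("3/11/2000")
```

Every value in a non-ISO column pays for the failed first attempt and then for the fallback call.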
There's not much difference between pd.to_datetime(s, format='%m/%d/%Y') and pd.to_datetime(s, infer_datetime_format=True) in terms of speed. The latter uses _guess_datetime_format_for_array again to infer the MM/DD/YYYY format. Both then hit array_strptime here:
if format is not None:
    ...
    if result is None:
        try:
            result = array_strptime(arg, format, exact=exact, errors=errors)
array_strptime is a fast Cython function for parsing an array of strings to datetime structs given a specific format.
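As a small check, assuming pandas is installed, the sketch below parses the same Series once with an explicit format in a single pd.to_datetime call and once with a Python-level strptime call per value. The results should agree; only the number of Python function calls differs. No timings are shown, since they depend on the machine and the pandas version.

```python
from datetime import datetime

import pandas as pd

s = pd.Series(["3/11/2000", "3/12/2000", "3/13/2000"] * 1000)
fmt = "%m/%d/%Y"

# One call: the whole column is handed to pandas with an explicit format.
vectorised = pd.to_datetime(s, format=fmt)

# 3000 calls: one Python-level strptime call per value.
per_element = [datetime.strptime(text, fmt) for text in s]
```

Wrapping each of the two approaches in %timeit, as in the question, is a simple way to see the gap on a given setup.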