Select data based on datetime in a pandas DataFrame
Question
I am trying to create some sort of "functional select" that gives users the flexibility to write configuration for selecting data in pandas DataFrames. However, I ran into some issues that puzzle me.
The following is a simplified example:
>>> import pandas as pd
>>> df = pd.DataFrame({'date': pd.date_range(start='2020-01-01', periods=4), 'val': [1, 2, 3, 4]})
>>> df
        date  val
0 2020-01-01    1
1 2020-01-02    2
2 2020-01-03    3
3 2020-01-04    4
Question 1: Why do I get different results when I apply the function to the column in different ways?
>>> import datetime
>>> bydatetime = lambda x : x == datetime.date(2020, 1, 1)
>>> bydatetime(df['date'])
0 False
1 False
2 False
3 False
Name: date, dtype: bool
>>> df['date'].apply(bydatetime) # why does this one work?
0 True
1 False
2 False
3 False
Name: date, dtype: bool
However, if I use numpy's datetime64 or pandas' Timestamp type to create the lambda function, both ways of applying it work.
>>> import numpy as np
>>> bynpdatetime = lambda x : x == np.datetime64('2020-01-01')
>>> bynpdatetime(df['date'])
0 True
1 False
2 False
3 False
Name: date, dtype: bool
>>> df['date'].apply(bynpdatetime)
0 True
1 False
2 False
3 False
Name: date, dtype: bool
>>> bypdtimestamp = lambda x : x == pd.Timestamp('2020-01-01')
>>> bypdtimestamp(df['date'])
0 True
1 False
2 False
3 False
Name: date, dtype: bool
>>> df['date'].apply(bypdtimestamp)
0 True
1 False
2 False
3 False
Name: date, dtype: bool
So I reverted to the following simple selection, and using datetime.date still didn't work. If datetime.date simply doesn't work, why does df['date'].apply(bydatetime) work?
>>> df[df['date'] == datetime.date(2020, 1, 1)]
Empty DataFrame
Columns: [date, val]
Index: []
>>> df[df['date'] == np.datetime64('2020-01-01')]
        date  val
0 2020-01-01    1
>>> df[df['date'] == pd.Timestamp('2020-01-01')]
        date  val
0 2020-01-01    1
Last but not least, why is the type of the date column datetime64 in the DataFrame but Timestamp when a single cell is selected? What exactly is the difference between them?
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    4 non-null      datetime64[ns]
 1   val     4 non-null      int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 192.0 bytes
>>>
>>> df['date'][0]
Timestamp('2020-01-01 00:00:00')
I am sure there is something fundamental that I don't understand here. Thank you very much for anything constructive.
Recommended answer
Luckily I have an older version of pandas (0.25), where you get a warning when you do bydatetime(df['date']), and that warning explains exactly why you see this behavior. There was a bit of back and forth on how to handle this comparison, so the behavior you see will be highly version specific:
FutureWarning: Comparing Series of datetimes with 'datetime.date'. Currently, the 'datetime.date' is coerced to a datetime. In the future pandas will not coerce, and 'the values will not compare equal to the 'datetime.date'. To retain the current behavior, convert the 'datetime.date' to a datetime with 'pd.Timestamp'.
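The conversion the warning recommends can be sketched as follows. This is a minimal illustration that reuses the DataFrame from the question; the variable names `target`, `mask` and `selected` are only for this example.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range(start='2020-01-01', periods=4),
    'val': [1, 2, 3, 4],
})

# A standard-library date, e.g. one parsed from user configuration.
target = datetime.date(2020, 1, 1)

# Convert explicitly, so the comparison is between a datetime64 column
# and a pd.Timestamp and does not depend on pandas coercing datetime.date.
mask = df['date'] == pd.Timestamp(target)
selected = df[mask]
print(selected)
```

Because the conversion happens before the comparison, the result should not depend on how a given pandas version treats datetime.date.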
Datetime functionality in pandas is built upon the np.datetime64 and np.timedelta64 dtypes. You should not compare against objects from the standard-library datetime module, because pandas has made certain choices that are inconsistent with the standard library. All of the unintended behavior stems from this.
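If standard-library dates cannot be avoided (for instance because they come from user configuration), one alternative is to make both sides of the comparison standard-library dates. The sketch below uses the `.dt.date` accessor for that; `mask_by_date` is a name chosen for this example, and this is one possible approach rather than the one the answer prescribes.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range(start='2020-01-01', periods=4),
    'val': [1, 2, 3, 4],
})

target = datetime.date(2020, 1, 1)

# .dt.date returns an object-dtype Series of datetime.date values,
# so the == below compares date with date, element by element.
mask_by_date = df['date'].dt.date == target
print(df[mask_by_date])
```

Note that this drops the time of day, so it matches every timestamp that falls on the target date, and an object-dtype comparison is slower than a native datetime64 one on large columns.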
To answer the other, unrelated question: datetime64 is like the array type, or the concept. That array (in this case a pd.Series) is made up of scalar Timestamp objects. This is explained in the documentation.
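The difference between the column dtype and the scalar type can be checked directly. The following is a small sketch; the names `dates`, `first` and `as_numpy` are only for this example.

```python
import datetime

import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range(start='2020-01-01', periods=4))

# The Series as a whole has a NumPy-backed datetime64 dtype.
print(dates.dtype)

# A single element is handed back as a pandas Timestamp scalar.
first = dates.iloc[0]
print(type(first))

# Timestamp subclasses the standard-library datetime.datetime ...
print(isinstance(first, datetime.datetime))

# ... and can be converted to a NumPy scalar when needed.
as_numpy = first.to_datetime64()
print(type(as_numpy))
```

So the dtype describes how the whole column is stored, while Timestamp is the scalar wrapper pandas returns when one value is taken out of it.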