Select data based on datetime in a pandas DataFrame

Problem description

I am trying to build a kind of "functional select" that lets users write configuration to select data in pandas DataFrames. However, I ran into some behavior that puzzles me.

The following is a simplified example:

>>> import pandas as pd
>>> df = pd.DataFrame({'date': pd.date_range(start='2020-01-01', periods=4), 'val': [1, 2, 3, 4]})
>>> df
        date  val
0 2020-01-01    1
1 2020-01-02    2
2 2020-01-03    3
3 2020-01-04    4

Question 1: Why do I get different results depending on how I apply the function to the column?

>>> import datetime
>>> bydatetime = lambda x : x == datetime.date(2020, 1, 1)
>>> bydatetime(df['date'])
0    False
1    False
2    False
3    False
Name: date, dtype: bool
>>> df['date'].apply(bydatetime) # why does this one work?
0     True
1    False
2    False
3    False
Name: date, dtype: bool

However, if I build the lambda with numpy's datetime64 or pandas' Timestamp type, both ways of applying it work.

>>> import numpy as np
>>> bynpdatetime = lambda x : x == np.datetime64('2020-01-01')
>>> bynpdatetime(df['date'])
0     True
1    False
2    False
3    False
Name: date, dtype: bool
>>> df['date'].apply(bynpdatetime)
0     True
1    False
2    False
3    False
Name: date, dtype: bool
>>> bypdtimestamp = lambda x : x == pd.Timestamp('2020-01-01')
>>> bypdtimestamp(df['date'])
0     True
1    False
2    False
3    False
Name: date, dtype: bool
>>> df['date'].apply(bypdtimestamp)
0     True
1    False
2    False
3    False
Name: date, dtype: bool

So I went back to the following plain selection, and datetime.date did not work there either. If datetime.date simply doesn't work, why does df['date'].apply(bydatetime) work?

>>> df[df['date'] == datetime.date(2020, 1, 1)]
Empty DataFrame
Columns: [date, val]
Index: []
>>> df[df['date'] == np.datetime64('2020-01-01')]
        date  val
0 2020-01-01    1
>>> df[df['date'] == pd.Timestamp('2020-01-01')]
        date  val
0 2020-01-01    1

Last but not least, why is the dtype of the date column datetime64 in the DataFrame, but the type Timestamp when I select a single cell? What exactly is the difference between them?

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    4 non-null      datetime64[ns]
 1   val     4 non-null      int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 192.0 bytes
>>>
>>> df['date'][0]
Timestamp('2020-01-01 00:00:00')

I am sure there is something fundamental that I don't understand here. Thank you very much for anything constructive.

Recommended answer

Luckily I have an older version of pandas (0.25), where you get a warning when you run bydatetime(df['date']) that explains exactly why you see this behavior. There was some back and forth on how to handle this case, so the behavior you see is highly version-specific:

FutureWarning: Comparing Series of datetimes with 'datetime.date'. Currently, the 'datetime.date' is coerced to a datetime. In the future pandas will not coerce, and 'the values will not compare equal to the 'datetime.date'. To retain the current behavior, convert the 'datetime.date' to a datetime with 'pd.Timestamp'.

Datetime functionality in pandas is built upon the np.datetime64 and np.timedelta64 dtypes. You should not mix objects from the standard-library datetime module into pandas comparisons, because pandas has made certain choices that are inconsistent with the standard library. All of the unintended behavior comes from this.
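
As the warning suggests, one way to sidestep the version-specific behavior is to convert the value to pd.Timestamp before comparing. The following is only a minimal sketch of that idea for the "functional select" use case. The helper name make_date_filter is hypothetical and not part of pandas or of the question.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range(start="2020-01-01", periods=4),
    "val": [1, 2, 3, 4],
})


def make_date_filter(value):
    """Hypothetical helper: build an equality predicate for `value`.

    The value is converted to pd.Timestamp once, so the comparison does not
    depend on how a given pandas version treats datetime.date.
    """
    target = pd.Timestamp(value)
    return lambda x: x == target


by_date = make_date_filter(datetime.date(2020, 1, 1))

mask_vectorised = by_date(df["date"])         # compare the whole Series at once
mask_elementwise = df["date"].apply(by_date)  # compare one Timestamp at a time

selected = df[mask_vectorised]
print(selected)
```

With the conversion in place, calling the predicate on the whole column and passing it to apply should select the same rows, because both paths end up comparing Timestamp values with a Timestamp.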

To answer the other, unrelated question: datetime64 is the array type, or the concept. That array (in this case a pd.Series) is made up of scalar Timestamp objects, which is why selecting a single cell gives you a Timestamp. This is explained in the pandas documentation.
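
The distinction between the column dtype and the scalar type can be inspected directly. This is a small illustrative sketch, and the variable names are made up for the example. It prints the types rather than asserting a specific time unit, since that may differ between pandas versions.

```python
import datetime

import numpy as np
import pandas as pd

s = pd.Series(pd.date_range(start="2020-01-01", periods=4), name="date")

# The column as a whole has a numpy-style datetime64 dtype.
column_is_datetime64 = pd.api.types.is_datetime64_any_dtype(s)

# Selecting one element through pandas yields a pd.Timestamp scalar.
first = s.iloc[0]

# Going through the underlying numpy array yields an np.datetime64 scalar.
raw_first = s.to_numpy()[0]

print(s.dtype)
print(type(first))
print(type(raw_first))
```

pd.Timestamp is a subclass of datetime.datetime, so the scalar can be used wherever a datetime.datetime is expected, while the array itself stores the values in numpy's datetime64 representation.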
