How to correctly use pandas agg function when running groupby on a column of type timestamp/datetime/datetime64?

Problem description

I'm trying to understand why calling count() directly on a group returns the correct answer (in this example, 2 rows in that group), but calling count via a lambda in the agg() function returns a timestamp at the beginning of the epoch ("1970-01-01 00:00:00.000000002").

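import numpy as np
from pandas import DataFrame
from IPython.display import display  # already available in IPython/Jupyter sessions

# Using groupby(lambda x: True) in the code below just as an illustrative example.
# It will always create a single group.
x = DataFrame({'time': [np.datetime64('2005-02-25'), np.datetime64('2006-03-30')]}).groupby(lambda x: True)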

display(x.count())
>>time
>>True  2

display(x.agg(lambda x: x.count()))
>>time
>>True  1970-01-01 00:00:00.000000002

Could this be a bug in pandas? I am using pandas 0.16.1, IPython 3.1.0, and numpy 1.9.2.

I get the same result regardless of whether I use the standard python datetime vs np.datetime64 vs the pandas Timestamp.

EDIT (as per the accepted answer from @jeff, it looks like I may need to coerce to dtype object before applying an aggregation function that doesn't return a datetime type):

import datetime
from pandas import DataFrame

dt = [datetime.datetime(2012, 5, 1)] * 2
x = DataFrame({'time': dt})
x['time2'] = x['time'].astype(object)
display(x)
y = x.groupby(lambda x: True)
y.agg(lambda x: x.count())

>>time  time2
>>True  1970-01-01 00:00:00.000000002   2
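
The two columns in the EDIT differ only in dtype, which can be checked directly. The following is a minimal sketch using the same data; `frame`, `time_kind`, and `time2_kind` are illustrative names not present in the original post, and only the dtype kind is inspected because the exact datetime resolution may vary between pandas versions.

```python
import datetime

import pandas as pd

# Same data as in the EDIT above; `frame` is an illustrative name.
dt = [datetime.datetime(2012, 5, 1)] * 2
frame = pd.DataFrame({'time': dt})
frame['time2'] = frame['time'].astype(object)

# 'time' is stored as a datetime64 column (dtype kind 'M'),
# while 'time2' holds the same values as generic Python objects (kind 'O').
time_kind = frame['time'].dtype.kind
time2_kind = frame['time2'].dtype.kind
print(frame.dtypes)
```

Because `time2` has no datetime dtype to be coerced back to, an integer count aggregated from it stays an integer.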

Recommended answer

Here x is the original frame from above (not with your groupby). Passing a UDF, e.g. the lambda, calls this on each Series. So this is the result of the function.

In [35]: x.count()
Out[35]: 
time    2
dtype: int64

Then coercion to the original dtype of the Series happens. So the result is:

In [36]: Timestamp(2)
Out[36]: Timestamp('1970-01-01 00:00:00.000000002')

which is exactly what you are seeing. The point of the coercion to the original dtype is to preserve it if at all possible. Not doing this would be even more magic on the groupby results.
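
The coercion described above can be reproduced in isolation, and the count can be obtained without a lambda. This is a sketch rather than a definitive fix: the variable names are illustrative, and the behaviour of `agg` with a lambda on datetime columns may differ in pandas versions newer than the 0.16.1 used in the question.

```python
import numpy as np
import pandas as pd

# An integer handed to Timestamp is read as nanoseconds since 1970-01-01,
# so a count of 2 turns into a timestamp 2 ns after the epoch.
coerced = pd.Timestamp(2)
print(coerced)

frame = pd.DataFrame({'time': [np.datetime64('2005-02-25'),
                               np.datetime64('2006-03-30')]})
grouped = frame.groupby(lambda label: True)  # a single group, as in the question

# Built-in reductions avoid the user-defined function path entirely.
direct_count = grouped.count()      # count of non-null values per column
named_count = grouped.agg('count')  # same reduction, requested by name
group_size = grouped.size()         # number of rows per group, nulls included
print(direct_count, named_count, group_size, sep='\n')
```

When a custom function really is needed on a datetime column, casting the column to object first, as in the EDIT to the question, is the workaround the asker arrived at.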
