Resampling boolean values in pandas


Question

I have run into a property that I find peculiar about resampling Booleans in pandas. Here is some time series data:

import pandas as pd
import numpy as np

dr = pd.date_range('01-01-2020 5:00', periods=10, freq='H')
df = pd.DataFrame({'Bools':[True,True,False,False,False,True,True,np.nan,np.nan,False],
                   "Nums":range(10)},
                  index=dr)

So the data looks like:

                     Bools  Nums
2020-01-01 05:00:00   True     0
2020-01-01 06:00:00   True     1
2020-01-01 07:00:00  False     2
2020-01-01 08:00:00  False     3
2020-01-01 09:00:00  False     4
2020-01-01 10:00:00   True     5
2020-01-01 11:00:00   True     6
2020-01-01 12:00:00    NaN     7
2020-01-01 13:00:00    NaN     8
2020-01-01 14:00:00  False     9

I would have thought I could do simple operations (like a sum) on the boolean column when resampling, but as written this fails:

>>> df.resample('5H').sum()

                    Nums
2020-01-01 05:00:00    10
2020-01-01 10:00:00    35

The "Bools" column is dropped. My impression was that this happens because the dtype of the column is object. Changing the dtype remedies the issue:

>>> r = df.resample('5H')
>>> copy = df.copy() #just doing this to preserve df for the example
>>> copy['Bools'] = copy['Bools'].astype(float)
>>> copy.resample('5H').sum()

                     Bools  Nums
2020-01-01 05:00:00    2.0    10
2020-01-01 10:00:00    2.0    35
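The dtype explanation above can be checked directly. The following is a minimal sketch, not part of the original question: it rebuilds the same frame (using the lowercase 'h' alias, an assumption made so the snippet also runs on recent pandas releases where 'H' is deprecated) and uses select_dtypes to show which columns pandas classifies as numeric before and after the cast. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as in the question; lowercase 'h' is used for the hourly alias
# because uppercase 'H' is deprecated in recent pandas releases.
dr = pd.date_range('2020-01-01 05:00', periods=10, freq='h')
df = pd.DataFrame({'Bools': [True, True, False, False, False,
                             True, True, np.nan, np.nan, False],
                   'Nums': range(10)},
                  index=dr)

# The NaN entries prevent pandas from using the plain bool dtype,
# so the mixed column falls back to object.
bools_dtype = df['Bools'].dtype               # object
clean_dtype = pd.Series([True, False]).dtype  # bool

# Columns that pandas classifies as numeric, before and after the cast.
numeric_before = list(df.select_dtypes(include='number').columns)
converted = df.assign(Bools=df['Bools'].astype(float))
numeric_after = list(converted.select_dtypes(include='number').columns)

print(bools_dtype, clean_dtype)
print(numeric_before, numeric_after)
```

The object column is not counted as numeric until it is cast, which is consistent with the frame-level sum skipping it in the output shown above. How a frame-level sum treats such a column depends on the pandas version, so the behavior in the question may not reproduce exactly on newer releases.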

But, oddly, you can still sum the Booleans by indexing the resample object, without changing the dtype:

>>> r = df.resample('5H')
>>> r['Bools'].sum()

2020-01-01 05:00:00    2
2020-01-01 10:00:00    2
Freq: 5H, Name: Bools, dtype: int64

Also, if the Booleans are the only column, you can still resample (even though the column is still object):

>>> df.drop(['Nums'],axis=1).resample('5H').sum()

                    Bools
2020-01-01 05:00:00      2
2020-01-01 10:00:00      2

What allows the latter two examples to work? Perhaps they are a little more explicit ("Please, I really want to resample this column!"), but I don't see why the original resample doesn't allow the operation if it can be done.

Recommended answer

Well, tracing the calls down shows:

df.resample('5H')['Bools'].sum == Groupby.sum (in pd.core.groupby.generic.SeriesGroupBy)

df.resample('5H').sum == sum (in pandas.core.resample.DatetimeIndexResampler)

Tracking groupby_function in groupby.py shows that it is equivalent to r.agg(lambda x: np.sum(x, axis=r.axis)), where r = df.resample('5H'), which outputs:

                     Bools  Nums  Nums2
2020-01-01 05:00:00      2    10     10
2020-01-01 10:00:00      2    35     35

Well, actually, it should have been r = df.resample('5H')['Bools'] (only for the case above).

Tracking down the _downsample function in resample.py shows that it is equivalent to df.groupby(r.grouper, axis=r.axis).agg(np.sum), which outputs:

                     Nums  Nums2
2020-01-01 05:00:00    10     10
2020-01-01 10:00:00    35     35
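The resample-to-groupby correspondence described above can be sketched with public API only. This is an illustration under stated assumptions, not the library's internal code path: pd.Grouper(freq=...) stands in for the internal r.grouper attribute, .sum() is used in place of .agg(np.sum), and only the numeric Nums column is included so that the result does not depend on how a given pandas version treats object columns.

```python
import pandas as pd

# Numeric part of the question's data; lowercase 'h' is used because
# uppercase 'H' is deprecated in recent pandas releases.
dr = pd.date_range('2020-01-01 05:00', periods=10, freq='h')
nums = pd.DataFrame({'Nums': range(10)}, index=dr)

# Frame-level resample sum, as in the question.
via_resample = nums.resample('5h').sum()

# The same 5-hour binning expressed as a groupby over a time Grouper
# (a public stand-in for the internal r.grouper, which is an assumption here).
via_groupby = nums.groupby(pd.Grouper(freq='5h')).sum()

print(via_resample)
print(via_groupby)
```

Both calls bin the index into the same 5-hour windows and produce the same sums for Nums, which matches the Nums values in the outputs above.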
