What's the equivalent of cut/qcut for pandas date fields?

Question

Update: starting with version 0.20.0, pandas cut/qcut DOES handle date fields. See What's New for more.

pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH14714, GH14798)
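As a minimal sketch of what that update allows, assuming pandas 0.20.0 or later: the dates below are hypothetical sample values, and the names `recd`, `edges`, `buckets` and `counts` are chosen only for illustration. `pd.cut` is given month-start bin edges built with `pd.date_range`.

```python
import pandas as pd

# Hypothetical sample dates, used only for illustration.
recd = pd.Series(pd.to_datetime([
    "2012-07-25", "2012-11-30", "2012-12-07", "2013-01-03", "2013-04-18",
]))

# Month-start bin edges that cover the sample ("MS" = month start).
edges = pd.date_range("2012-07-01", "2013-05-01", freq="MS")

# right=False makes each bin [start of month, start of next month).
buckets = pd.cut(recd, bins=edges, right=False)

# Count per monthly bin, kept in chronological (category) order.
counts = buckets.value_counts(sort=False)
print(counts)
```

The result is a categorical of `Interval` values rather than periods, so labels read as date ranges instead of `2012-07`-style month names.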

Original question: Pandas cut and qcut functions are great for 'bucketing' continuous data for use in pivot tables and so forth, but I can't see an easy way to get datetime axes in the mix. Frustrating since pandas is so great at all the time-related stuff!

Here's a simple example:

import numpy as np
import pandas as pd

def randomDates(size, start=int(134e7), end=int(137e7)):
    # Random Unix timestamps (seconds) in [start, end), converted to datetime64[s]
    return np.array(np.random.randint(start, end, size), dtype='datetime64[s]')

df = pd.DataFrame({'ship' : randomDates(10), 'recd' : randomDates(10),
                   'qty' : np.random.randint(0,10,10), 'price' : 100*np.random.random(10)})
df

     price      qty recd                ship
0    14.723510   3  2012-11-30 19:32:27 2013-03-08 23:10:12
1    53.535143   2  2012-07-25 14:26:45 2012-10-01 11:06:39
2    85.278743   7  2012-12-07 22:24:20 2013-02-26 10:23:20
3    35.940935   8  2013-04-18 13:49:43 2013-03-29 21:19:26
4    54.218896   8  2013-01-03 09:00:15 2012-08-08 12:50:41
5    61.404931   9  2013-02-10 19:36:54 2013-02-23 13:14:42
6    28.917693   1  2012-12-13 02:56:40 2012-09-08 21:14:45
7    88.440408   8  2013-04-04 22:54:55 2012-07-31 18:11:35
8    77.329931   7  2012-11-23 00:49:26 2012-12-09 19:27:40
9    46.540859   5  2013-03-13 11:37:59 2013-03-17 20:09:09

To bin by groups of price or quantity, I can use cut/qcut to bucket them:

df.groupby([pd.cut(df['qty'], bins=[0,1,5,10]), pd.qcut(df['price'],q=3)]).count()

                       price  qty recd ship
qty     price               
(0, 1]  [14.724, 46.541]   1   1   1   1
(1, 5]  [14.724, 46.541]   2   2   2   2
        (46.541, 61.405]   1   1   1   1
(5, 10] [14.724, 46.541]   1   1   1   1
        (46.541, 61.405]   2   2   2   2
         (61.405, 88.44]   3   3   3   3

But I can't see any easy way of doing the same thing with my 'recd' or 'ship' date fields. For example, I'd like to generate a similar table of counts broken down by (say) monthly buckets of recd and ship. It seems like resample() has all of the machinery to bucket into periods, but I can't figure out how to apply it here. The buckets (or levels) in the 'date cut' would be equivalent to a pandas.PeriodIndex, and I would then want to label each value of df['recd'] with the period it falls into.

So the kind of output I'm looking for would be something like:

ship    recd     count
2011-01 2011-01  1
        2011-02  3
        ...      ...
2011-02 2011-01  2
        2011-02  6
...     ...      ...

More generally, I'd like to be able to mix and match continuous or categorical variables in the output. Imagine df also contains a 'status' column with red/yellow/green values, then maybe I want to summarize counts by status, price bucket, ship and recd buckets, so:

ship    recd     price   status count
2011-01 2011-01  [0-10)   green     1
                            red     4
                 [10-20) yellow     2
                  ...      ...    ...
        2011-02  [0-10)  yellow     3
        ...      ...       ...    ...

As a bonus question, what's the simplest way to modify the groupby() result above to just contain a single output column called 'count'?

Recommended answer

Here's a solution using pandas.PeriodIndex (caveat: PeriodIndex doesn't seem to support time rules with a multiple > 1, such as '4M'). I think the answer to your bonus question is .size().

In [49]: df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
   ....:             pd.PeriodIndex(df.ship, freq='Q'),
   ....:             pd.cut(df['qty'], bins=[0,5,10]),
   ....:             pd.qcut(df['price'],q=2),
   ....:            ]).size()
Out[49]: 
                qty      price 
2012Q2  2013Q1  (0, 5]   [2, 5]    1
2012Q3  2013Q1  (5, 10]  [2, 5]    1
2012Q4  2012Q3  (5, 10]  [2, 5]    1
        2013Q1  (0, 5]   [2, 5]    1
                (5, 10]  [2, 5]    1
2013Q1  2012Q3  (0, 5]   (5, 8]    1
        2013Q1  (5, 10]  (5, 8]    2
2013Q2  2012Q4  (0, 5]   (5, 8]    1
        2013Q2  (0, 5]   [2, 5]    1
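The same idea can be written as a self-contained script. This is a sketch under stated assumptions, not the answer's exact code: it uses `Series.dt.to_period` in place of `pd.PeriodIndex`, a seeded generator so the run is repeatable, and illustrative names (`random_dates`, `monthly`, `mixed`) that do not appear in the original. It also shows one way to get a single column called `count`, by chaining `.size()` with `.reset_index(name="count")`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

def random_dates(size, start=1_340_000_000, end=1_370_000_000):
    # Random Unix timestamps in seconds, converted to pandas datetimes.
    return pd.to_datetime(rng.integers(start, end, size), unit="s")

df = pd.DataFrame({
    "ship": random_dates(10),
    "recd": random_dates(10),
    "qty": rng.integers(0, 10, 10),
    "price": 100 * rng.random(10),
})

# Monthly buckets for both date columns, with one output column "count".
monthly = (
    df.groupby([df["ship"].dt.to_period("M"), df["recd"].dt.to_period("M")])
    .size()
    .reset_index(name="count")
)

# Mixing a date bucket with a numeric bucket; observed=True keeps only
# the price bins that actually occur in the data.
mixed = (
    df.groupby(
        [df["ship"].dt.to_period("Q"),
         pd.cut(df["price"], bins=[0, 50, 100], include_lowest=True)],
        observed=True,
    )
    .size()
    .reset_index(name="count")
)

print(monthly)
print(mixed)
```

A categorical column such as the hypothetical 'status' from the question could be added to the list of grouping keys in the same way.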
