SQL-like window functions in PANDAS: Row Numbering in Python Pandas Dataframe


Problem description


I come from a SQL background and frequently use the following data processing step:

  1. Partition the table of data by one or more fields
  2. For each partition, add a row number to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending order

Example:

import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})
df
     data1        data2     key1    
0    1            1         a           
1    2            10        a        
2    2            2         a       
3    3            3         b       
4    3            30        a        

I'm looking for the pandas equivalent of this SQL window function:

RN = ROW_NUMBER() OVER (PARTITION BY Key1, Key2 ORDER BY Data1 ASC, Data2 DESC)


    data1        data2     key1    RN
0    1            1         a       1    
1    2            10        a       2 
2    2            2         a       3
3    3            3         b       1
4    3            30        a       4

I've tried the following, which works when there are no 'partitions':

def row_number(frame, orderby_columns, orderby_direction, name):
    # Sort in place by the requested columns, then number the rows from 0.
    frame.sort_values(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = list(range(len(frame.index)))

I tried to extend this idea to work with partitions (groups in pandas) but the following didn't work:

df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace = True)).reset_index()

def nf(x):
    x['rn'] = list(xrange(len(x.index)))

df1['rn1'] = df1.groupby('key1').apply(nf)

But I just get a lot of NaNs when I do this.

Ideally, there'd be a succinct way to replicate SQL's window function capability (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share the most idiomatic way to number rows like this in pandas?

Solution

You can do this by using groupby twice along with the rank method:

In [11]: g = df.groupby('key1')

Pass method='min' so that rows sharing the same data1 value within a group get the same RN:

In [12]: g['data1'].rank(method='min')
Out[12]:
0    1
1    2
2    2
3    1
4    4
dtype: float64

In [13]: df['RN'] = g['data1'].rank(method='min')
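
To make the effect of the method argument concrete, here is a small sketch comparing the tie-breaking options of Series.rank on the data1 values of the key1 == 'a' rows. The variable names are illustrative only and are not part of the original answer.

```python
import pandas as pd

# data1 values of the rows where key1 == 'a' in the example above
s = pd.Series([1, 2, 2, 3])

by_min = s.rank(method='min').tolist()      # ties share the lowest rank
by_average = s.rank().tolist()              # default: ties share the mean rank
by_first = s.rank(method='first').tolist()  # ties broken by order of appearance
by_dense = s.rank(method='dense').tolist()  # like 'min', but without gaps

print(by_min)      # [1.0, 2.0, 2.0, 4.0]
print(by_average)  # [1.0, 2.5, 2.5, 4.0]
print(by_first)    # [1.0, 2.0, 3.0, 4.0]
print(by_dense)    # [1.0, 2.0, 2.0, 3.0]
```

With method='min' the two tied rows both get 2 and the next row gets 4, which leaves room for the second ranking step to separate the ties.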

Then group by key1 and this RN, and add the descending rank of data2 within each of those groups:

In [14]: g1 = df.groupby(['key1', 'RN'])

In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0    0
1    0
2    1
3    0
4    0
dtype: float64

In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1

In [17]: df
Out[17]:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
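
For convenience, the steps above can be collected into one script. This is a sketch under two assumptions: a reasonably recent pandas version, and no two rows in a partition sharing both data1 and data2 (otherwise the default rank method would produce fractional values). The final integer cast is an addition, since rank returns floats.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# Step 1: rank data1 within each key1 partition; ties share the lowest rank.
df['RN'] = df.groupby('key1')['data1'].rank(method='min')

# Step 2: within each (key1, RN) tie group, rank data2 in descending order
# and add the zero-based offset so tied rows get consecutive numbers.
df['RN'] += df.groupby(['key1', 'RN'])['data2'].rank(ascending=False) - 1

# rank() returns floats; cast once the values are known to be whole numbers.
df['RN'] = df['RN'].astype(int)

print(df)
```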

It feels like there ought to be a native way to do this (there may well be!).
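
One commonly used alternative, which is not part of the answer above, is to sort by the ORDER BY columns and then number rows within each group with groupby(...).cumcount(). The helper below is a hypothetical sketch: add_row_number and its parameter names are made up for illustration, and it assumes the frame has a unique index so that the numbers can be aligned back to the original rows.

```python
import pandas as pd


def add_row_number(frame, partition_by, order_by, ascending, name='RN'):
    """Hypothetical helper mimicking ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)."""
    # Sort by the ORDER BY columns; kind='mergesort' requests a stable sort,
    # so fully tied rows keep their original relative order.
    ordered = frame.sort_values(by=order_by, ascending=ascending, kind='mergesort')
    # cumcount() numbers rows 0, 1, 2, ... within each group in their current order.
    numbers = ordered.groupby(partition_by).cumcount() + 1
    # assign() aligns on the index, so the original row order is preserved.
    return frame.assign(**{name: numbers})


df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

result = add_row_number(df,
                        partition_by=['key1'],
                        order_by=['data1', 'data2'],
                        ascending=[True, False])
print(result)
```

Unlike the two-step rank approach, this works for any number of ORDER BY columns without an extra groupby per column, and it yields integers directly.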
