How to get rows in a pandas data frame with maximal values in a column and keep the original index?


Problem description


I have a pandas data frame. In the first column it can have the same value several times (in other words, the values in the first column are not unique).

Whenever several rows contain the same value in the first column, I would like to keep only the one that has the maximal value in the third column. I almost found a solution:

import pandas

ls = []
ls.append({'c1':'a', 'c2':'a', 'c3':1})
ls.append({'c1':'a', 'c2':'c', 'c3':3})
ls.append({'c1':'a', 'c2':'b', 'c3':2})
ls.append({'c1':'b', 'c2':'b', 'c3':10})
ls.append({'c1':'b', 'c2':'c', 'c3':12})
ls.append({'c1':'b', 'c2':'a', 'c3':7})

df = pandas.DataFrame(ls, columns=['c1','c2','c3'])
print(df)
print('--------------------')
# argmax() gives the position of the largest c3 within each group; iloc selects by position
print(df.groupby('c1').apply(lambda g: g.iloc[g['c3'].argmax()]))

As a result I get:

  c1 c2  c3
0  a  a   1
1  a  c   3
2  a  b   2
3  b  b  10
4  b  c  12
5  b  a   7
--------------------
   c1 c2  c3
c1          
a   a  c   3
b   b  c  12

My problem is that I do not want c1 as the index. What I want is the following:

  c1 c2  c3
1  a  c   3
4  b  c  12

Solution

When calling df.groupby(...).apply(foo), the type of object returned by foo affects the way the results are melded together.

If you return a Series, the index of the Series becomes the columns of the final result, and the groupby key becomes the index (a bit of a mind-twister).

If instead you return a DataFrame, the final result uses the index of the DataFrame as index values, and the columns of the DataFrame as columns (very sensible).

So, you can arrange for the type of output you desire by converting your Series into a DataFrame.
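The difference between the two return types can be seen in a minimal sketch like the one below. It is not part of the original answer: it uses idxmax() and selects the non-key columns explicitly, both of which are assumptions made here so that the example behaves the same way on recent pandas releases.

```python
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
                   'c3': [1, 3, 2, 10, 12, 7]})

# Select the non-key columns explicitly so the applied function sees the
# same columns regardless of pandas version (an assumption for this sketch).
grouped = df.groupby('c1')[['c2', 'c3']]

# Returning a Series: the result has one row per group, indexed by the
# groupby key, and the original row labels are lost.
as_series = grouped.apply(lambda g: g.loc[g['c3'].idxmax()])

# Returning a one-row DataFrame: the original row labels survive as the
# inner level of the result's index, underneath the groupby key.
as_frame = grouped.apply(lambda g: g.loc[[g['c3'].idxmax()]])

print(as_series)
print(as_frame)
```

In the second result the groupby key sits in the outer index level, which is why the code further down drops level 0 with reset_index.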

With Pandas 0.13 you can use the to_frame().T method:

def maxrow(x, col):
    # idxmax() returns the index label of the row holding the maximum of col
    return x.loc[x[col].idxmax()].to_frame().T

result = df.groupby('c1').apply(maxrow, 'c3')
result = result.reset_index(level=0, drop=True)
print(result)

yields

  c1 c2  c3
1  a  c   3
4  b  c  12

In Pandas 0.12 or older, the equivalent would be:

def maxrow(x, col):
    ser = x.loc[x[col].idxmax()]
    df = pd.DataFrame({ser.name: ser}).T
    return df
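On recent pandas releases there is also a shorter route that avoids apply altogether. It is a different technique from the one in the answer above, sketched here under the assumption that the index labels of df are unique: take the per-group idxmax() of c3 and use the resulting labels to select rows from the original frame.

```python
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
                   'c3': [1, 3, 2, 10, 12, 7]})

# For each group in c1, idxmax() returns the original index label of the
# row with the largest c3 (the first such row if there are ties).
labels = df.groupby('c1')['c3'].idxmax()

# Selecting those labels from the original frame keeps the original index.
result = df.loc[labels]
print(result)
```

Because the rows are picked from df itself, c1 stays an ordinary column and no reset_index is needed.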


By the way, behzad.nouri's clever and elegant solution is quicker than mine for small DataFrames. However, the sort raises the time complexity from O(n) to O(n log n), so it becomes slower than the to_frame solution shown above when applied to larger DataFrames.
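For reference, the sort-based approach appears in the benchmark as using_sort. A rough modern equivalent might look like the sketch below, where sort_values stands in for the older df.sort; that substitution is an assumption about the pandas version in use, not something taken from the original post.

```python
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
                   'c3': [1, 3, 2, 10, 12, 7]})

# Sort by c3 so the largest value of each group comes last, then keep
# the last row of every group; tail() preserves the original row labels.
result = df.sort_values('c3').groupby('c1').tail(1)
print(result)
```

The sort is the O(n log n) step mentioned above; the groupby and tail that follow only filter the already sorted rows.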

Here is how I benchmarked it:

import pandas as pd
import numpy as np
import timeit


def reset_df_first(df):
    df2 = df.reset_index()
    result = df2.groupby('c1').apply(lambda x: x.loc[x['c3'].idxmax()])
    result.set_index(['index'], inplace=True)
    return result

def maxrow(x, col):
    result = x.loc[x[col].idxmax()].to_frame().T
    return result

def using_to_frame(df):
    result = df.groupby('c1').apply(maxrow, 'c3')
    result.reset_index(level=0, drop=True, inplace=True)
    return result

def using_sort(df):
    return df.sort_values('c3').groupby('c1', as_index=False).tail(1)


for N in (100, 1000, 2000):
    df = pd.DataFrame({'c1': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'},
                       'c2': {0: 'a', 1: 'c', 2: 'b', 3: 'b', 4: 'c', 5: 'a'},
                       'c3': {0: 1, 1: 3, 2: 2, 3: 10, 4: 12, 5: 7}})

    df = pd.concat([df]*N)
    df.reset_index(inplace=True, drop=True)

    timing = dict()
    for func in (reset_df_first, using_to_frame, using_sort):
        timing[func] = timeit.timeit('m.{}(m.df)'.format(func.__name__),
                              'import __main__ as m ',
                              number=10)

    print('For N = {}'.format(N))
    for func in sorted(timing, key=timing.get):
        print('{:<20}: {:<0.3g}'.format(func.__name__, timing[func]))
    print()

yields

For N = 100
using_sort          : 0.018
using_to_frame      : 0.0265
reset_df_first      : 0.0303

For N = 1000
using_to_frame      : 0.0358    \
using_sort          : 0.036     / this is roughly where the two methods cross over in terms of performance
reset_df_first      : 0.0432

For N = 2000
using_to_frame      : 0.0457
reset_df_first      : 0.0523
using_sort          : 0.0569

(reset_df_first was another possibility I tried.)
