Python: Pandas - Delete the first row by group

Question

I have the following large dataframe (df) that looks like this:

    ID     date        PRICE       
1   10001  19920103  14.500    
2   10001  19920106  14.500    
3   10001  19920107  14.500     
4   10002  19920108  15.125     
5   10002  19920109  14.500   
6   10002  19920110  14.500    
7   10003  19920113  14.500 
8   10003  19920114  14.500     
9   10003  19920115  15.000 

Question: What's the most efficient way to delete (or remove) the first row of each ID? I want this:

        ID     date     PRICE       
    2   10001  19920106  14.500    
    3   10001  19920107  14.500     
    5   10002  19920109  14.500   
    6   10002  19920110  14.500    
    8   10003  19920114  14.500     
    9   10003  19920115  15.000 

I can do a loop over each unique ID and remove the first row but I believe this is not very efficient.

Recommended answer

You could use groupby/transform to prepare a boolean mask which is True for the rows you want and False for the rows you don't want. Once you have such a boolean mask, you can select the sub-DataFrame using df.loc[mask]:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],
     'PRICE': [14.5, 14.5, 14.5, 15.125, 14.5, 14.5, 14.5, 14.5, 15.0],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115]},
    index = range(1,10)) 

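def mask_first(x):
    # Build an array of ones shaped like the group, then zero the first
    # position so that the first row of each group is excluded.
    result = np.ones_like(x)
    result[0] = 0
    return result

# transform() applies mask_first to each ID group and aligns the result
# with df's index; astype(bool) turns the 0/1 values into a boolean mask.
mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)
print(df.loc[mask])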

yields

      ID  PRICE      date
2  10001   14.5  19920106
3  10001   14.5  19920107
5  10002   14.5  19920109
6  10002   14.5  19920110
8  10003   14.5  19920114
9  10003   15.0  19920115
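
The same boolean-mask idea can also be sketched without a custom function. This is an alternative that is not part of the answer above: it assumes that groupby(...).cumcount(), which numbers the rows within each group starting at 0, is acceptable for building the mask. The name result is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],
     'PRICE': [14.5, 14.5, 14.5, 15.125, 14.5, 14.5, 14.5, 14.5, 15.0],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115]},
    index=range(1, 10))

# cumcount() labels the rows of each ID group 0, 1, 2, ... in their
# original order, so "> 0" is False only for the first row of each group.
mask = df.groupby('ID').cumcount() > 0

# Select every row except the first one of each ID.
result = df.loc[mask]
print(result)
```

Because the mask is aligned with df's index, the selected rows keep their original labels and order.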


Since you're interested in efficiency, here is a benchmark:

import timeit
import operator
import numpy as np
import pandas as pd

N = 10000
df = pd.DataFrame(
    {'ID': np.random.randint(100, size=(N,)),
     'PRICE': np.random.random(N),
     'date': np.random.random(N)}) 

def using_mask(df):
    def mask_first(x):
        result = np.ones_like(x)
        result[0] = 0
        return result

    mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)
    return df.loc[mask]

def using_apply(df):
    return df.groupby('ID').apply(lambda group: group.iloc[1:, 1:])

def using_apply_alt(df):
    return df.groupby('ID', group_keys=False).apply(lambda x: x[1:])

timing = dict()
for func in (using_mask, using_apply, using_apply_alt):
    timing[func] = timeit.timeit(
        '{}(df)'.format(func.__name__), 
        'from __main__ import df, {}'.format(func.__name__), number=100)

for func, t in sorted(timing.items(), key=operator.itemgetter(1)):
    print('{:16}: {:.2f}'.format(func.__name__, t))

reports

using_mask      : 0.85
using_apply_alt : 2.04
using_apply     : 3.70
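
One more candidate that may be worth adding to such a benchmark is DataFrame.duplicated, which by default marks every occurrence of a value after the first as True. The sketch below is an addition, not part of the original answer, and no timing is claimed for it. It only checks, on random data shaped like the benchmark's, that the result matches the explicit per-ID loop mentioned in the question. The function names using_duplicated and using_loop are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 10000
df = pd.DataFrame({
    'ID': rng.integers(100, size=N),
    'PRICE': rng.random(N),
    'date': rng.random(N),
})

def using_duplicated(df):
    # duplicated('ID') is False for the first occurrence of each ID and
    # True for every later one, so it keeps all rows but the first per ID.
    return df[df.duplicated('ID')]

def using_loop(df):
    # Reference version: find the index label of the first row of each ID
    # with an explicit loop, then drop those labels.
    first_labels = [df.index[df['ID'] == key][0] for key in df['ID'].unique()]
    return df.drop(index=first_labels)

result = using_duplicated(df)
expected = using_loop(df)

# True if both approaches select exactly the same rows in the same order.
print(result.equals(expected))
```

Unlike slicing inside apply, this mask does not depend on rows of the same ID being adjacent; "first" here means the first occurrence in the frame's current row order.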
