Remove pandas rows with duplicate indices


Problem description

How do I remove rows with duplicate index values?

In the weather DataFrame below, sometimes a scientist goes back and corrects observations, not by editing the erroneous rows but by appending a duplicate row to the end of the file.

I'm reading some automated weather data from the web (observations occur every 5 minutes and are compiled into monthly files for each weather station). After parsing a file, the DataFrame looks like:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date                                                                                      
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

Example of a duplicate case:

import datetime

import pandas

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
# date_range replaces the removed DatetimeIndex(start=..., end=..., freq=...) constructor
index = pandas.date_range(start=startdate, end=enddate, freq='h')
data1 = {'A': range(6), 'B': range(6)}
data2 = {'A': [20, -30, 40], 'B': [-50, 60, -70]}
df1 = pandas.DataFrame(data=data1, index=index)
df2 = pandas.DataFrame(data=data2, index=index[:3])
# DataFrame.append was removed in pandas 2.0; concat produces the same stacking
df3 = pandas.concat([df2, df1])

df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

So I need df3 to end up as:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the groupby or pivot (or ???) statements to make that work.

Recommended answer

I would suggest using the duplicated method on the pandas Index itself (df3.index.duplicated). While all the other methods work, the currently accepted answer is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop
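
As a rough check that the three timed expressions agree on this sample, the sketch below rebuilds df3 and compares the results. It assumes a recent pandas, so it uses pd.date_range and pd.concat rather than the constructor and append call from the question; the names via_drop, via_groupby and via_mask are only illustrative.

```python
import pandas as pd

# Rebuild the question's sample data (pd.concat stands in for DataFrame.append)
index = pd.date_range("2001-01-01 00:00", periods=6, freq="h")
df1 = pd.DataFrame({"A": range(6), "B": range(6)}, index=index)
df2 = pd.DataFrame({"A": [20, -30, 40], "B": [-50, 60, -70]}, index=index[:3])
df3 = pd.concat([df2, df1])

# The three approaches from the timings, each keeping the first occurrence
via_drop = (
    df3.reset_index()
    .drop_duplicates(subset="index", keep="first")
    .set_index("index")
)
via_groupby = df3.groupby(df3.index).first()
via_mask = df3[~df3.index.duplicated(keep="first")]

# reset_index() leaves the restored index named "index"; clear it for comparison
via_drop.index.name = None

print(via_mask.equals(via_drop))
print(via_mask.sort_index().equals(via_groupby))
```

One difference to keep in mind: groupby(...).first() returns the first non-null value per column within each group, so on data with missing values it can combine values from several rows, whereas the boolean mask always keeps whole rows.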

Note that you can keep the last element by changing the keep argument to 'last'.
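
For the question's scenario, where corrections are appended after the rows they replace, keep='last' is the variant that retains the corrected values. A minimal sketch, assuming a recent pandas (pd.concat is used because DataFrame.append was removed in pandas 2.0); the name deduped is illustrative:

```python
import pandas as pd

# Sample data from the question: df2 holds stale rows, df1 holds the later rows
index = pd.date_range("2001-01-01 00:00", periods=6, freq="h")
df1 = pd.DataFrame({"A": range(6), "B": range(6)}, index=index)
df2 = pd.DataFrame({"A": [20, -30, 40], "B": [-50, 60, -70]}, index=index[:3])
df3 = pd.concat([df2, df1])

# duplicated(keep="last") marks every occurrence except the last one per label;
# inverting the mask keeps only the bottom-most row for each timestamp
deduped = df3[~df3.index.duplicated(keep="last")].sort_index()

print(deduped)
```

The sort_index() call is only there to restore chronological order in case the surviving rows are not already sorted.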

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
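
Paul's example is not reproduced here, so the sketch below uses a small made-up MultiIndex frame instead. The station codes, times, column name and the variables obs, latest and by_group are all hypothetical; only the two expressions mirror the ones timed above.

```python
import pandas as pd

# Hypothetical data: (station, time) pairs, with one correction appended at the end
mi = pd.MultiIndex.from_tuples(
    [("KPDX", "00:00"), ("KPDX", "00:05"), ("KSEA", "00:00"), ("KPDX", "00:05")],
    names=["Sta", "Time"],
)
obs = pd.DataFrame({"Temp": [4, 4, 6, 3]}, index=mi)

# Mask approach: a row counts as a duplicate only if every index level matches
latest = obs[~obs.index.duplicated(keep="last")]

# groupby approach: group on all index levels and take the last value per group
by_group = obs.groupby(level=obs.index.names).last()

print(latest)
print(by_group)
```

The mask preserves the original row order, while groupby returns its groups sorted by key, so the two results line up only after latest.sort_index().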
