Remove pandas rows with duplicate indices

Problem description

How to remove rows with duplicate index values?

In the weather DataFrame below, sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending a duplicate row to the end of a file.

I'm reading some automated weather data from the web (observations occur every 5 minutes and are compiled into monthly files for each weather station). After parsing a file, the DataFrame looks like:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date                                                                                      
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

Example of a duplicate case:

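import datetime

import pandas

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
# pandas.date_range replaces the DatetimeIndex(start=..., end=..., freq=...)
# constructor form, which pandas 1.0 and later no longer accept.
# 'h' is the hourly alias ('H' is deprecated since pandas 2.2).
index = pandas.date_range(start=startdate, end=enddate, freq='h')
data1 = {'A' : range(6), 'B' : range(6)}
data2 = {'A' : [20, -30, 40], 'B' : [-50, 60, -70]}
df1 = pandas.DataFrame(data=data1, index=index)
df2 = pandas.DataFrame(data=data2, index=index[:3])
# DataFrame.append was removed in pandas 2.0; pandas.concat is its replacement.
df3 = pandas.concat([df2, df1])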

df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2

And so I need df3 to eventually become:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the group_by or pivot (or ???) statements to make that work.

Solution

I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep='first')]
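
To make the one-liner easier to follow, here is a minimal sketch that prints the boolean mask before applying it. It rebuilds the question's sample frame under two assumptions of mine: `pd.concat` stands in for `DataFrame.append`, and the hourly index comes from `pd.date_range`. The names `mask` and `deduped` are only illustrative.

```python
import pandas as pd

# Rebuild the question's sample frame. pd.concat is used here because
# DataFrame.append is not available in pandas 2.0 and later.
index = pd.date_range("2001-01-01 00:00", periods=6, freq=pd.Timedelta(hours=1))
df1 = pd.DataFrame({"A": range(6), "B": range(6)}, index=index)
df2 = pd.DataFrame({"A": [20, -30, 40], "B": [-50, 60, -70]}, index=index[:3])
df3 = pd.concat([df2, df1])

# True marks every repeat of an index label after its first occurrence.
mask = df3.index.duplicated(keep="first")
print(mask.tolist())

# Negating the mask keeps exactly one row per index label.
deduped = df3[~mask]
print(deduped)
```

The mask is a plain boolean array aligned with the rows of `df3`, so the selection keeps the original row order and does not sort anything.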

While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop
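
As a sanity check on the comparison, the following sketch runs the three timed expressions on the sample frame and prints what each one keeps. The frame is rebuilt with `pd.concat` and `pd.date_range` (my substitutions for the question's construction code), and the variable names are hypothetical.

```python
import pandas as pd

index = pd.date_range("2001-01-01 00:00", periods=6, freq=pd.Timedelta(hours=1))
df1 = pd.DataFrame({"A": range(6), "B": range(6)}, index=index)
df2 = pd.DataFrame({"A": [20, -30, 40], "B": [-50, 60, -70]}, index=index[:3])
df3 = pd.concat([df2, df1])

# 1. Move the unnamed index into a column called "index", drop repeats, move it back.
via_drop = (
    df3.reset_index()
    .drop_duplicates(subset="index", keep="first")
    .set_index("index")
)

# 2. Group rows by index label and take the first entry of each group.
via_groupby = df3.groupby(df3.index).first()

# 3. Boolean mask built from the index itself.
via_mask = df3[~df3.index.duplicated(keep="first")]

for name, frame in [("drop_duplicates", via_drop),
                    ("groupby", via_groupby),
                    ("duplicated", via_mask)]:
    print(name, frame["A"].tolist(), frame["B"].tolist())
```

On this data all three agree. They are not interchangeable in general, though: `groupby` returns the groups sorted by key, and `first()` takes the first non-missing value per column, so frames containing NaN can come out differently than with the mask.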

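Note that you can keep the last element instead by changing the keep argument to 'last'. That is the variant the question calls for, since the corrected rows are appended after the originals.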
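
A short sketch of the `keep='last'` variant, assuming the frame is laid out as in the question's printed `df3` (original rows first, corrected rows appended at the end). The names `original` and `corrections` are mine, and the trailing `sort_index()` is an extra step I added to restore chronological order.

```python
import pandas as pd

index = pd.date_range("2001-01-01 00:00", periods=6, freq=pd.Timedelta(hours=1))

# Rows as first recorded, followed by corrections for the first three hours.
original = pd.DataFrame(
    {"A": [20, -30, 40, 3, 4, 5], "B": [-50, 60, -70, 3, 4, 5]}, index=index
)
corrections = pd.DataFrame({"A": [0, 1, 2], "B": [0, 1, 2]}, index=index[:3])
df3 = pd.concat([original, corrections])

# keep="last" flags every occurrence of a label except the final one,
# so the appended corrections win over the rows they replace.
deduped = df3[~df3.index.duplicated(keep="last")]

# The surviving corrections still sit at the bottom; sort to restore time order.
result = deduped.sort_index()
print(result)
```

Without the sort, the rows for 03:00 through 05:00 would come before the corrected rows for 00:00 through 02:00, because the mask preserves the order in which rows appear in the frame.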

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
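
The frame used in the MultiIndex timings is not reproduced here, so the sketch below uses a small hypothetical frame with a two-level (station, time) index just to show that the same mask works when a row is identified by a tuple of labels.

```python
import pandas as pd

# Hypothetical data: the ("KPDX", "00:00") observation appears twice,
# and the second copy is the corrected reading.
idx = pd.MultiIndex.from_tuples(
    [("KPDX", "00:00"), ("KPDX", "00:05"), ("KSEA", "00:00"), ("KPDX", "00:00")],
    names=["Sta", "Time"],
)
df = pd.DataFrame({"Temp": [4, 4, 6, 5]}, index=idx)

# A row counts as a duplicate only if every index level matches.
via_mask = df[~df.index.duplicated(keep="last")]
print(via_mask)

# The groupby form from the timing above, for comparison; it sorts by key.
via_groupby = df.groupby(level=df.index.names).last()
print(via_groupby)
```

Both forms keep the corrected reading for the repeated key. They differ in row order, since the mask leaves the rows where they were while `groupby` sorts them.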
