删除具有重复索引的行(Pandas DataFrame和TimeSeries) [英] Remove rows with duplicate indices (Pandas DataFrame and TimeSeries)

查看:463
本文介绍了删除具有重复索引的行(Pandas DataFrame和TimeSeries)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在从网络上读取一些自动气象数据.观测每5分钟发生一次,并被汇总到每个气象站的月度文件中.解析完文件后,DataFrame看起来像这样:

I'm reading some automated weather data from the web. The observations occur every 5 minutes and are compiled into monthly files for each weather station. Once I'm done parsing a file, the DataFrame looks something like this:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date                                                                                      
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

我遇到的问题是,有时科学家会回过头来更正观察结果-不是通过编辑错误的行,而是通过将重复的行附加到文件末尾来进行的.这种情况的简单示例如下所示:

The problem I'm having is that sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending a duplicate row to the end of a file. Simple example of such a case is illustrated below:

import pandas 
import datetime
startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
index = pandas.DatetimeIndex(start=startdate, end=enddate, freq='H')
data1 = {'A' : range(6), 'B' : range(6)}
data2 = {'A' : [20, -30, 40], 'B' : [-50, 60, -70]}
df1 = pandas.DataFrame(data=data1, index=index)
df2 = pandas.DataFrame(data=data2, index=index[:3])
df3 = df2.append(df1)
df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2

所以我需要df3才能成为:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

我认为添加一列行号(df3['rownum'] = range(df3.shape[0]))可以帮助我为DatetimeIndex的任何值选择最底端的行,但是我一直想弄清楚group_by(或???)语句使之起作用.

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select out the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the group_by or pivot (or ???) statements to make that work.

推荐答案

我建议使用虽然所有其他方法都可以使用,但是对于所提供的示例,当前接受的答案的效果最低.此外,虽然 groupby方法的性能稍差,但我发现重复的方法更具可读性.

While all the other methods work, the currently accepted answer is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

使用提供的示例数据:

>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop

请注意,您可以通过更改keep参数来保留最后一个元素.

Note that you can keep the last element by changing the keep argument.

还应注意,该方法也适用于MultiIndex(使用 Paul的示例中指定的df1 ):

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop

这篇关于删除具有重复索引的行(Pandas DataFrame和TimeSeries)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆