Merge DataFrames and discard duplicate values

Question

I'm collecting time-indexed data from several files, and the files sometimes overlap:

import pandas as pd

df1 = pd.DataFrame([1, -1, -3], columns=['A'], index=pd.date_range('2000-01-01', periods=3))
df2 = pd.DataFrame([-3, 10, 1], columns=['A'], index=pd.date_range('2000-01-03', periods=3))
pd.concat([df1, df2])

            A
2000-01-01  1
2000-01-02 -1
2000-01-03 -3

             A
2000-01-03  -3
2000-01-04  10
2000-01-05   1

             A
2000-01-01   1
2000-01-02  -1
2000-01-03  -3
2000-01-03  -3
2000-01-04  10
2000-01-05   1

1) How can I clean up and remove the duplicate rows (here, 2000-01-03)?

2) More generally, is there a faster or smarter way with pandas to read and merge multiple CSV files than doing it manually, like this:

import glob

L = []
for f in glob.glob('*.csv'):
    L.append(pd.read_csv(f, ...))
fulldata = pd.concat(L)                   # this can be time consuming
fulldata.remove_duplicate_lines()         # pseudocode (no such pandas method); this can be time consuming too

Recommended answer

If I understand correctly, you could use pd.concat and then drop_duplicates:

In [104]: pd.concat([df1, df2]).drop_duplicates()
Out[104]: 
             A
2000-01-01   1
2000-01-02  -1
2000-01-03  -3
2000-01-04  10

Edit

You are right, that method does not work properly here because it drops rows by value, not by index. To deduplicate on the index, you can use duplicated on the index:

df = pd.concat([df1, df2])
df[~df.index.duplicated()]

In [107]: df[~df.index.duplicated()]
Out[107]: 
             A
2000-01-01   1
2000-01-02  -1
2000-01-03  -3
2000-01-04  10
2000-01-05   1
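As a minimal sketch of why the distinction matters, the snippet below reuses df1 and df2 from the question (the names by_value, by_index and by_index_last are only illustrative). drop_duplicates compares column values, so it also removes 2000-01-05, whose value 1 already appeared on 2000-01-01. The index mask removes only the repeated timestamp. Passing keep='last' to Index.duplicated keeps the row from the later frame instead of the earlier one, which may be preferable if later files are assumed to hold corrected data.

```python
import pandas as pd

df1 = pd.DataFrame([1, -1, -3], columns=['A'], index=pd.date_range('2000-01-01', periods=3))
df2 = pd.DataFrame([-3, 10, 1], columns=['A'], index=pd.date_range('2000-01-03', periods=3))
df = pd.concat([df1, df2])

# Value-based: rows whose column values repeat are dropped, whatever their index.
by_value = df.drop_duplicates()

# Index-based: only repeated timestamps are dropped; the first occurrence is kept.
by_index = df[~df.index.duplicated(keep='first')]

# Same idea, but keeping the last occurrence of each timestamp.
by_index_last = df[~df.index.duplicated(keep='last')]

print(len(by_value), len(by_index), len(by_index_last))
```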

Alternatively, you could adapt the first method: call reset_index first, then call drop_duplicates with the subset argument set to the index column, and finally restore the index with set_index:

 pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index').set_index('index')

In [118]: pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index').set_index('index')
Out[118]: 
             A
index         
2000-01-01   1
2000-01-02  -1
2000-01-03  -3
2000-01-04  10
2000-01-05   1
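For the second part of the question, the index-based approach can be combined with the file loop. The following is only a sketch, not a benchmarked solution: it assumes each CSV stores the timestamp in its first column, the helper name read_and_merge and the file names part1.csv and part2.csv are hypothetical, and it is not claimed to be faster than the manual loop, only more compact.

```python
import glob
import os
import tempfile

import pandas as pd


def read_and_merge(pattern):
    """Read every CSV matching pattern and keep the first row per timestamp."""
    # Assumption: the first column of each file holds the timestamps.
    frames = (
        pd.read_csv(path, index_col=0, parse_dates=True)
        for path in sorted(glob.glob(pattern))
    )
    merged = pd.concat(frames)
    # Drop repeated timestamps by index, not by value.
    return merged[~merged.index.duplicated(keep='first')].sort_index()


# Demo with two small overlapping files written to a temporary directory.
df1 = pd.DataFrame([1, -1, -3], columns=['A'], index=pd.date_range('2000-01-01', periods=3))
df2 = pd.DataFrame([-3, 10, 1], columns=['A'], index=pd.date_range('2000-01-03', periods=3))

with tempfile.TemporaryDirectory() as tmp:
    df1.to_csv(os.path.join(tmp, 'part1.csv'))
    df2.to_csv(os.path.join(tmp, 'part2.csv'))
    fulldata = read_and_merge(os.path.join(tmp, '*.csv'))

print(fulldata)
```

Sorting the file names makes "first occurrence" deterministic, so when two files overlap the row from the alphabetically earlier file is kept.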
