Removing duplicated rows but keep the ones with a particular value in one column (pandas python)


Problem description


I would like to do the following:

If two rows have exactly the same value in 3 columns ("ID","symbol", and "date") and have either "X" or "T" in one column ("message"), then remove both of these rows. However, if two rows have the same value in the same 3 columns but a value different than "X" or "T" in the other column, then leave intact.

Here is an example of my data frame:

import pandas as pd

df = pd.DataFrame({
    "ID": ["AA-1", "AA-1", "C-0", "BB-2", "BB-2"],
    "symbol": ["A", "A", "C", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/20/2013", "06/25/2014", "06/25/2015"],
    "message": ["T", "X", "T", "", ""],
})

Note that the first two rows have the same values in the columns "ID", "symbol", and "date", and have "T" and "X" in the column "message". I would like to remove these two rows.

However, the last two rows have the same value in columns "ID", "symbol", and "date", but a blank (different than "X" or "T") in the column "message", so they should be left intact.

I am interested in applying the function to a large dataset with several million rows. What I have tried so far consumes all my memory.

Thank you, I appreciate any help.

Solution

This might work for you:

vals = ['X', 'T']
pd.concat([df[~df.message.isin(vals)], df[df.message.isin(vals)].loc[~df.duplicated(subset=['ID', 'date', 'symbol'], keep=False), :]])

     ID        date message symbol
3  BB-2  06/25/2014              B
4  BB-2  06/25/2015              B
2   C-0  06/20/2013       T      C

It's reasonably fast:

%%timeit
pd.concat([df[~df.message.isin(['X', 'T'])], df[df.message.isin(['X', 'T'])].loc[~df.duplicated(subset=['ID', 'date', 'symbol'], keep=False), :]])
100 loops, best of 3: 1.99 ms per loop

%%timeit
df.groupby(['ID','date','symbol']).filter(lambda x: ~x.message.isin(['T','X']).all())
100 loops, best of 3: 2.71 ms per loop

The alternative was giving indexing errors.
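As a sketch of the same idea without the intermediate frames that `pd.concat` builds, the two conditions can be combined into a single boolean mask. This is not part of the original answer; the names `keys`, `is_dup`, `is_flagged`, and `result` are chosen here for illustration. It should select the same rows as the `pd.concat` expression above, only in their original order, and may be gentler on memory for a frame with millions of rows, though that has not been measured here.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "ID": ["AA-1", "AA-1", "C-0", "BB-2", "BB-2"],
    "symbol": ["A", "A", "C", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/20/2013", "06/25/2014", "06/25/2015"],
    "message": ["T", "X", "T", "", ""],
})

vals = ["X", "T"]
keys = ["ID", "symbol", "date"]  # illustrative name for the key columns

# True for rows whose key columns also appear in at least one other row
is_dup = df.duplicated(subset=keys, keep=False)
# True for rows carrying one of the flagged messages
is_flagged = df["message"].isin(vals)

# Drop only the rows that are both duplicated and flagged
result = df[~(is_dup & is_flagged)]
print(result)
```

On the sample data this keeps the rows with index 2, 3, and 4. Both masks are plain boolean Series with the same index as `df`, so no index alignment is involved when they are combined.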
