Removing duplicated rows but keep the ones with a particular value in one column (pandas python)
Problem description
I would like to do the following:
If two rows have exactly the same values in three columns ("ID", "symbol", and "date") and have either "X" or "T" in another column ("message"), then remove both of these rows. However, if two rows have the same values in those three columns but a value different from "X" or "T" in the "message" column, then leave them intact.
Here is an example of my data frame:
df = pd.DataFrame({"ID": ["AA-1", "AA-1", "C-0", "BB-2", "BB-2"],
                   "symbol": ["A", "A", "C", "B", "B"],
                   "date": ["06/24/2014", "06/24/2014", "06/20/2013", "06/25/2014", "06/25/2015"],
                   "message": ["T", "X", "T", "", ""]})
Note that the first two rows have the same values in the columns "ID", "symbol", and "date", and have "T" and "X" in the column "message". I would like to remove these two rows.
However, the last two rows have the same values in the columns "ID" and "symbol", but a blank value (different from "X" or "T") in the column "message", so they should be kept.
I am interested in applying this to a large dataset with several million rows. What I have tried so far consumes all my memory.
Thank you, I appreciate any help.
Solution
This might work for you:
vals = ['X', 'T']
pd.concat([df[~df.message.isin(vals)],
           df[df.message.isin(vals)].loc[~df.duplicated(subset=['ID', 'date', 'symbol'], keep=False), :]])

     ID        date message symbol
3  BB-2  06/25/2014              B
4  BB-2  06/25/2015              B
2   C-0  06/20/2013       T      C
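Since the question mentions running out of memory on millions of rows, the same filter can be expressed as a single boolean mask, which avoids building two intermediate frames and concatenating them, and also preserves the original row order. This is a sketch of an equivalent approach, not taken from the original answer:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["AA-1", "AA-1", "C-0", "BB-2", "BB-2"],
                   "symbol": ["A", "A", "C", "B", "B"],
                   "date": ["06/24/2014", "06/24/2014", "06/20/2013",
                            "06/25/2014", "06/25/2015"],
                   "message": ["T", "X", "T", "", ""]})

vals = ['X', 'T']
# Drop a row only when its message is "X"/"T" AND its (ID, date, symbol)
# combination occurs more than once; keep everything else.
drop = df.message.isin(vals) & df.duplicated(subset=['ID', 'date', 'symbol'], keep=False)
result = df[~drop]
```

On the sample frame this keeps rows 2, 3 and 4, matching the concat result above.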
It's reasonably fast:
%%timeit
pd.concat([df[~df.message.isin(['X', 'T'])],
           df[df.message.isin(['X', 'T'])].loc[~df.duplicated(subset=['ID', 'date', 'symbol'], keep=False), :]])
100 loops, best of 3: 1.99 ms per loop

%%timeit
df.groupby(['ID', 'date', 'symbol']).filter(lambda x: ~x.message.isin(['T', 'X']).all())
100 loops, best of 3: 2.71 ms per loop
The alternative was giving indexing errors.
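One caveat about the groupby variant timed above: it drops any group whose messages are all "T"/"X", including single-row groups, so on the sample it would also remove the lone C-0 row. A sketch of an adjusted filter (my modification, not from the original answer) that matches the concat result on this sample by always keeping single-row groups:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["AA-1", "AA-1", "C-0", "BB-2", "BB-2"],
                   "symbol": ["A", "A", "C", "B", "B"],
                   "date": ["06/24/2014", "06/24/2014", "06/20/2013",
                            "06/25/2014", "06/25/2015"],
                   "message": ["T", "X", "T", "", ""]})

# Keep a group if it has only one row, or if not all of its messages are T/X.
result = (df.groupby(['ID', 'date', 'symbol'])
            .filter(lambda g: len(g) == 1 or not g.message.isin(['T', 'X']).all()))
```

Note this variant keeps an entire duplicated group whenever any of its messages is outside {"T", "X"}, whereas the concat approach drops only the individual "T"/"X" rows; the two agree on the sample data here.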