Remove lines from a tab file according to conditions in another tab file in Python


Question

Hello, I have two tab files such as:

file1.txt

Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq2(-) SeqA
Cluster1 Seq3(+) SeqB
Cluster1 Seq300(+) SeqB
Cluster1 Seq90(+) SeqL
Cluster1 Seq90(+) SeqO
Cluster1 Seq2(-) SeqC
Cluster2 Seq8(-) SeqY
Cluster2 Seq8(-) SeqH
Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY
Cluster3 Seq10(+) SeqK
Cluster3 Seq10(+) SeqS
Cluster3 Seq10(+) SeqT
Cluster4 Seq300(+) SeqB

file2.txt

Clustername Names
Cluster1    SeqA
Cluster1    Seq1(+)
Cluster1    SeqC
Cluster1    Seq2(-)
Cluster1    SeqO
Cluster1    Seq3(+)
Cluster1    Seq90(+)
Cluster1    SeqB
Cluster1    SeqG
Cluster2    Seq8(-)
Cluster2    SeqY
Cluster2    SeqH
Cluster3    Seq10(+)
Cluster3    SeqK
Cluster4    SeqB
Cluster4    Seq300(+)

As you can see in file2.txt, SeqL is not present in Cluster1, so I want to remove the line Cluster1 Seq90(+) SeqL from file1.txt.

Seq300(+) is not present in Cluster1 either, so I remove the line:

Cluster1 Seq300(+) SeqB

from file1.txt.

The same goes for:

Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY

In file2.txt there is neither SeqP nor Seq79(-) in Cluster2, so I remove the lines:

Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY

from file1.txt.

The same goes for:

Cluster3 Seq10(+) SeqS
Cluster3 Seq10(+) SeqT

Because SeqS and SeqT are not in Cluster3 in file2.txt, I remove the following two lines from file1.txt:

Cluster3 Seq10(+) SeqS
Cluster3 Seq10(+) SeqT

In the end I should get a new file1.txt such as:

Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq2(-) SeqA
Cluster1 Seq3(+) SeqB
Cluster1 Seq90(+) SeqO
Cluster1 Seq2(-) SeqC
Cluster2 Seq8(-) SeqY
Cluster2 Seq8(-) SeqH
Cluster3 Seq10(+) SeqK
Cluster4 Seq300(+) SeqB

Recommended answer

Use DataFrame.merge + DataFrame.reindex to get the original columns:

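# Each inner merge keeps only rows whose Seqname1 (then Seqname2)
# is listed in df2 under the same Clustername.
new_df = (df1.merge(df2, left_on=['Clustername', 'Seqname1'], right_on=['Clustername', 'Names'])
             .merge(df2, left_on=['Clustername', 'Seqname2'], right_on=['Clustername', 'Names'])
             .reindex(columns=df1.columns))  # drop the extra Names columns added by the merges
print(new_df)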

Output

  Clustername   Seqname1 Seqname2
0    Cluster1    Seq1(+)     SeqA
1    Cluster1    Seq2(-)     SeqA
2    Cluster1    Seq2(-)     SeqC
3    Cluster1    Seq3(+)     SeqB
4    Cluster1   Seq90(+)     SeqO
5    Cluster2    Seq8(-)     SeqY
6    Cluster2    Seq8(-)     SeqH
7    Cluster3   Seq10(+)     SeqK
8    Cluster4  Seq300(+)     SeqB
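
The answer's snippet assumes that df1 and df2 already exist. The following is a minimal, self-contained sketch of how they might be loaded and filtered. It assumes the fields are separated by tabs or spaces and that no field contains whitespace itself. The inline strings are a shortened stand-in for file1.txt and file2.txt, and the output file name in the final comment is hypothetical.

```python
import io

import pandas as pd

# Shortened stand-ins for file1.txt and file2.txt (a subset of the question's rows).
FILE1 = """\
Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq300(+) SeqB
Cluster1 Seq90(+) SeqL
Cluster1 Seq90(+) SeqO
Cluster2 Seq8(-) SeqP
Cluster4 Seq300(+) SeqB
"""

FILE2 = """\
Clustername Names
Cluster1 SeqA
Cluster1 Seq1(+)
Cluster1 SeqO
Cluster1 Seq90(+)
Cluster1 SeqB
Cluster2 Seq8(-)
Cluster4 SeqB
Cluster4 Seq300(+)
"""

# sep=r"\s+" splits on any run of tabs or spaces.
# For real files, pass the path instead of the StringIO object.
df1 = pd.read_csv(io.StringIO(FILE1), sep=r"\s+")
df2 = pd.read_csv(io.StringIO(FILE2), sep=r"\s+")

# Same two inner merges as in the answer: a row survives only if both of its
# sequence names are listed in df2 under the same cluster.
new_df = (
    df1.merge(df2, left_on=["Clustername", "Seqname1"], right_on=["Clustername", "Names"])
       .merge(df2, left_on=["Clustername", "Seqname2"], right_on=["Clustername", "Names"])
       .reindex(columns=df1.columns)
)
print(new_df)

# To write the result back as a tab file (hypothetical output name):
# new_df.to_csv("file1_filtered.txt", sep="\t", index=False)
```

Note that if file2.txt contained the same (Clustername, Names) pair more than once, the merges would duplicate the matching rows. Calling df2.drop_duplicates() first would avoid that.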


Solution for n seqname columns:

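# Number the rows within each cluster so every original row has a unique (Clustername, aux) key
df1['aux'] = df1.groupby('Clustername').cumcount()

# melt to long format, keep only names listed in df2, keep rows where every
# seqname survived (df1 now has 2 non-seqname columns), then pivot back to wide format
new_df = (df1.melt(['Clustername', 'aux'], var_name='Seq')
             .merge(df2, left_on=['Clustername', 'value'], right_on=['Clustername', 'Names'])
             .groupby(['Clustername', 'aux'])
             .filter(lambda x: x.value.size >= (len(df1.columns) - 2))
             .pivot_table(index=['Clustername', 'aux'], columns='Seq', values='value', aggfunc=''.join)
             .reset_index()
             .drop('aux', axis=1)
             .rename_axis(columns=None))

print(new_df)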

Output

  Clustername   Seqname1 Seqname2
0    Cluster1    Seq1(+)     SeqA
1    Cluster1    Seq2(-)     SeqA
2    Cluster1    Seq3(+)     SeqB
3    Cluster1   Seq90(+)     SeqO
4    Cluster1    Seq2(-)     SeqC
5    Cluster2    Seq8(-)     SeqY
6    Cluster2    Seq8(-)     SeqH
7    Cluster3   Seq10(+)     SeqK
8    Cluster4  Seq300(+)     SeqB
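
For comparison, the same rule can be sketched without pandas. This is not part of the answer above: it is a plain-Python alternative that builds a set of (cluster, name) pairs from file2.txt and keeps a row of file1.txt only when every sequence name in it belongs to that set. It works for any number of seqname columns and preserves the original row order. The helper names read_table and filter_rows are hypothetical, and the inline data is a shortened stand-in for the two files.

```python
import io


def read_table(handle):
    """Return (header, rows) from a tab- or space-separated text stream."""
    lines = [line.split() for line in handle if line.strip()]
    return lines[0], lines[1:]


def filter_rows(rows, allowed):
    """Keep rows whose sequence names are all listed under the row's cluster."""
    return [row for row in rows
            if all((row[0], name) in allowed for name in row[1:])]


# Shortened stand-ins for file1.txt and file2.txt; use open("file1.txt") for real files.
file1 = io.StringIO(
    "Clustername Seqname1 Seqname2\n"
    "Cluster1 Seq1(+) SeqA\n"
    "Cluster1 Seq90(+) SeqL\n"
    "Cluster2 Seq8(-) SeqY\n"
    "Cluster2 Seq79(-) SeqY\n"
    "Cluster3 Seq10(+) SeqS\n"
)
file2 = io.StringIO(
    "Clustername Names\n"
    "Cluster1 SeqA\n"
    "Cluster1 Seq1(+)\n"
    "Cluster1 Seq90(+)\n"
    "Cluster2 Seq8(-)\n"
    "Cluster2 SeqY\n"
    "Cluster3 Seq10(+)\n"
    "Cluster3 SeqK\n"
)

header, rows = read_table(file1)
_, pairs = read_table(file2)
allowed = {(cluster, name) for cluster, name in pairs}

kept = filter_rows(rows, allowed)
for row in [header] + kept:
    print("\t".join(row))
```

A set lookup is constant time on average, so the cost grows linearly with the number of lines in the two files.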
