Remove lines from a tab file according to conditions in another tab file in Python
Problem description
Hello, I have two tab-delimited files, such as:
file1.txt
Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq2(-) SeqA
Cluster1 Seq3(+) SeqB
Cluster1 Seq300(+) SeqB
Cluster1 Seq90(+) SeqL
Cluster1 Seq90(+) SeqO
Cluster1 Seq2(-) SeqC
Cluster2 Seq8(-) SeqY
Cluster2 Seq8(-) SeqH
Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY
Cluster3 Seq10(+) SeqK
Cluster3 Seq10(+) SeqS
Cluster3 Seq10(+) SeqT
Cluster4 Seq300(+) SeqB
file2.txt
Clustername Names
Cluster1 SeqA
Cluster1 Seq1(+)
Cluster1 SeqC
Cluster1 Seq2(-)
Cluster1 SeqO
Cluster1 Seq3(+)
Cluster1 Seq90(+)
Cluster1 SeqB
Cluster1 SeqG
Cluster2 Seq8(-)
Cluster2 SeqY
Cluster2 SeqH
Cluster3 Seq10(+)
Cluster3 SeqK
Cluster4 SeqB
Cluster4 Seq300(+)
As you can see in file2.txt, SeqL is not present in Cluster1, so I want to remove this line from file1.txt:
Cluster1 Seq90(+) SeqL
Seq300(+) is not present in Cluster1 either, so I also remove this line from file1.txt:
Cluster1 Seq300(+) SeqB
The same applies to Cluster2: file2.txt has neither SeqP nor Seq79(-) in Cluster2, so I remove these lines from file1.txt:
Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY
Likewise, SeqS and SeqT are not in Cluster3 in file2.txt, so I remove the following two lines from file1.txt:
Cluster3 Seq10(+) SeqS
Cluster3 Seq10(+) SeqT
In the end I should get a new file1.txt such as:
Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq2(-) SeqA
Cluster1 Seq3(+) SeqB
Cluster1 Seq90(+) SeqO
Cluster1 Seq2(-) SeqC
Cluster2 Seq8(-) SeqY
Cluster2 Seq8(-) SeqH
Cluster3 Seq10(+) SeqK
Cluster4 Seq300(+) SeqB
Recommended answer
Use DataFrame.merge + DataFrame.reindex to get the original columns:
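import pandas as pd
# Assumes both files are tab-separated with a header row, as in the question
df1 = pd.read_csv('file1.txt', sep='\t')
df2 = pd.read_csv('file2.txt', sep='\t')
# Inner-merge twice so that Seqname1 and Seqname2 must both be listed in df2 for the same cluster
new_df = (df1.merge(df2, left_on=['Clustername', 'Seqname1'], right_on=['Clustername', 'Names'])
             .merge(df2, left_on=['Clustername', 'Seqname2'], right_on=['Clustername', 'Names'])
             .reindex(columns=df1.columns))
print(new_df)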
Output
Clustername Seqname1 Seqname2
0 Cluster1 Seq1(+) SeqA
1 Cluster1 Seq2(-) SeqA
2 Cluster1 Seq2(-) SeqC
3 Cluster1 Seq3(+) SeqB
4 Cluster1 Seq90(+) SeqO
5 Cluster2 Seq8(-) SeqY
6 Cluster2 Seq8(-) SeqH
7 Cluster3 Seq10(+) SeqK
8 Cluster4 Seq300(+) SeqB
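To try the merge approach without touching any files, here is a minimal sketch that loads the sample data from the question into memory. It makes two assumptions: the samples are written space-separated here, so they are parsed with sep=r"\s+" (use sep="\t" for real tab files), and drop_duplicates() is added as a precaution, since a repeated (Clustername, Names) pair in file2 would otherwise duplicate rows in the result. The row order after two successive merges can differ between pandas versions, so the check below compares rows as a set rather than by position.

```python
import io
import pandas as pd

FILE1 = """Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq2(-) SeqA
Cluster1 Seq3(+) SeqB
Cluster1 Seq300(+) SeqB
Cluster1 Seq90(+) SeqL
Cluster1 Seq90(+) SeqO
Cluster1 Seq2(-) SeqC
Cluster2 Seq8(-) SeqY
Cluster2 Seq8(-) SeqH
Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY
Cluster3 Seq10(+) SeqK
Cluster3 Seq10(+) SeqS
Cluster3 Seq10(+) SeqT
Cluster4 Seq300(+) SeqB
"""

FILE2 = """Clustername Names
Cluster1 SeqA
Cluster1 Seq1(+)
Cluster1 SeqC
Cluster1 Seq2(-)
Cluster1 SeqO
Cluster1 Seq3(+)
Cluster1 Seq90(+)
Cluster1 SeqB
Cluster1 SeqG
Cluster2 Seq8(-)
Cluster2 SeqY
Cluster2 SeqH
Cluster3 Seq10(+)
Cluster3 SeqK
Cluster4 SeqB
Cluster4 Seq300(+)
"""

# Whitespace-separated parsing is an assumption for this in-memory sample only.
df1 = pd.read_csv(io.StringIO(FILE1), sep=r"\s+")
# Guard against repeated (Clustername, Names) pairs, which would multiply rows.
df2 = pd.read_csv(io.StringIO(FILE2), sep=r"\s+").drop_duplicates()

# Each inner merge drops the rows whose (cluster, name) pair is missing from df2.
new_df = (
    df1.merge(df2, left_on=["Clustername", "Seqname1"], right_on=["Clustername", "Names"])
       .merge(df2, left_on=["Clustername", "Seqname2"], right_on=["Clustername", "Names"])
       .reindex(columns=df1.columns)  # discard the helper Names_x / Names_y columns
)

# Order-independent view of the surviving rows.
kept_rows = set(new_df.itertuples(index=False, name=None))
print(new_df)
```

The nine surviving rows should match the expected file1.txt in the question, possibly in a different order.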
Solution for n seqname columns:
# Row position within each cluster, used to rebuild the rows after melting
df1['aux'] = df1.groupby('Clustername').cumcount()
new_df = (df1.melt(['Clustername', 'aux'], var_name='Seq')
             .merge(df2, left_on=['Clustername', 'value'], right_on=['Clustername', 'Names'])
             .groupby(['Clustername', 'aux'])
             # keep a row only if every seqname column matched (all columns except Clustername and aux)
             .filter(lambda x: x['value'].size >= (len(df1.columns) - 2))
             .pivot_table(index=['Clustername', 'aux'], columns='Seq', values='value', aggfunc=''.join)
             .reset_index()
             .drop('aux', axis=1)
             .rename_axis(columns=None))
print(new_df)
Output
Clustername Seqname1 Seqname2
0 Cluster1 Seq1(+) SeqA
1 Cluster1 Seq2(-) SeqA
2 Cluster1 Seq3(+) SeqB
3 Cluster1 Seq90(+) SeqO
4 Cluster1 Seq2(-) SeqC
5 Cluster2 Seq8(-) SeqY
6 Cluster2 Seq8(-) SeqH
7 Cluster3 Seq10(+) SeqK
8 Cluster4 Seq300(+) SeqB
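As an alternative to melting and pivoting, the same rule can be written as a boolean mask, which keeps the original row order and index and works for any number of seqname columns. This is only a sketch, not the answer's method: the helper name filter_by_membership and its key/names parameters are made up for illustration, and a shortened sample of the question's data is used, parsed with sep=r"\s+" (use sep="\t" for real tab files). The result is written back out as tab-separated text, here into an in-memory buffer instead of a file.

```python
import io
import pandas as pd

FILE1 = """Clustername Seqname1 Seqname2
Cluster1 Seq1(+) SeqA
Cluster1 Seq300(+) SeqB
Cluster1 Seq90(+) SeqL
Cluster1 Seq90(+) SeqO
Cluster2 Seq8(-) SeqY
Cluster2 Seq8(-) SeqP
Cluster2 Seq79(-) SeqY
Cluster4 Seq300(+) SeqB
"""

FILE2 = """Clustername Names
Cluster1 SeqA
Cluster1 Seq1(+)
Cluster1 SeqO
Cluster1 Seq90(+)
Cluster1 SeqB
Cluster2 Seq8(-)
Cluster2 SeqY
Cluster4 SeqB
Cluster4 Seq300(+)
"""


def filter_by_membership(df1, df2, key="Clustername", names="Names"):
    """Keep rows of df1 whose every non-key value is listed in df2 for the same cluster.

    Hypothetical helper for illustration; not part of pandas.
    """
    allowed = set(zip(df2[key], df2[names]))          # valid (cluster, name) pairs
    seq_cols = [c for c in df1.columns if c != key]   # every seqname column
    keep = pd.Series(True, index=df1.index)
    for col in seq_cols:
        matches = [pair in allowed for pair in zip(df1[key], df1[col])]
        keep &= pd.Series(matches, index=df1.index)
    return df1[keep]


df1 = pd.read_csv(io.StringIO(FILE1), sep=r"\s+")
df2 = pd.read_csv(io.StringIO(FILE2), sep=r"\s+")

result = filter_by_membership(df1, df2)

# Write the filtered table as tab-separated text (a path such as
# "file1_filtered.txt" could be passed instead of the buffer).
buffer = io.StringIO()
result.to_csv(buffer, sep="\t", index=False)
output_lines = buffer.getvalue().splitlines()
print("\n".join(output_lines))
```

Because the mask only selects rows, no helper column such as aux is added to df1 and no reshaping is needed.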