给定另一个包含了来自ListA的部分字符串的ListB,对TSV字符串的ListA进行排序 [英] Sort ListA of TSV strings given another ListB that contains partial string from ListA
问题描述
我有2个列表:ListA
包含由制表符分隔的字符串,而ListB
包含与ListA
中的字符串部分匹配的字符串.我想通过使ListB
的部分字符串与ListA
中的字符串匹配,以ListB
中找到的相同顺序对ListA
中的字符串进行排序.
I have 2 lists: ListA
that contains strings delimited by tabs and ListB
that contains strings that partially match the strings in ListA
. I would like to order the strings in ListA
in the same order found in ListB
by having the partial strings of ListB
match the strings in ListA
.
我尝试的是在ListA
上循环,用\t
分隔每行,用_
分隔第5列,并将字符串附加到临时ListC
.然后,我订购了ListC
,但是我仍然不知道如何在给定ListC
的情况下订购实际的ListA
.
What I tried is to loop on ListA
, split each line by \t
, split the 5th column by _
and append the string to a temporary ListC
. Then, I ordered ListC
but I still don't know how I can order the actual ListA
given ListC
.
ListA = ['rs141130360\tchr1:16495\tC\t653635\tNC_024540.1\tTranscript\tintron_variant,non_coding_transcript_variant\t-\t-\t-\t-\t-\trs3210724\tG\tMODIFIER\t-\t-1\t-\tSNV\tWASH7P\tEntrezGene\tHGNC:38034\ttranscribed_pseudogene\t-\t-\t-\t-\t-\t-\t-\t-\t-\tRefSeq\tG\tG\tOK\t-\t-\t-\t-\t8/10\t-\t-\tNR_024540.1:n.1080+112C>G\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\n',
'rs141130360\tchr1:16495\tC\t100287102\tNR_046018.2\tTranscript\tdownstream_gene_variant\t-\t-\t-\t-\t-\trs3210724\tG\tMODIFIER\t2086\t1\t-\tSNV\tDDX11L1\tEntrezGene\tHGNC:37102\ttranscribed_pseudogene\t-\t-\t-\t-\t-\t-\t-\t-\t-\tRefSeq\tG\tG\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\n',
'rs141130360\tchr1:16495\tC\t102466751\tNG_106918.1\tTranscript\tdownstream_gene_variant\t-\t-\t-\t-\t-\trs3210724\tG\tMODIFIER\t874\t-1\t-\tSNV\tMIR6859-1\tEntrezGene\tHGNC:50039\tmiRNA\t-\t-\t-\t-\t-\t-\t-\t-\t-\tRefSeq\tG\tG\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\n']
ListB = ["NC", "NG", "NM", "NP", "NR", "XM", "XP", "XR", "WP"]
ListC = []
for i in ListA:
i_split = i.split("\t")[4].split("_")[0]
ListC.append(i_split)
ListC = sorted(ListC, key=lambda x: ListB.index(x))
print(ListC)
将打印:
['NC', 'NG', 'NR']
我的预期结果如下:
['rs141130360\tchr1:16495\tC\t653635\tNC_024540.1\tTranscript\tintron_variant,non_coding_transcript_variant\t-\t-\t-\t-\t-\trs3210724\tG\tMODIFIER\t-\t-1\t-\tSNV\tWASH7P\tEntrezGene\tHGNC:38034\ttranscribed_pseudogene\t-\t-\t-\t-\t-\t-\t-\t-\t-\tRefSeq\tG\tG\tOK\t-\t-\t-\t-\t8/10\t-\t-\tNR_024540.1:n.1080+112C>G\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\n',
'rs141130360\tchr1:16495\tC\t102466751\tNG_106918.1\tTranscript\tdownstream_gene_variant\t-\t-\t-\t-\t-\trs3210724\tG\tMODIFIER\t874\t-1\t-\tSNV\tMIR6859-1\tEntrezGene\tHGNC:50039\tmiRNA\t-\t-\t-\t-\t-\t-\t-\t-\t-\tRefSeq\tG\tG\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\n',
'rs141130360\tchr1:16495\tC\t100287102\tNR_046018.2\tTranscript\tdownstream_gene_variant\t-\t-\t-\t-\t-\trs3210724\tG\tMODIFIER\t2086\t1\t-\tSNV\tDDX11L1\tEntrezGene\tHGNC:37102\ttranscribed_pseudogene\t-\t-\t-\t-\t-\t-\t-\t-\t-\tRefSeq\tG\tG\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\n']
推荐答案
我宁愿将ListB
转换为[value, index]
字典,然后创建一个函数,该函数从字符串中提取值并在dict中查找它.那就是我们sorted
的key
函数.
I would instead convert ListB
to a [value, index]
dictionary, then create a function that extracts the value from the string and looks it up in the dict. That will be our key
function for sorted
.
d = {x: i for i, x in enumerate(ListB)}
def get_index(s):
by_tabs = s.split('\t')
by_underscore = by_tabs[4].split('_')
return d[by_underscore[0]]
listC = sorted(ListA, key=get_index)
这篇关于给定另一个包含了来自ListA的部分字符串的ListB,对TSV字符串的ListA进行排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!