Is there a way to find a phrase and capture the next token value?
Problem description
So I have a file like this on the server:
COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T
COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N
My goal is to find the id (P17544), which is in column 5 of the file, and capture/store (I need to print it later) the number in the token right after it, which is 436 (the number is supposed to sit between two letters), from A436T in column 6. Is there a way I can do this? I worked a little with lxml before but I'm still not sure how to do this. Thanks in advance.
This is what I have:
file = open('text.txt','r')
lookup = {}
for line in file:
    myid, token = line.rsplit(' ', 2)[1:]
    token = token[1:-1]
Recommended answer
The simplest method, using builtin str methods:
d = 'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T'
myid, token = d.rsplit(' ', 2)[1:] # raises ValueError if it can't be unpacked, so you know you've got exactly 2 elements...
token = token[1:-1]
You could use a regular expression instead if you want to insist on digits between two letters, e.g. re.match(r'[A-Z](\d{3})[A-Z]', token) applied to the unsliced token ('A436T'), or similar...
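As a rough sketch of that regular-expression variant: the helper name `extract_position` is made up for illustration, and `\d+` is used instead of `\d{3}` on the assumption that the number is not always three digits long.

```python
import re

# Assumed pattern: one capital letter, one or more digits, one capital letter.
TOKEN_RE = re.compile(r'^[A-Z](\d+)[A-Z]$')

def extract_position(token):
    """Return the digits between the two letters, or None if the token doesn't match."""
    match = TOKEN_RE.match(token.strip())
    return match.group(1) if match else None

print(extract_position('A436T'))  # 436
print(extract_position('H133N'))  # 133
print(extract_position('NGXA'))   # None
```

Unlike plain slicing, this returns None for a token that does not have the letter-digits-letter shape, so malformed rows can be skipped or reported.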
Explanation:
d.rsplit(' ', 2) splits the string at ' ' starting from the end, at most twice, which returns ['COADREAD ATF7 Missense_Mutation NGXA-AZ-3984', 'P17544', 'A436T']. Since we only want the last 2 elements, we drop the first one with a slice, so d.rsplit(' ', 2)[1:] gives ['P17544', 'A436T'].
Using unpacking, we name our variables and also guarantee the result has a length of two: with myid, token = d.rsplit(' ', 2)[1:], the assignment fails (raises a ValueError) if there aren't exactly two elements.
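To illustrate that failure mode, here is a minimal sketch that catches the error instead of letting it propagate. The function name `split_id_and_token` is hypothetical, and returning None for bad lines is just one possible choice.

```python
def split_id_and_token(line):
    """Return (id, token) from the last two space-separated fields, or None."""
    try:
        myid, token = line.rsplit(' ', 2)[1:]
    except ValueError:
        # Raised when the slice does not hold exactly two items,
        # e.g. for a blank line or a line with fewer than three fields.
        return None
    return myid, token

print(split_id_and_token('COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T'))
print(split_id_and_token('P17544'))
```

The first call yields the ('P17544', 'A436T') pair; the second has only one field, so the unpacking fails and the function returns None.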
Now that myid holds the id you want, remove the first and last characters from token using slicing: token = token[1:-1].
Then:
print(myid, token)
# P17544 436
Regarding the comment about lookups:
For looking up after parsing the lines of the file:
lookup = {}
for line in file:
    # do steps above so you have myid, token
    lookup[myid] = token
Then lookup['P17544'] will return '436'
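Putting the pieces together, a minimal end-to-end sketch might look like the following. The function name `build_lookup` is made up for illustration, and the code assumes the columns are separated by single spaces as in the sample. It strips each line first, because lines read from a file end in a newline that would otherwise throw off the token[1:-1] slice.

```python
def build_lookup(lines):
    """Map each id (column 5) to the digits inside its token (column 6)."""
    lookup = {}
    for line in lines:
        line = line.strip()  # drop the trailing newline before slicing
        if not line:
            continue  # skip blank lines
        myid, token = line.rsplit(' ', 2)[1:]
        lookup[myid] = token[1:-1]  # 'A436T' -> '436'
    return lookup

# Sample lines standing in for the contents of the file.
sample = [
    'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T\n',
    'COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N\n',
]
lookup = build_lookup(sample)
print(lookup['P17544'])  # 436
print(lookup['Q9H0Y0'])  # 133
```

With a real file, the same function can be fed the open file object, for example inside `with open('text.txt') as f:`. Note that if an id appears on several lines with different tokens, the last one wins, since a dict holds one value per key.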
Hope that's clearer...