Is there a way to find a phrase and capture the next token value?


Question

I have a file like this on the server:

COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T
COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N

My goal is to find the ID (P17544), which is in column 5 of the file, and capture and store the number from the token that follows it (I need to print that number later). Here that number is 436, which sits between the two letters of A436T in column 6. Is there a way to do this? I have worked a little with lxml before, but I'm still not sure how to approach this. Thanks in advance.

This is what I have so far:

file = open('text.txt', 'r')
lookup = {}
for line in file:
    # split the current line (not the file object); strip() drops the trailing newline
    myid, token = line.strip().rsplit(' ', 2)[1:]
    token = token[1:-1]

Recommended answer

The simplest method uses built-in str methods:

d = 'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T'
myid, token = d.rsplit(' ', 2)[1:]  # raises ValueError if the slice can't be unpacked, so you know you have exactly 2 elements
token = token[1:-1]

You could use a regular expression instead if you want to require that the number sits between two letters, for example re.match(r'[A-Z](\d{3})[A-Z]', token) on the unsliced token, or similar.
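As a minimal sketch of the regular-expression variant, assuming the mutation token always has the shape letter, digits, letter. The helper name extract_position is hypothetical, and \d+ is used in place of \d{3} so that positions of any length match:

```python
import re

# Assumed token shape: one uppercase letter, one or more digits, one uppercase letter.
TOKEN_RE = re.compile(r'[A-Z](\d+)[A-Z]')

def extract_position(token):
    """Return the digits between the two letters, or None if the shape differs."""
    match = TOKEN_RE.fullmatch(token)
    return match.group(1) if match else None

print(extract_position('A436T'))  # digits between 'A' and 'T'
print(extract_position('A43'))    # no trailing letter, so no match
```

Unlike plain slicing, this rejects tokens that do not fit the expected shape instead of silently trimming them.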

Explanation:

d.rsplit(' ', 2) splits the string on ' ' starting from the end, at most twice, and returns ['COADREAD ATF7 Missense_Mutation NGXA-AZ-3984', 'P17544', 'A436T']. Since we only want the last two elements, we drop the first one with a slice: d.rsplit(' ', 2)[1:] gives ['P17544', 'A436T'].

Unpacking names the variables and also guarantees there are exactly two elements: with myid, token = d.rsplit(' ', 2)[1:], the assignment raises a ValueError if the slice does not contain exactly two elements.
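To illustrate the failure case, here is a small sketch. The wrapper function split_id_and_token is a hypothetical name, and returning None for a malformed line is one possible choice rather than part of the original answer:

```python
def split_id_and_token(line):
    """Return (myid, token) from the last two space-separated fields, or None."""
    try:
        # Unpacking fails with ValueError unless the slice has exactly two items.
        myid, token = line.rsplit(' ', 2)[1:]
    except ValueError:
        return None
    return myid, token

print(split_id_and_token('COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T'))
print(split_id_and_token('P17544'))  # only one field, so unpacking fails
```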

Now myid is the ID you want, and you remove the first and last characters from token with a slice: token = token[1:-1].

Then:

print(myid, token)
# P17544 436

Comment regarding lookup:

To look up values after parsing the lines of the file:

lookup = {}
for line in file:
    # do steps above so you have myid, token
    lookup[myid] = token

Then lookup['P17544'] will return '436'.
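Putting the pieces together, a minimal end-to-end sketch might look like the following. The function name build_lookup is hypothetical, and the sample lines stand in for the contents of text.txt; it also assumes every non-blank line ends with the ID and the mutation token:

```python
def build_lookup(lines):
    """Map each ID (column 5) to the number inside its mutation token (column 6)."""
    lookup = {}
    for line in lines:
        line = line.strip()  # drop the trailing newline
        if not line:
            continue  # skip blank lines
        # rsplit(None, 2) splits on any whitespace, from the right, at most twice
        myid, token = line.rsplit(None, 2)[1:]
        lookup[myid] = token[1:-1]  # strip the leading and trailing letters
    return lookup

sample = [
    'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T\n',
    'COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N\n',
]
lookup = build_lookup(sample)
print(lookup['P17544'])
```

For a real file, something like `with open('text.txt') as f: lookup = build_lookup(f)` should work the same way. Note that if an ID appears on more than one line, the later line overwrites the earlier entry.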

Hope that's clearer...
