How can I get words after and before a specific token?


Question

I am currently working on a project that builds basic corpus databases and tokenizes texts, but I am stuck on one point. Assume we have the following:

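import os, re

texts = []

# somedir is a directory of text files, each holding a very large plain text.
for name in os.listdir(somedir):
    # os.listdir returns bare file names, so join them with the directory.
    with open(os.path.join(somedir, name), 'r') as f:
        texts.append(f.read())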

Now I want to find the word before and after a token.

myToken = 'blue'
found = []
for i in texts:
    fnd = re.findall(r'[a-zA-Z0-9]+ %s [a-zA-Z0-9]+|\. %s [a-zA-Z0-9]+|[a-zA-Z0-9]+ %s\.' % (myToken, myToken, myToken), i, re.IGNORECASE | re.UNICODE)
    found.extend(fnd)

print(myToken)
for i in found:
    print('\t\t%s' % i)

I thought there were three possibilities: the token might start a sentence, it might end a sentence, or it might appear somewhere in the middle, so I used the regex above. When I run it, I get results like these:

blue
    My blue car # What I exactly want.
    he blue jac # That's not what I want. That must be "the blue jacket."
    eir blue phone # Wrong! > their
    a blue ali # Wrong! > alien
    . Blue is # Okay.
    is blue. # Okay.
    ...

I also tried patterns built on \b\w\b or \b\W\b, but those returned no results at all rather than wrong ones. I tried:

'\b\w\b%s\b[a-zA-Z0-9]+|\.\b%s\b\w\b|\b\w\b%s\.'
'\b\W\b%s\b[a-zA-Z0-9]+|\.\b%s\b\W\b|\b\W\b%s\.'

I hope the question is not too vague.

Solution

I think what you want is:

  1. (Optionally) a word and a space;
  2. (Always) 'blue';
  3. (Optionally) a space and a word.

Therefore one appropriate regex would be:

r'(?i)((?:\w+\s)?blue(?:\s\w+)?)'
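To make the mapping between the three parts and the pattern explicit, here is the same regex written with `re.VERBOSE`, so that each piece can carry its own comment. This is only a sketch; the name `CONTEXT_RE` and the sample sentence are mine, not from the question.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part is commented.
# CONTEXT_RE is an illustrative name, not something from the question.
CONTEXT_RE = re.compile(r"""
    (?:\w+\s)?   # 1. optionally, a word followed by one whitespace character
    blue         # 2. always, the token itself
    (?:\s\w+)?   # 3. optionally, one whitespace character followed by a word
""", re.IGNORECASE | re.VERBOSE)

print(CONTEXT_RE.findall("My blue car. Blue is nice"))
# ['My blue car', 'Blue is']
```

Because both neighbours are optional, the token still matches when it sits directly after a full stop or directly before one.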

For example:

>>> import re
>>> text = """My blue car
the blue jacket
their blue phone
a blue alien
End sentence. Blue is
is blue."""
>>> re.findall(r'(?i)((?:\w+\s)?{0}(?:\s\w+)?)'.format('blue'), text)
['My blue car', 'the blue jacket', 'their blue phone', 'a blue alien', 'Blue is', 'is blue']
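One thing to watch for: the pattern above has no word boundaries around the token, so it would also match inside longer words such as `bluebird`. The sketch below wraps the token in `\b` and passes it through `re.escape` in case it contains regex metacharacters. It also captures the neighbouring words as separate groups. The function name `find_context` and the tuple layout are my own choices for illustration.

```python
import re

def find_context(token, text):
    """Return (before, match, after) tuples for each occurrence of token.

    Hypothetical helper: 'before' and 'after' are empty strings when the
    token has no neighbouring word separated by a single whitespace character.
    """
    pattern = re.compile(
        r'(?:(\w+)\s)?\b({0})\b(?:\s(\w+))?'.format(re.escape(token)),
        re.IGNORECASE,
    )
    # findall returns one tuple per match; unmatched optional groups become ''
    return pattern.findall(text)

print(find_context('blue', 'a bluebird and their blue phone. Blue is'))
# [('their', 'blue', 'phone'), ('', 'Blue', 'is')]
```

Note that `re.findall` returns non-overlapping matches, so a word that serves as the "after" neighbour of one match cannot also be the "before" neighbour of the next.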

See demo and token-by-token explanation here.
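As for why the `\b\w\b` attempts in the question returned nothing: `\w` without a quantifier matches exactly one character, so `\b\w\b` only matches one-letter words. Those patterns also leave no whitespace between the neighbouring word and the token, and `\b` cannot match between two word characters, so `\b\w\bblue` can never succeed. A short check, as a sketch with a made-up sample string:

```python
import re

text = 'a blue alien and the blue jacket'

# \b\w\b matches only words that are exactly one character long.
print(re.findall(r'\b\w\b', text))          # ['a']

# No match: \b would have to sit between two word characters.
print(re.findall(r'\b\w\bblue', text))      # []

# Adding a quantifier and the whitespace lets the neighbouring word match.
print(re.findall(r'\b\w+\sblue\b', text))   # ['a blue', 'the blue']
```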
