Searching for Unicode characters in Python

Question

I'm working on an NLP project based on Python/NLTK with non-English Unicode text. For that, I need to search for a Unicode string inside a sentence.

I have a .txt file containing some non-English Unicode sentences. Using NLTK's PunktSentenceTokenizer, I split the text into sentences and saved them in a Python list.

sentences = PunktSentenceTokenizer().tokenize(text)
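
For the tokenizer to see Unicode text rather than raw bytes, the file has to be decoded when it is read. The sketch below shows one way to do that. It assumes the file is UTF-8 encoded, which the question does not state, and both read_unicode_text and the file name in the comment are hypothetical.

```python
import io


def read_unicode_text(path, encoding="utf-8"):
    """Read a whole file and return its contents as Unicode text."""
    # io.open decodes bytes to Unicode on both Python 2 and Python 3.
    with io.open(path, encoding=encoding) as handle:
        return handle.read()


# Hypothetical usage; "sentences.txt" is a placeholder file name:
# text = read_unicode_text("sentences.txt")
```

If the file uses a different encoding, pass it through the encoding argument instead of relying on the default.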

Now I can iterate through the list and get each sentence separately.

What I need to do is go through each sentence and identify which word contains the given Unicode characters.

Example:

sentence = 'AASFG BBBSDC FEKGG SDFGF'

Assume the text above is non-English Unicode. I need to find the words ending with GF and return the whole word (and possibly the index of that word).

search = 'SDFGF'

Similarly, I need to find the words starting with BB and get the whole word.

search2 = 'BBBSDC'

Recommended answer

If I understand correctly, you just have to split the sentence into words, loop over each one, and check whether it starts or ends with the required characters, e.g.:

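>>> sentence = ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']
>>> [word for word in sentence if word.endswith("GF")]
['SDFGF']

Here sentence is already a list of words. If you start from a single string instead, split it first with sentence.split(), which could be replaced with nltk.tokenize.word_tokenize(sentence).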
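
The same check extends to prefixes and to returning the index of each match, which the question also asks for. This is a minimal sketch, assuming Python 3, where every str is Unicode. find_words is a hypothetical helper, and the Russian sample sentence is only an illustration of non-English text.

```python
def find_words(words, prefix=None, suffix=None):
    """Return (index, word) pairs for words matching the prefix and/or suffix."""
    matches = []
    for idx, word in enumerate(words):
        # Skip words that fail a condition the caller asked for.
        if prefix is not None and not word.startswith(prefix):
            continue
        if suffix is not None and not word.endswith(suffix):
            continue
        matches.append((idx, word))
    return matches


# Illustrative sample text, split on whitespace.
words = "быстрая лиса прыгает быстро".split()

print(find_words(words, prefix="быстр"))  # [(0, 'быстрая'), (3, 'быстро')]
print(find_words(words, suffix="ет"))     # [(2, 'прыгает')]
```

Whitespace splitting is used here only to keep the sketch free of dependencies; a tokenizer would also separate punctuation from the words.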

Update, regarding the comment:

How can I get the words in front of it and behind it?

The enumerate function can be used to give each word a number, like this:

>>> print(list(enumerate(sentence)))
[(0, 'AASFG'), (1, 'BBBSDC'), (2, 'FEKGG'), (3, 'SDFGF')]

Then, if you run the same loop but preserve the index:

>>> results = [(idx, word) for (idx, word) in enumerate(sentence) if word.endswith("GG")]
>>> print(results)
[(2, 'FEKGG')]

You can then use the index to get the next or previous item:

>>> for r in results:
...     r_idx = r[0]
...     print("Prev", sentence[r_idx-1])
...     print("Next", sentence[r_idx+1])
...
Prev BBBSDC
Next SDFGF

You'd need to handle the case where the match is the very first or last word (if r_idx == 0, or if r_idx == len(sentence) - 1). At index 0, sentence[r_idx-1] silently wraps around to the last word, and at the last index, sentence[r_idx+1] raises an IndexError.
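
One way to handle those edge cases is to check the index before looking at either neighbour. The sketch below is only an illustration: neighbors is a hypothetical helper, not part of the original answer, and returning None at the edges is an assumed convention.

```python
def neighbors(words, idx):
    """Return the (previous, next) words around idx, using None at the edges."""
    prev_word = words[idx - 1] if idx > 0 else None
    next_word = words[idx + 1] if idx < len(words) - 1 else None
    return prev_word, next_word


sentence = ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']

print(neighbors(sentence, 0))  # (None, 'BBBSDC')
print(neighbors(sentence, 2))  # ('BBBSDC', 'SDFGF')
print(neighbors(sentence, 3))  # ('FEKGG', None)
```

Returning None keeps the caller from silently picking up the word at the other end of the list.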
