Find different realizations of a word in a sentence string - Python
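Question
(This question is about string checking in general rather than natural language processing per se. If you do view it as an NLP problem, imagine that the language is one that current analyzers cannot handle; for simplicity, I will use English strings as examples.)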
Let's say there are only 6 possible forms in which a word can be realized:
- the first letter capitalized
- its plural form with "s"
- its plural form with "es"
- capitalized + "es"
- capitalized + "s"
- the basic form, without plural or capitalization
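The six forms listed above can be generated from the base word rather than typed out by hand. The following is a minimal sketch of that idea; `word_forms` is a hypothetical helper name that does not appear in the original post, and it assumes the base word is passed in lowercase.

```python
def word_forms(base):
    """Return the six surface forms of a lowercase base word.

    Hypothetical helper for illustration; it assumes `base` is already
    lowercase, because str.capitalize() lowercases every character
    after the first one.
    """
    capitalized = base.capitalize()
    return {
        base,                 # basic form
        base + "s",           # plural with "s"
        base + "es",          # plural with "es"
        capitalized,          # capitalized
        capitalized + "s",    # capitalized + "s"
        capitalized + "es",   # capitalized + "es"
    }


forms = word_forms("coach")
print(sorted(forms))
```

Holding the forms in a set makes a later membership check (`word in forms`) a single expression instead of a chain of `or` comparisons.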
Let's say I want to find the index of the first instance where any form of the word coach occurs in a sentence. Is there a simpler way of doing this than the following two methods?
Long if condition
sentence = "this is a sentence with the Coaches"
target = "coach"

print(target.capitalize())

# Compare every word of the sentence against each of the six forms.
for j, i in enumerate(sentence.split(" ")):
    if i == target.capitalize() or i == target.capitalize()+"es" or \
       i == target.capitalize()+"s" or i == target+"es" or i == target+"s" or \
       i == target:
        print(j)
Iterative try-except
variations = [target, target+"es", target+"s", target.capitalize()+"es",
              target.capitalize()+"s", target.capitalize()]

ind = 0  # unused in this snippet
for i in variations:
    try:
        # list.index() raises ValueError when the form is not in the sentence
        j = sentence.split(" ").index(i)
        print(j)
    except ValueError:
        continue
Recommended answer
I recommend having a look at the stem package of NLTK: http://nltk.org/api/nltk.stem.html
Using it, you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for eg. grammatical role, tense, derivational morphology leaving only the stem of the word."
If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and don't want to bother with NLTK, you should still write your code as a collection of small, easily combined utility functions, for example:
import string

def variation(stem, word):
    # True if word, ignoring case, is the stem or one of its plural forms
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    # Generator of (index, word) pairs for every word that is a form of stem
    sentence = cleanPunctuation(sentence).split()
    return ((i, w) for i, w in enumerate(sentence) if variation(stem, w))

def cleanPunctuation(sentence):
    # Drop all ASCII punctuation characters from the sentence
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    # Return the first (index, word) match; implicitly None if there is none
    for i, w in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."

print(firstVariation(sentence, 'coach'))

# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i, w in variations(sentence, 'coach')]))
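As an alternative that is not part of the original answer, the same lookup can be sketched with a regular expression from the standard `re` module. The names `build_pattern` and `first_form_index` are hypothetical, and the sketch assumes a non-empty stem. Unlike `variation` above, which lowercases the whole word and therefore also accepts a spelling such as "COACH", this pattern accepts only the six forms listed in the question.

```python
import re
import string


def build_pattern(stem):
    """Compile a pattern for stem, stem+'s' and stem+'es', optionally capitalized.

    Hypothetical helper; assumes `stem` is a non-empty string.
    """
    first, rest = stem[0], stem[1:]
    # The character class accepts the first letter in either case;
    # the remaining letters must match exactly as given.
    first_class = "[" + re.escape(first.lower()) + re.escape(first.upper()) + "]"
    return re.compile(first_class + re.escape(rest) + "(?:es|s)?")


def first_form_index(sentence, stem):
    """Return (word_index, word) for the first matching word, or None."""
    pattern = build_pattern(stem)
    for index, token in enumerate(sentence.split()):
        # Strip leading and trailing punctuation such as "," or "."
        word = token.strip(string.punctuation)
        if pattern.fullmatch(word):
            return index, word
    return None


print(first_form_index("this is a sentence with the Coaches", "coach"))
```

Using `fullmatch` rather than `search` keeps longer words such as "coaching" from matching on their prefix.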