Find different realizations of a word in a sentence string - Python


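Question

(This question is about string checking in general rather than natural language processing per se. If you do view it as an NLP problem, imagine a language that current analyzers cannot handle; for simplicity, I'll use English strings as examples.)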

Let's say there are only 6 possible forms in which a word can be realized:

  1. capitalized initial letter
  2. plural form with "s"
  3. plural form with "es"
  4. capitalized + "es"
  5. capitalized + "s"
  6. the basic form, without plural or capitalization
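As a rough illustration of the list above, the six forms can be generated from the base word. This is only a sketch; the helper name `six_forms` is hypothetical and does not appear in the original post.

```python
def six_forms(word):
    """Return the six surface forms described above for a base word."""
    cap = word.capitalize()
    # base, plural "s", plural "es", and the capitalized version of each
    return {word, word + "s", word + "es", cap, cap + "s", cap + "es"}

forms = six_forms("coach")
print(sorted(forms))
```

Storing the forms in a set makes a later membership check (`token in forms`) a single lookup instead of a chain of comparisons.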

Let's say I want to find the index of the first instance where any form of the word coach occurs in a sentence. Is there a simpler way than these 2 methods?

Long if condition

sentence = "this is a sentence with the Coaches"
target = "coach"

print(target.capitalize())

# j is the word index, i is the word itself
for j, i in enumerate(sentence.split(" ")):
  if i == target.capitalize() or i == target.capitalize()+"es" or \
     i == target.capitalize()+"s" or i == target+"es" or i==target+"s" or \
     i == target:
    print(j)

Iterative try-except

variations = [target, target+"es", target+"s", target.capitalize()+"es",
              target.capitalize()+"s", target.capitalize()]

words = sentence.split(" ")
for i in variations:
  try:
    # list.index raises ValueError when this form is not in the sentence
    j = words.index(i)
    print(j)
  except ValueError:
    continue

Recommended answer

I recommend having a look at the stem package of NLTK: http://nltk.org/api/nltk.stem.html

Using it you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for e.g. grammatical role, tense, derivational morphology, leaving only the stem of the word."

If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and don't want to bother with NLTK, you should still write your code as a collection of small, easy-to-combine utility functions, for example:

import string

def variation(stem, word):
    # True if word is the stem or one of its plural forms, ignoring case
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    # Lazily yield (index, word) for every word that is a form of stem
    sentence = cleanPunctuation(sentence).split()
    return ( (i, w) for i, w in enumerate(sentence) if variation(stem, w) )

def cleanPunctuation(sentence):
    # Drop all ASCII punctuation characters from the sentence
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    # Return the first (index, word) match, or None if there is none
    for i, w in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."

print(firstVariation(sentence, 'coach'))

# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i, w in variations(sentence, 'coach')]))
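If a regular expression is acceptable, the six forms from the question could also be matched with a single pattern. The sketch below is an alternative to the helper functions above and not part of the original answer; the function name `first_form_index` is hypothetical. Unlike `variation`, which lowercases each word, it accepts only the lowercase and capitalized spellings, so a word such as "COACH" is not matched.

```python
import re
import string

def first_form_index(sentence, stem):
    """Return (word_index, word) for the first form of stem, or None."""
    # stem or capitalized stem, followed by an optional "es" or "s" suffix
    pattern = re.compile(
        "(?:%s|%s)(?:es|s)?" % (re.escape(stem), re.escape(stem.capitalize()))
    )
    for index, token in enumerate(sentence.split()):
        # strip leading/trailing punctuation such as "," or "."
        word = token.strip(string.punctuation)
        if pattern.fullmatch(word):
            return index, word
    return None

print(first_form_index("this is a sentence with the Coaches", "coach"))
```

Using `fullmatch` on each token, rather than `search` on the whole sentence, keeps the result a word index as in the question and avoids matching longer words that merely contain the stem.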
