Retrieving verb stems from a list of verbs

Question

I have a list of strings, all of which are verbs. I need the word frequency for each verb, but I want to count forms such as "want", "wants", "wanting" and "wanted" as one verb. Formally, a "verb" is defined as a set of four words of the form {X, Xs, Xed, Xing} or of the form {Xe, Xes, Xed, Xing}. How would I go about extracting verbs from the list so that I get each stem "X" and a count of how many times it appears? I figured I could somehow use regex, but I'm a regex n00b and am totally lost.

Recommended answer

There is a library called nltk which has an insane array of functions for text processing. One subset of those functions is the stemmers, which do just what you want (using algorithms and code developed by people with a lot of experience in the area). Here is the result using the Porter stemming algorithm:

In [3]: import nltk

In [4]: verbs = ["want", "wants", "wanting", "wanted"]

In [5]: for verb in verbs:
   ...:     print(nltk.stem.porter.PorterStemmer().stem(verb))
   ...:     
want
want
want
want

You could use this in conjunction with a defaultdict to do something like this (note: in Python 2.7+, a Counter would be equally useful/better):

In [2]: from collections import defaultdict

In [3]: from nltk.stem.porter import PorterStemmer

In [4]: verbs = ["want", "wants", "wanting", "wanted", "running", "runs", "run"]

In [5]: freq = defaultdict(int)

In [6]: for verb in verbs:
   ...:     freq[PorterStemmer().stem(verb)] += 1
   ...:     

In [7]: freq
Out[7]: defaultdict(<type 'int'>, {'run': 3, 'want': 4})
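
The Counter variant mentioned in the note can be sketched roughly as follows. The helper names `count_stems` and `count_porter_stems` are made up for this illustration and are not part of nltk; the only nltk call used is `PorterStemmer().stem`. The stemmer is created once instead of once per word, and the import is kept inside the function so the generic helper works without nltk installed.

```python
from collections import Counter


def count_stems(words, stem):
    """Count how often each stem occurs.

    `stem` is any callable mapping a word to its stem
    (hypothetical helper, not an nltk function).
    """
    return Counter(stem(word) for word in words)


def count_porter_stems(words):
    """Count stems using NLTK's Porter stemmer (requires `pip install nltk`)."""
    # Imported lazily so count_stems stays usable without nltk.
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()  # build one stemmer and reuse it for every word
    return count_stems(words, stemmer.stem)
```

Calling `count_porter_stems(verbs)` on the seven-verb list from the session should give the same counts as the defaultdict version, and `.most_common()` on the result lists the stems from most to least frequent.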

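One thing to note: the stemmers aren't perfect. Porter stemming only strips suffixes by rule, so irregular forms are not mapped back to their base verb. For instance, adding "ran" to the verbs list above yields this as the result: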

defaultdict(<type 'int'>, {'ran': 1, 'run': 3, 'want': 4})

Hopefully, though, it will get you close to what you want.
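
If you would rather apply the question's formal definition literally, without nltk, a minimal sketch might look like the one below. It uses plain suffix checks rather than a regex, and it assumes the base form (X or Xe) appears somewhere in the list; a word whose base is missing is counted under its own spelling. The function names are hypothetical, and ambiguous cases (a word that fits both patterns) simply take the first matching candidate.

```python
from collections import Counter


def candidate_bases(word):
    """Return base forms `word` could derive from under the two patterns
    {X, Xs, Xed, Xing} and {Xe, Xes, Xed, Xing}."""
    if word.endswith("ing"):
        return [word[:-3], word[:-3] + "e"]  # Xing -> X or Xe
    if word.endswith("ed"):
        return [word[:-2], word[:-1]]        # Xed -> X or Xe
    if word.endswith("s"):
        return [word[:-1]]                   # Xs -> X, Xes -> Xe
    return []


def count_verbs(words):
    """Group inflected forms under a base form that occurs in the list."""
    vocabulary = set(words)
    counts = Counter()
    for word in words:
        # Use the first candidate that is itself in the list; else keep the word.
        base = next((c for c in candidate_bases(word) if c in vocabulary), word)
        counts[base] += 1
    return counts


verbs = ["want", "wants", "wanting", "wanted",
         "bake", "bakes", "baked", "baking", "need"]
print(count_verbs(verbs))
```

Here "want" and "bake" are each counted four times, while "need" stays on its own because neither "ne" nor "nee" is in the list. Unlike the Porter stemmer, this only handles the two patterns from the question.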
