Retrieving verb stems from a list of verbs
Question
I have a list of strings, all of which are verbs. I need the word frequency of each verb, but I want to count forms such as "want", "wants", "wanting" and "wanted" as a single verb. Formally, a "verb" is a set of 4 words of the form {X, Xs, Xed, Xing} or of the form {Xe, Xes, Xed, Xing}. How would I go about extracting verbs from the list so that I get "X" together with a count of how many times that stem appears? I figured I could somehow use regex, but I'm a regex n00b and am totally lost.
Recommended answer
There is a library called nltk which has an insane array of functions for text processing. One subset of those functions is the stemmers, which do just what you want (using algorithms/code developed by people with a lot of experience in the area). Here is the result using the Porter stemming algorithm:
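In [3]: import nltk
In [4]: verbs = ["want", "wants", "wanting", "wanted"]
In [5]: stemmer = nltk.stem.porter.PorterStemmer()  # build once, reuse for every word
In [6]: for verb in verbs:
   ...:     print(stemmer.stem(verb))  # older NLTK releases called this stem_word()
   ...:
want
want
want
want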
You could use this in conjunction with a defaultdict to do something like this (note: in Python 2.7+, a Counter would be equally useful/better):
In [2]: from collections import defaultdict
In [3]: from nltk.stem.porter import PorterStemmer
In [4]: verbs = ["want", "wants", "wanting", "wanted", "running", "runs", "run"]
In [5]: stemmer = PorterStemmer()  # build once, reuse for every word
In [6]: freq = defaultdict(int)  # stems not seen yet start at 0
In [7]: for verb in verbs:
   ...:     freq[stemmer.stem(verb)] += 1  # older NLTK releases called this stem_word()
   ...:
In [8]: sorted(freq.items())
Out[8]: [('run', 3), ('want', 4)]
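For the Counter variant mentioned above, a minimal sketch could look like the following. The helper name stem_counts and the toy_stem lookup table are assumptions made up for illustration; toy_stem only exists so the snippet runs without NLTK installed, and in practice you would pass PorterStemmer().stem as the stem argument.

```python
from collections import Counter


def stem_counts(words, stem):
    """Tally stems; `stem` is any callable that maps a word to its stem."""
    return Counter(stem(word) for word in words)


# Hypothetical stand-in for a real stemmer, so this sketch has no dependencies.
# With NLTK installed you would pass PorterStemmer().stem instead.
TOY_STEMS = {"wants": "want", "wanting": "want", "wanted": "want"}


def toy_stem(word):
    # Words missing from the table are returned unchanged.
    return TOY_STEMS.get(word, word)


verbs = ["want", "wants", "wanting", "wanted"]
counts = stem_counts(verbs, toy_stem)
print(counts.most_common())  # [('want', 4)]
```

Counter does the same bookkeeping as defaultdict(int), and most_common() gives you the stems ordered by frequency for free.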
One thing to note: the stemmers aren't perfect. For instance, adding ran to the list above leaves it under its own stem instead of folding it into run, so the sorted counts become:
[('ran', 1), ('run', 3), ('want', 4)]
Hopefully, though, it will get you close to what you want.
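If you would rather stick to the literal rule from the question ({X, Xs, Xed, Xing} or {Xe, Xes, Xed, Xing}) and skip the dependency, plain string checks may be enough; no regex needed. This is only a rough sketch under a strong assumption: a word is folded into a base form only when that base form (X or Xe) itself appears in the list. The names base_form and verb_frequencies are made up for illustration, and doubled consonants ("running") or irregular forms ("ran") are not handled.

```python
from collections import Counter

SUFFIXES = ("ing", "ed", "s")


def base_form(word, vocabulary):
    """Return X (or Xe) if stripping a suffix yields a word in `vocabulary`."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            # Try the {X, Xs, Xed, Xing} family first, then {Xe, Xes, Xed, Xing}.
            for candidate in (root, root + "e"):
                if candidate in vocabulary:
                    return candidate
    # No known base form found: treat the word as its own base.
    return word


def verb_frequencies(words):
    """Count words per base form, using the word list itself as the vocabulary."""
    vocabulary = set(words)
    return Counter(base_form(word, vocabulary) for word in words)


verbs = ["want", "wants", "wanting", "wanted", "bake", "bakes", "baked", "baking"]
print(sorted(verb_frequencies(verbs).items()))  # [('bake', 4), ('want', 4)]
```

Checking candidates against the list keeps words like "need" or "pass" from being chopped, at the cost of missing verbs whose base form never occurs in your data.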