All possible wordform completions of a (biomedical) word's stem
Question
I'm familiar with word stemming and completion from the tm package in R.
I'm trying to come up with a quick and dirty method for finding all variants of a given word within some corpus. For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte".
If I had to do it right now, I would probably just go with something like:
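library(tm)
library(RWeka)
data("crude")  # sample Reuters corpus shipped with tm
# Vocabulary: every distinct word in the corpus
dictionary <- unique(unlist(lapply(crude, words)))
# Stem the query, then keep every vocabulary entry that contains the stem
grep(pattern = LovinsStemmer("company"),
     ignore.case = TRUE, x = dictionary, value = TRUE)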
I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.
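For comparison, the same stem-then-grep idea can be sketched in Python. This is only an illustration: the function name `find_variants`, the small vocabulary, and the stem "leukocyt" are all made up for the example (the stem is hand-picked, not the output of the Lovins stemmer), and unlike R's grep the stem is escaped so it is matched literally rather than as a regular expression.

```python
import re


def find_variants(stem, vocabulary):
    """Return the vocabulary entries that contain `stem`, ignoring case."""
    pattern = re.compile(re.escape(stem), re.IGNORECASE)
    return sorted({w for w in vocabulary if pattern.search(w)})


# Hypothetical vocabulary and hand-picked stem, for illustration only.
vocabulary = ["Leukocyte", "leukocytes", "leukocytic", "leukemia", "erythrocyte"]
variants = find_variants("leukocyt", vocabulary)
print(variants)
```

Like the grep call, this scans the whole vocabulary on every query, so the cost grows with the size of the corpus.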
I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.
Recommended answer
This solution requires preprocessing your corpus, but once that is done, finding the variants of a word is a very quick dictionary lookup.
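from collections import defaultdict

from stemming.porter2 import stem

# Load the word list, one word per line.
with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

# Preprocessing: group every word under its Porter2 stem.
stems = defaultdict(list)
for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    # Lookup: stem the query and fetch all words that share that stem.
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])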
For the /usr/share/dict/words corpus, this produces the result:
['leukocyte', "leukocyte's", 'leukocytes']
It uses the stemming module, which can be installed with:
pip install stemming
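The same preprocess-then-lookup idea can be applied to free text from your own corpus instead of a word list. The sketch below is a rough illustration under stated assumptions: `toy_stem` is a made-up stand-in that strips a handful of suffixes so the example runs without third-party packages (in practice you would pass a real stemmer, such as `stem` from the stemming module), and `build_stem_index`, the tokenizing regex, and the sample sentence are likewise hypothetical.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few common suffixes."""
    word = word.lower()
    for suffix in ("'s", "ic", "es", "s", "e"):
        # Only strip when at least four characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def build_stem_index(text, stem_fn):
    """Map each stem to the set of lowercased word forms seen in `text`."""
    index = defaultdict(set)
    for token in re.findall(r"[A-Za-z][A-Za-z'-]*", text):
        index[stem_fn(token)].add(token.lower())
    return index


def lookup(index, word, stem_fn):
    """Return all indexed word forms that share a stem with `word`."""
    return sorted(index.get(stem_fn(word), set()))


# Made-up sample text, for illustration only.
corpus = ("Leukocytes migrate to the tissue. A leukocyte count and "
          "leukocytic infiltration were noted.")
index = build_stem_index(corpus, toy_stem)
variants = lookup(index, "leukocyte", toy_stem)
print(variants)
```

Passing the stemmer in as a function makes it easy to compare a conservative stemmer against a more aggressive one on the same corpus without changing the indexing code.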