All possible wordform completions of a (biomedical) word's stem

Question

I'm familiar with word stemming and completion from the tm package in R.

I'm trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus). For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte".

If I had to do it right now, I would probably just go with something like:

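library(tm)
library(RWeka)
data("crude")  # sample Reuters corpus shipped with tm
# Build a vocabulary of the distinct words in the corpus
dictionary <- unique(unlist(lapply(crude, words)))
# Use the Lovins stem of the query word as a case-insensitive pattern
grep(pattern = LovinsStemmer("company"),
    x = dictionary, ignore.case = TRUE, value = TRUE)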

I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.

I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.
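For comparison, the stem-then-grep idea above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `toy_stem` is a hypothetical suffix stripper standing in for a real stemmer (it is not Lovins or Porter), and `vocabulary` is a made-up word list rather than a real corpus.

```python
import re


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one common suffix."""
    word = word.lower()
    for suffix in ("ies", "es", "ic", "e", "s", "y"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def variants_by_stem_match(word, vocabulary, stemmer=toy_stem):
    """Return vocabulary entries that contain the stem of `word`, ignoring case."""
    pattern = re.compile(re.escape(stemmer(word)), re.IGNORECASE)
    return sorted({w for w in vocabulary if pattern.search(w)})


# Made-up vocabulary for illustration only.
vocabulary = ["leukocyte", "Leukocytes", "leukocytic", "lymphocyte",
              "company", "companies"]

print(variants_by_stem_match("leukocyte", vocabulary))
```

Like the `grep` call, this scans the whole vocabulary on every query and matches the stem anywhere inside a word, so a short stem can pull in unrelated words.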

Recommended answer

This solution requires preprocessing your corpus, but once that is done, finding a word's variants is a very quick dictionary lookup.

from collections import defaultdict
from stemming.porter2 import stem

# Load the word list, one word per line
with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

# Map each stem to every word in the list that reduces to it
stems = defaultdict(list)

for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    # Look up all words that share a stem with the query word
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])

For the /usr/share/dict/words corpus, this produces the result:

['leukocyte', "leukocyte's", 'leukocytes']

It uses the stemming module, which can be installed with:

pip install stemming
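The same preprocessing idea can be applied to the text of your own corpus instead of a system word list. The following is a minimal sketch, not a definitive implementation: `toy_stem`, `build_stem_index`, `variants`, and `corpus_text` are all hypothetical names introduced here, and `toy_stem` is a crude suffix stripper used only so the example runs without extra packages. A real stemmer, such as `stem` from `stemming.porter2`, could presumably be passed as the `stemmer` argument instead.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one common suffix."""
    word = word.lower()
    for suffix in ("ies", "es", "ic", "e", "s", "y"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(text, stemmer=toy_stem):
    """Preprocess once: map each stem to the set of wordforms seen in `text`."""
    index = defaultdict(set)
    for token in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text):
        token = token.lower()
        index[stemmer(token)].add(token)
    return index


def variants(word, index, stemmer=toy_stem):
    """Single dictionary lookup: all indexed wordforms sharing the word's stem."""
    return sorted(index.get(stemmer(word.lower()), ()))


# Made-up corpus text for illustration only.
corpus_text = ("Leukocytes migrate to tissue. A leukocyte count and "
               "leukocytic infiltration were noted.")

index = build_stem_index(corpus_text)
print(variants("leukocyte", index))
```

Using sets and lowercasing keeps each wordform once per stem, and `index.get` avoids adding empty entries for words that never occur in the corpus. How many variants are grouped together depends entirely on how aggressive the chosen stemmer is.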
