Stemming algorithm that produces real words


Problem description

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straight forward. However I need some help now stemming the resulting word list to avoid duplicates. Example: Community / Communities

I've used an implementation of Porter Stemmer algorithm (I'm writing in PHP by the way):

http://tartarus.org/~martin/PorterStemmer/php.txt

This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".

I've tried "Snowball" (suggested within another Stack Overflow thread).

http://snowball.tartarus.org/demo.php

For my example (community / communities), Snowball stems to "communiti".

Question

Are there any other stemming algorithms that will do this? Has anyone else solved this problem?

My current thinking is that I could use a stemming algorithm to spot the duplicates, and then pick the shortest word encountered as the actual word to display.
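That idea can be sketched in a few lines. The snippet below is a Python illustration (the question itself is about PHP); `toy_stem`, a naive suffix-stripper, is a hypothetical stand-in for a real stemmer such as Porter:

```python
from collections import defaultdict

def shortest_word_per_stem(words, stem):
    """Group words by their stem and keep the shortest surface form
    as the tag to display (the asker's proposed heuristic)."""
    groups = defaultdict(list)
    for w in words:
        groups[stem(w)].append(w)
    return {s: min(ws, key=len) for s, ws in groups.items()}

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strip the first
    # matching suffix from a short, hand-picked list.
    for suffix in ("ities", "ity", "ies", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

tags = shortest_word_per_stem(["community", "communities", "tag", "tags"], toy_stem)
# "community" and "communities" collapse to one stem; the shorter
# surface form, "community", is kept as the displayed tag.
```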

Recommended answer

The core issue here is that stemming algorithms operate on a phonetic basis purely based on the language's spelling rules with no actual understanding of the language they're working with. To produce real words, you'll probably have to merge the stemmer's output with some form of lookup function to convert the stems back to real words. I can basically see two potential ways to do this:

  1. Find or create a large dictionary which maps each possible stem back to an actual word. (e.g., communiti -> community)
  2. Create a function which compares each stem against the list of words that were reduced to that stem and attempts to determine which is most similar. (e.g., comparing "communiti" against "community" and "communities" in such a way that "community" would be recognized as the more similar option)
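Option #2 can be sketched with a generic string-similarity measure. The Python snippet below uses `difflib.SequenceMatcher` purely as one plausible metric; nothing in the answer prescribes it, and as the note after the code shows, the choice of metric matters:

```python
import difflib

def closest_to_stem(stem, candidates):
    """Approach #2: compare the stem against each word that reduced to it
    and return the candidate with the highest similarity ratio."""
    return max(
        candidates,
        key=lambda w: difflib.SequenceMatcher(None, stem, w).ratio(),
    )

best = closest_to_stem("communiti", ["community", "communities"])
```

Note that this naive ratio actually scores "communities" slightly above "community" (the stem is a full prefix of the longer word), so a real implementation would need a metric that penalizes extra trailing characters before "community" wins as the answer's example intends.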

Personally, I think the way I would do it would be a dynamic form of #1, building up a custom dictionary database by recording every word examined along with what it stemmed to and then assuming that the most common word is the one that should be used. (e.g., If my body of source text uses "communities" more often than "community", then map communiti -> communities.) A dictionary-based approach will be more accurate in general and building it based on the stemmer input will provide results customized to your texts, with the primary drawback being the space required, which is generally not an issue these days.
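The dynamic dictionary described above reduces to counting surface forms per stem and keeping the most common one. A minimal Python sketch (again, `toy_stem` is a hypothetical stand-in for a real Porter stemmer):

```python
from collections import Counter, defaultdict

def build_stem_dictionary(words, stem):
    """Record every word alongside its stem, then map each stem to the
    most frequent surface form seen in the source text."""
    counts = defaultdict(Counter)
    for w in words:
        counts[stem(w)][w] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer.
    for suffix in ("ities", "ity", "ies", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# "communities" appears more often than "community", so the shared
# stem maps back to "communities", as in the answer's example.
text = ["communities", "communities", "community", "tags"]
mapping = build_stem_dictionary(text, toy_stem)
```

Persisting `counts` in a database, as the answer suggests, lets the mapping improve as more text is processed.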
