What are the major differences and benefits of Porter and Lancaster Stemming algorithms?


Question


I'm working on document classification tasks in Java.

Both algorithms came highly recommended. What are the benefits and disadvantages of each, and which is more commonly used in the literature for natural language processing tasks?

Answer

At its most basic, the major difference between the Porter and Lancaster stemming algorithms is that the Lancaster stemmer is significantly more aggressive than the Porter stemmer. The three major stemming algorithms in use today are Porter, Snowball (Porter2), and Lancaster (Paice-Husk), and they get more aggressive in roughly that order, with Porter the least aggressive. The specifics of each algorithm are fairly lengthy and technical, but here is a breakdown:

Porter: Without a doubt the most commonly used stemmer, and also one of the gentlest. It is one of the few stemmers that actually has Java support, which is a plus, though it is also the most computationally intensive of the three (granted, not by a very significant margin). It is also the oldest of the three, dating from 1980.

Porter2: Nearly universally regarded as an improvement over Porter, and for good reason. Porter himself in fact admits that it is better than his original algorithm. It has a slightly faster computation time than Porter and a fairly large community around it.

Lancaster: A very aggressive stemming algorithm, sometimes to a fault. With Porter and Snowball, the stemmed representations are usually fairly intuitive to a reader; not so with Lancaster, as many shorter words become totally obfuscated. It is the fastest algorithm here and will hugely reduce your working set of words, but if you want more distinction between words, it is not the tool you want.
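
To make the "aggressiveness" trade-off concrete, here is a toy sketch in plain Java. To be clear, this is not the Porter, Snowball, or Lancaster algorithm: both suffix lists, the `strip` helper, and the `minStem` parameter are made up purely for illustration. It only shows the general effect that a larger rule set collapses more words into one stem, which shrinks the vocabulary but can merge words you may want to keep apart.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class AggressivenessDemo {
    // Hypothetical "gentle" rule set: only a few inflectional endings.
    static final List<String> GENTLE_SUFFIXES = Arrays.asList("ing", "ed", "s");

    // Hypothetical "aggressive" rule set: also strips derivational endings.
    // Longer suffixes come first so they win over shorter ones.
    static final List<String> AGGRESSIVE_SUFFIXES = Arrays.asList(
            "izations", "ization", "izers", "izer", "ing", "ed", "s");

    // Strip the first matching suffix, keeping at least minStem characters.
    static String strip(String word, List<String> suffixes, int minStem) {
        for (String suffix : suffixes) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= minStem) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // The set of distinct stems that remains after stemming every word.
    static Set<String> vocabulary(List<String> words, List<String> suffixes, int minStem) {
        Set<String> stems = new TreeSet<>();
        for (String word : words) {
            stems.add(strip(word.toLowerCase(), suffixes, minStem));
        }
        return stems;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "organ", "organs", "organizer", "organizers",
                "organization", "organizations");
        // prints [organ, organization, organizer]
        System.out.println(vocabulary(words, GENTLE_SUFFIXES, 3));
        // prints [organ]
        System.out.println(vocabulary(words, AGGRESSIVE_SUFFIXES, 3));
    }
}
```

With the gentle rules, three distinct stems survive; with the aggressive rules everything becomes "organ", so a classifier could no longer tell a document about organs from one about organizations. Real stemmers are far more careful than this, but the same tension applies.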

Honestly, I feel that Snowball is usually the way to go. There are certain circumstances in which Lancaster will hugely trim down your working set, which can be very useful; however, in my opinion its marginal speed increase over Snowball is not worth the lack of precision. Porter has the most implementations, though, and so is usually the default go-to algorithm, but if you can, use Snowball.
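
Since the choice of stemmer may change later, one option for a Java classification pipeline is to pass the stemmer in as a function so it can be swapped without touching the feature code. The following is only a minimal sketch under that assumption: `BagOfWords`, `termCounts`, and the placeholder stemmer are hypothetical names, and the placeholder merely strips a trailing "s". In practice you would plug in whichever Porter, Snowball, or Lancaster implementation your library provides.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

public class BagOfWords {
    // Counts stemmed terms in one document. The stemmer is a parameter,
    // so switching algorithms does not change this method.
    static Map<String, Integer> termCounts(String text, UnaryOperator<String> stemmer) {
        Map<String, Integer> counts = new TreeMap<>();
        // Naive tokenizer: lowercase, then split on anything that is not a-z.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(stemmer.apply(token), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder stemmer (an assumption for this sketch, not a real algorithm).
        UnaryOperator<String> placeholder =
                w -> (w.endsWith("s") && w.length() > 3) ? w.substring(0, w.length() - 1) : w;

        // prints {cat=3, chase=1, sleep=1, the=1}
        System.out.println(termCounts("Cats chase cats; the cat sleeps.", placeholder));
        // prints {cat=1, cats=2, chase=1, sleeps=1, the=1}
        System.out.println(termCounts("Cats chase cats; the cat sleeps.", UnaryOperator.identity()));
    }
}
```

Keeping the stemmer behind a plain function also makes it easy to run the same classifier with each stemmer and compare accuracy on your own data, which is probably the most reliable way to settle the question for a specific task.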
