Is it possible to get a natural word after it has been stemmed?


Question

I have the word "play", which after stemming has become "plai". Now I want to get "play" back. Is that possible? I used Porter's stemmer.

Recommended answer

A stemmer can also process artificial words that do not exist in the language. Would you like those to be returned as elements of the set of all possible words? How would you know that a word doesn't exist and shouldn't be returned?

As an option: find a dictionary of all words and their forms. Compute the stem of each of them. Save this projection as a map: (stem, list of all word forms). You will then be able to look up the list of all word forms for a given stem.
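The dictionary approach above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `toy_stem` is a hypothetical stand-in that handles only a few suffixes, and `build_stem_index` is a name made up for this example. In practice you would pass in a real Porter stemmer function and a real word list.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real Porter stemmer (assumption:
    it only strips -ing/-ed/-s and turns a final y into i)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: len(word) - len(suffix)]
            break
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

def build_stem_index(words, stem=toy_stem):
    """Map each stem to the sorted list of dictionary forms producing it."""
    index = defaultdict(set)
    for w in words:
        index[stem(w)].add(w.lower())
    return {s: sorted(forms) for s, forms in index.items()}

dictionary = ["play", "plays", "played", "playing", "probate"]
index = build_stem_index(dictionary)
print(index["plai"])
```

Because the index contains only words taken from the dictionary, a lookup never returns non-existent forms, which is the main advantage over reversing the stemmer rules.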

UPD: If you need all possible words, including non-existent ones, then I can offer the following algorithm (it's not checked, just a suggestion):

Start from the Porter stemming algorithm. We need a reversed version of it.

If a rule in the straight algorithm has the form (m>1) E -> (delete last E), then the reversed rule is a "fork with E": we have to try both alternatives. E.g., where the straight algorithm gives probate -> probat, the reversed one gives two alternatives: probat -> { probat, probate }. Each alternative is then processed further on its own. Note that this is a set of alternatives, so only distinct words are processed. Such a rule has the form A -> { , B, C }, which means "replace the ending A in three alternative ways: leave it as-is, replace it with B, or replace it with C".
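A single fork rule of the form A -> { , B, C } could be sketched like this. The function name `apply_fork` is an assumption made for illustration, and the sketch deliberately ignores the measure conditions such as (m>1), so it produces more candidates than a faithful reversal would.

```python
def apply_fork(words, ending, replacements):
    """Apply one reversed rule `ending -> { , r1, r2, ... }` to a set of words.

    Every input word is kept as-is (the empty first alternative); words that
    end with `ending` additionally yield one variant per replacement.
    Measure conditions like (m>1) are NOT checked in this sketch.
    """
    result = set(words)
    for word in words:
        if word.endswith(ending):
            base = word[: len(word) - len(ending)]
            for replacement in replacements:
                result.add(base + replacement)
    return result

# "Fork with E": the empty ending matches every word, "e" is appended.
print(sorted(apply_fork({"probat"}, "", ["e"])))
# Replacing an ending: *ATE -> { , ATIONAL }
print(sorted(apply_fork({"relate"}, "ate", ["ational"])))
```

Using a set for both input and output means duplicates produced by different rules collapse automatically, so each distinct word is processed only once.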

Step 5b: (m>1) *L -> { , +L } // Add L if there's L at the end.
Step 5a: (m>1) -> { , +E }
         (m=1 and not *o) -> { , +E } // *o is a special condition, it's not *O.
Step 4: (m>1) *S or *T -> { , +ION }
        (m>1) -> { , +AL, +ANCE, +ENCE, ..., +IVE, +IZE }
Step 3: (m>0) *AL -> { , +IZE }
        (m>0) *IC -> { , +ATE, +ITI, +AL }
        (m>0) -> { , +ATIVE, +FUL, +NESS }
Step 2: (m>0) *ATE -> { , ATIONAL } // Replace ATE.
        (m>0) *TION -> { , +AL } // Add AL at the end.
        (m>0) *ENCE -> { , ENCI } // Replace ENCE.
        ...
        (m>0) *BLE -> { , BILITI } // Replace BLE.
Step 1c: (*v*) *I -> { , Y } // Replace I.
Step 1b: (m=1 and *oE) -> { , +D, delete last E and add ING } // *o is a special condition.
         (*v*c and not (*L or *S or *Z)) -> { , add last consonant +ED, add last consonant + ING }
         *IZE -> { , IZING, +D }
         (*v*BLE) -> { , +D, delete last E and add ING }
         *ATE -> { , ATING, +D }
         (*v*) -> { , +ED, +ING }
         (m>0) *EE -> { , +D }
Step 1a: *I -> { , +ES }
         *SS -> { , +ES }
         not *S -> { , +S }

Within each step, the straight algorithm has to pick a single rule, the one with the longest matching suffix. The reversed algorithm should apply all the rules.
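To make the "longest matching suffix" behaviour concrete, here is a small sketch of the straight Step 1a only (SSES -> SS, IES -> I, SS -> SS, S -> nothing). The names `STEP_1A_RULES` and `forward_step_1a` are chosen for this example; the other steps and their conditions are left out.

```python
# Straight Step 1a rules as (suffix, replacement) pairs.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def forward_step_1a(word):
    """Apply only the rule whose suffix is the longest one matching `word`."""
    matches = [(suffix, repl) for suffix, repl in STEP_1A_RULES
               if word.endswith(suffix)]
    if not matches:
        return word  # no rule applies, the word passes through unchanged
    suffix, repl = max(matches, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl

for w in ("caresses", "ponies", "caress", "cats", "playing"):
    print(w, "->", forward_step_1a(w))
```

The reversed algorithm has no such choice to make: since any of the rules could have produced the stem, it has to fork on every rule that matches.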

Example (straight):

Input: PLAYING
Step 1a doesn't match.
PLAYING -> PLAY (Step 1b)
PLAY -> PLAI (Step 1c)
m=0, so the steps 2-5 don't match.
Result: PLAI

Reversed:

Input: PLAI
m=0, so the steps 2-5 are skipped
Step 1c:
PLAI -> { PLAI, PLAY }
Step 1b:
PLAI -> { PLAI, PLAIED, PLAIING }
PLAY -> { PLAY, PLAYED, PLAYING }
Resulting set: { PLAI, PLAIED, PLAIING, PLAY, PLAYED, PLAYING }
Step 1a:
PLAI -> { PLAI, PLAIS, PLAIES }
PLAIED -> { PLAIED, PLAIEDS }
PLAIING -> { PLAIING, PLAIINGS }
PLAY -> { PLAY, PLAYS }
PLAYED -> { PLAYED, PLAYEDS }
PLAYING -> { PLAYING, PLAYINGS }
Resulting set: { PLAI, PLAIS, PLAIES, PLAIED, PLAIEDS, PLAIING, PLAIINGS, PLAY, PLAYS, PLAYED, PLAYEDS, PLAYING, PLAYINGS }
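The reversed walk-through above can be reproduced with a short sketch. It implements only the rules this example actually uses (the Step 1c fork, the plain (*v*) fork of Step 1b, and Step 1a), skips steps 2-5, and uses a simplified vowel test that treats only a, e, i, o, u as vowels. All function names are made up for illustration.

```python
VOWELS = set("aeiou")

def has_vowel(word):
    # Simplified (*v*) test: assumption, y is never counted as a vowel here.
    return any(ch in VOWELS for ch in word)

def reverse_step_1c(words):
    out = set(words)
    for w in words:
        if w.endswith("i") and has_vowel(w[:-1]):
            out.add(w[:-1] + "y")          # (*v*) *I -> { , Y }
    return out

def reverse_step_1b(words):
    out = set(words)
    for w in words:
        if has_vowel(w):
            out.update({w + "ed", w + "ing"})  # (*v*) -> { , +ED, +ING }
    return out

def reverse_step_1a(words):
    out = set(words)
    for w in words:
        if w.endswith("i") or w.endswith("ss"):
            out.add(w + "es")              # *I -> { , +ES }, *SS -> { , +ES }
        if not w.endswith("s"):
            out.add(w + "s")               # not *S -> { , +S }
    return out

def unstem(stem):
    """Fork through the reversed steps 1c, 1b, 1a in that order."""
    words = {stem.lower()}
    for step in (reverse_step_1c, reverse_step_1b, reverse_step_1a):
        words = step(words)
    return words

candidates = unstem("plai")
print(sorted(candidates))
```

Most of the candidates are not real words, so in practice the result would still need to be filtered against a dictionary.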

I've checked all those words at Michael Tontchev's link. The result for each of them is "plai" (note that the site doesn't accept upper-case input).
