Porter Stemmer algorithm not returning the expected output when modified into a def


Problem description

I'm using the Python port of the PorterStemmer.

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.

I'm using it for the following task:

The other thing you need to do is reduce each word to its stem. For example, the words sing, sings, singing all have the same stem, which is sing. There is a reasonably accepted way to do this, which is called Porter's algorithm. You can download something that performs it from http://tartarus.org/martin/PorterStemmer/.

I've modified this code:

if __name__ == '__main__':
    p = PorterStemmer()
    if len(sys.argv) > 1:
        for f in sys.argv[1:]:
            infile = open(f, 'r')
            while 1:
                output = ''
                word = ''
                line = infile.readline()
                if line == '':
                    break
                for c in line:
                    if c.isalpha():
                        word += c.lower()
                    else:
                        if word:
                            output += p.stem(word, 0,len(word)-1)
                            word = ''
                        output += c.lower()
                print output,
            infile.close()

My version reads from a preprocessed string instead of a file and returns the output:

def algorithm(input):
    p = PorterStemmer()
    while 1:
        output = ''
        word = ''
        if input == '':
            break
        for c in input:
            if c.isalpha():
                word += c.lower()
            else:
                if word:
                    output += p.stem(word, 0,len(word)-1)
                    word = ''
                output += c.lower()
        return output

Note that if I move return output to the same indentation level as while 1:, it turns into an infinite loop.

Usage (example)

import PorterStemmer as ps
ps.algorithm("Michael is Singing");

Output

Michael is

Expected output

Michael is Sing

What am I doing wrong?

Recommended answer

It looks like the culprit is that the final word of the input is never written to output (try "Michael is Singing stuff", for example: everything is written correctly except 'stuff', which is omitted). A word is only stemmed and appended when a non-letter character follows it, so a word at the very end of the string is left sitting in word. There is likely a more elegant way to handle this, but one thing you could try is adding an else clause to the for loop. A for loop's else block runs once the loop finishes without hitting break, so we can use it to make sure the final word gets added:

def algorithm(input):
    p = PorterStemmer()
    output = ''
    word = ''
    for c in input:
        if c.isalpha():
            word += c.lower()
        else:
            if word:
                # A non-letter ends the current word: stem it and emit it.
                output += p.stem(word, 0, len(word)-1)
                word = ''
            # Always keep the separator, even when no word is pending.
            output += c.lower()
    else:
        # Runs after the loop finishes: flush the final word, if any.
        if word:
            output += p.stem(word, 0, len(word)-1)
    return output

This has been extensively tested with two test cases, so clearly it is bulletproof :) There are probably some edge cases lurking in there, but hopefully it will get you started.
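As for the "more elegant way" mentioned above, one possibility is to let a regular expression find the words, so that there is no last word left to flush by hand. The following is only a rough sketch under a few assumptions: the input is ASCII text, and toy_stem is a hypothetical stand-in (not the Porter algorithm) used only so that the example runs without the PorterStemmer module. The names stem_text and toy_stem are made up for illustration.

```python
import re


def toy_stem(word):
    # Hypothetical stand-in, NOT the Porter algorithm: it strips a trailing
    # "ing" or "s" from longer words so the example runs on its own.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def stem_text(text, stem=toy_stem):
    # Lowercase the text, then replace every run of ASCII letters with its
    # stem. Spaces, punctuation and digits between words are left in place,
    # and a word at the very end of the string is handled like any other.
    return re.sub(r"[a-z]+", lambda m: stem(m.group(0)), text.lower())


print(stem_text("Michael is Singing"))
```

To use the real stemmer from the question instead of the stand-in, something like stem_text(s, lambda w: p.stem(w, 0, len(w) - 1)) should work, where p is a PorterStemmer instance.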
