使用动态编程进行分词 [英] Word segmentation using dynamic programming

查看:111
本文介绍了使用动态编程进行分词的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以首先,我对Python还是很陌生,所以如果我做的事情很糟糕,我会在这篇文章的开头加上一个抱歉的词.我被分配了这个问题:

So first off I'm very new to Python so if I'm doing something awful I'm prefacing this post with a sorry. I've been assigned this problem:

我们想设计一个动态编程解决方案来解决以下问题:存在一个字符串字符串,该字符串可能是一个单词序列,所有空格都已删除,并且我们想找到一种方法,如果有的话,可以插入空格以分隔有效的英语单词.例如,您的发明可能来自发泄您",青年活动"或他们发泄".如果输入的是theeaglehaslande,则没有这种方法.您的任务是以两种单独的方式实现动态编程解决方案:

We want to devise a dynamic programming solution to the following problem: there is a string of characters which might have been a sequence of words with all the spaces removed, and we want to find a way, if any, in which to insert spaces that separate valid English words. For example, theyouthevent could be from "the you the vent", "the youth event" or "they out he vent". If the input is theeaglehaslande, then there’s no such way. Your task is to implement a dynamic programming solution in two separate ways:

  • 迭代式自底向上版本
  • 递归存储版本

假定原始单词序列没有其他标点符号(例如句号),没有大写字母,也没有专有名称-所有单词将在提供给您的词典文件中可用.

Assume that the original sequence of words had no other punctuation (such as periods), no capital letters, and no proper names - all the words will be available in a dictionary file that will be provided to you.

所以我有两个主要问题:

So I'm having two main issues:

  1. 我知道这可以并且应该在O(N ^ 2)中完成,而且我不认为我是查找表并未添加看起来似乎所有的单词,从而可以减少时间复杂度
  1. I know that this can and should be done in O(N^2) and I don't think mine is
  2. The lookup table isn't adding all the words it seems such that it can reduce the time complexity

我想要的东西:

  1. 任何形式的输入(更好的方法,您在代码中看到的错误,如何使查找表正常工作,如何使用布尔值表构建有效单词序列)
  2. 关于如何处理递归版本的一些想法,尽管我觉得一旦我能够解决迭代解决方案,我将能够从中设计递归解决方案.

一如既往地感谢任何人付出的时间和/或精力,对此我们深表感谢.

As always thanks for any time and or effort anyone gives this, it is always appreciated.

这是我的尝试:

#dictionary function returns True if word is found in dictionary false otherwise
def dictW(s):
    diction = open("diction10k.txt",'r') 
    for x in diction:
        x = x.strip("\n \r")
        if s == x:
            return True
    return False

def iterativeSplit(s):
    n = len(s)
    i = j = k = 0
    A = [-1] * n
    word = [""] * n
    booly = False
    for i in range(0, n):
        for j in range(0, i+1):
            prefix = s[j:i+1]
            for k in range(0, n):

                if word[k] == prefix:
                    #booly = True
                    A[k] = 1
                    #print "Array below at index k %d and word = %s"%(k,word[k])
                    #print A
            # print prefix, A[i]
            if(((A[i] == -1) or (A[i] == 0))):
                if (dictW(prefix)):
                    A[i] = 1
                    word[i] = prefix
                    #print word[i], i
                else:
                    A[i] = 0
    for i in range(0, n):
        print A[i]

推荐答案

有关如何进行英语分词的另一个真实示例,请查看 .它稍微复杂一点,因为它使用单词和短语频率表,但说明了记忆方法.

For another real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables but it illustrates the memoization approach.

特别是segment说明了记忆方法:

In particular, segment illustrates the memoization approach:

def segment(text):
    "Return a list of words that is the best segmenation of `text`."

    memo = dict()

    def search(text, prev='<s>'):
        if text == '':
            return 0.0, []

        def candidates():
            for prefix, suffix in divide(text):
                prefix_score = log10(score(prefix, prev))

                pair = (suffix, prefix)
                if pair not in memo:
                    memo[pair] = search(suffix, prefix)
                suffix_score, suffix_words = memo[pair]

                yield (prefix_score + suffix_score, [prefix] + suffix_words)

        return max(candidates())

    result_score, result_words = search(clean(text))

    return result_words

如果您替换了score函数,以使其在字典中的单词返回"1",否则返回"0",那么您只需列举出所有获得满分的候选人即可.

If you replaced the score function so that it returned "1" for a word in your dictionary and "0" if not then you would simply enumerate all positively scored candidates for your answer.

这篇关于使用动态编程进行分词的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆