有人知道使用动态编程进行分词的示例算法吗? [英] Anyone know an example algorithm for word segmentation using dynamic programming?

查看:20
本文介绍了有人知道使用动态编程进行分词的示例算法吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如果您在谷歌搜索分词,确实没有很好的描述,我只是想完全理解动态编程算法将字符串分割成单个单词的过程.有谁知道一个可以很好地描述分词问题的地方,或者任何人都可以描述它吗?

If you search google for word segmentation there really are no very good descriptions of it and I'm just trying to fully understand the process a dynamic programming algorithm takes to find a segmentation of a string into individual words. Does anyone know a place where there is a good description of a word segmentation problem or can anyone describe it?

分词基本上只是获取一串字符并决定在哪里将其拆分为单词,如果您不知道并使用动态编程它会考虑一些子问题.使用递归这很简单,但我无法在网上找到任何地方,甚至只是在网上找到一个迭代算法的描述,所以如果有人有任何例子或者可以给出一个很棒的算法.

Word Segmentation is basically just taking a string of characters and deciding where to split it up into words if you didn't know and using dynamic programming it would take into account some amount of subproblems. This is pretty simple to do using recursion but I haven't been able to find anywhere online to find even just a description of an iterative algorithm for this online, so if anyone has any examples or can give an algorithm that would be great.

感谢您的帮助.

推荐答案

Python 词段模块 有这样的算法.它使用递归和记忆来实现动态编程.

The Python wordsegment module has such an algorithm. It uses recursion and memoization to implement dynamic programming.

源代码可在 Github 上找到,以下是相关片段:

The source is available on Github, here's the relevant snippet:

def segment(text):
    "Return a list of words that is the best segmenation of `text`."

    memo = dict()

    def search(text, prev='<s>'):
        if text == '':
            return 0.0, []

        def candidates():
            for prefix, suffix in divide(text):
                prefix_score = log10(score(prefix, prev))

                pair = (suffix, prefix)
                if pair not in memo:
                    memo[pair] = search(suffix, prefix)
                suffix_score, suffix_words = memo[pair]

                yield (prefix_score + suffix_score, [prefix] + suffix_words)

        return max(candidates())

    result_score, result_words = search(clean(text))

    return result_words

注意 memo 如何缓存对 search 的调用,而后者又从 candidates 中选择最大值.

Note how memo caches calls to search which in turn selects the max from candidates.

这篇关于有人知道使用动态编程进行分词的示例算法吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆