需要帮助来了解此Python Viterbi算法 [英] Need help understanding this Python Viterbi algorithm

查看:130
本文介绍了需要帮助来了解此Python Viterbi算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将此堆栈溢出答案中找到的Viterbi算法的Python实现转换为Ruby.完整脚本可以在问题的底部找到我的评论.

I'm trying to convert a Python implementation of the Viterbi algorithm found in this Stack Overflow answer into Ruby. The full script can be found at the bottom of this question with my comments.

不幸的是,我对Python知之甚少,因此翻译的难度比我想的要大.尽管如此,我还是取得了一些进展.现在,唯一使我的大脑完全融化的线是:

Unfortunately I know very little about Python so the translation is is proving more difficult than I'd like. Still, I have made some progress. Right now, the only line which is totally melting my brain is this one:

prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))

有人可以解释一下它在做什么吗?

Could someone please explain what it is doing?

这是完整的Python脚本:

Here is the full Python script:

import re
from itertools import groupby

# text will be a compound word such as 'wickedweather'.
def viterbi_segment(text):
  probs, lasts = [1.0], [0]

  # Iterate over the letters in the compound.
  # eg. [w, ickedweather], [wi, ckedweather], and so on.
  for i in range(1, len(text) + 1):
    # I've no idea what this line is doing and I can't figure out how to split it up?
    prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))
    # Append values to arrays.
    probs.append(prob_k)
    lasts.append(k)

  words = []
  i = len(text)
  while 0 < i:
    words.append(text[lasts[i]:i])
    i = lasts[i]
  words.reverse()
  return words, probs[-1]

# Calc the probability of a word based on occurrences in the dictionary.
def word_prob(word):
  # dictionary.get(key) will return the value for the specified key.
  # In this case, thats the number of occurances of thw word in the
  # dictionary. The second argument is a default value to return if
  # the word is not found.
  return dictionary.get(word, 0) / total

# This ensures we ony deal with full words rather than each
# individual letter. Normalize the words basically.
def words(text):
  return re.findall('[a-z]+', text.lower())

# This gets us a hash where the keys are words and the values are the
# number of ocurrances in the dictionary.
dictionary = dict((w, len(list(ws)))
  # /usr/share/dixt/words is a file of newline delimitated words.
  for w, ws in groupby(sorted(words(open('/usr/share/dict/words').read()))))

# Assign the length of the longest word in the dictionary.
max_word_length = max(map(len, dictionary))

# Assign the total number of words in the dictionary. It's a float
# because we're going to divide by it later on.
total = float(sum(dictionary.values()))

# Run the algo over a file of newline delimited compound words.
compounds = words(open('compounds.txt').read())
for comp in compounds:
  print comp, ": ", viterbi_segment(comp)

推荐答案

您正在查看的扩展版本看起来像这样:

The expanded version looks something like this:

all_probs = []

for j in range(max(0, i - max_word_length), i):
    all_probs.append((probs[j] * word_prob(text[j:i]), j))

prob_k, k = max(all_probs)

我希望这有助于解释.如果没有,请随时编辑您的问题并指向您不理解的陈述.

I hope that helps explain it. If it doesn't, feel free to edit your question and point to statements that you don't understand.

这篇关于需要帮助来了解此Python Viterbi算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆