How do I program a bigram as a table in Python?

Views: 132 Posted: 2017/5/21 19:27:15 Tags: python list dictionary markov-chains

Question

I'm working on a homework assignment and I'm stuck at this point: I can't work out how to compute bigram frequencies for English text, that is, conditional probabilities, in Python.

That is, the probability of a token given the preceding token is equal to the probability of their bigram (the co-occurrence of the two tokens) divided by the probability of the preceding token.
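As a concrete illustration of that definition, here is a minimal sketch that estimates the conditional probability from raw counts. The toy string and the names `bigram_counts`, `given_counts`, and `conditional_probability` are made up for this example; the probability of the preceding token is estimated by how often it starts a pair.

```python
from collections import Counter

text = "abacab"  # toy input, assumed for illustration only

# Count every adjacent pair of characters: ab, ba, ac, ca, ab
bigram_counts = Counter(zip(text, text[1:]))

# Count how often each character appears as the first element of a pair
given_counts = Counter(first for first, _ in zip(text, text[1:]))

def conditional_probability(char, given):
    """Estimate P(char | given) as count(given, char) / count(given, anything)."""
    return bigram_counts[(given, char)] / given_counts[given]

p_b_given_a = conditional_probability("b", "a")
print(p_b_given_a)
```

Here 'a' starts three pairs and two of them are "ab", so the estimate for 'b' given 'a' is 2/3.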

I have a text with many letters, and I have already calculated the probability of each letter in this text; for example, the letter 'a' accounts for 0.015% of the letters in the text.

The letters are from ^a-zA-Z, and what I want to know is:
How can I build a table whose dimensions are the size of the alphabet (alphabet x alphabet), and how do I calculate the conditional probability for every cell?

It's like:

[[(a|a),(b|a),(c|a),...,(z|a),...(Z|a)]
 [(a|b),(b|b),(c|b),...,(z|b),...(Z|b)]
                    ...       ...
 [(a|Z),(b|Z),(c|Z),...,(z|Z),...(Z|Z)]]

For each cell I need to calculate a probability such as: what is the chance of getting the letter 'a' if the current letter is 'a', and so on.
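The layout described above could be set up along these lines. This is only a sketch: the names `letters`, `index`, and `table` are assumptions for illustration, and the table is left empty because filling it requires the bigram counts.

```python
import string

letters = string.ascii_letters  # 'a'-'z' followed by 'A'-'Z', 52 characters
index = {letter: i for i, letter in enumerate(letters)}

# table[row][col] is meant to hold P(letters[col] | letters[row]),
# so each row corresponds to one "given" letter.
table = [[0.0] * len(letters) for _ in letters]

row = index["a"]  # the "given" letter
col = index["Z"]  # the letter that follows it
print(row, col, len(table), len(table[0]))
```

With this mapping the cell for (Z|a) is `table[index["a"]][index["Z"]]`, which matches the last entry of the first row in the layout above.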

I can't get started. I hope someone can point me in the right direction, and that it's clear what I need to solve.

Solution

Assuming your file has no other punctuation (easy enough to strip out):

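import itertools
import string

# Map each letter to its row/column: a-z -> 0-25, A-Z -> 26-51.
# (ord(c) - ord('a') is negative for uppercase letters, so use a lookup instead.)
index = {letter: i for i, letter in enumerate(string.ascii_letters)}

def pairwise(s):
    a, b = itertools.tee(s)
    next(b, None)  # advance the second iterator by one; tolerate empty input
    return zip(a, b)

counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet
with open('path/to/input') as infile:
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):  # get pairwise characters from the text
        given = index[a]  # index (in `counts`) of the "given" character
        char = index[b]   # index of the character that follows the "given" character
        counts[given][char] += 1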

# now that we have the number of occurrences, let's divide by the totals to get conditional probabilities

totals = [sum(row) for row in counts]
for given in range(52):
    if not totals[given]:
        continue  # this letter never appeared as a "given" character
    for i in range(len(counts[given])):
        counts[given][i] /= totals[given]

I haven't tested this, but it should be a good start.
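For trying the list-based idea without a file, the same steps can be wrapped in a function that takes a string. This is a sketch under some assumptions: the names `LETTERS`, `INDEX`, and `bigram_table` are hypothetical, non-letters are simply dropped (so pairs are formed across word boundaries), and rows for letters that never occur are left as zeros.

```python
import string

LETTERS = string.ascii_letters  # a-z then A-Z
INDEX = {letter: i for i, letter in enumerate(LETTERS)}

def bigram_table(text):
    """Return a 52x52 table where table[g][c] estimates P(LETTERS[c] | LETTERS[g])."""
    # Keep only a-z and A-Z; spaces, digits, and punctuation are dropped.
    chars = [ch for ch in text if ch in INDEX]
    table = [[0.0] * len(LETTERS) for _ in LETTERS]
    for given, char in zip(chars, chars[1:]):
        table[INDEX[given]][INDEX[char]] += 1
    for row in table:
        total = sum(row)
        if total:  # skip all-zero rows to avoid dividing by zero
            row[:] = [count / total for count in row]
    return table

table = bigram_table("Abba, abba!")
print(table[INDEX["b"]][INDEX["a"]])
```

In this toy input 'b' is followed by 'b' twice and by 'a' twice, so the entry for 'a' given 'b' comes out as 0.5.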

Here's a dictionary version, which should be easier to read and debug:

counts = {}
with open('path/to/input') as infile:
    # key the dictionaries by the characters themselves, so lookups use letters directly
    for given, char in pairwise(char for line in infile for word in line.split() for char in word):
        if given not in counts:
            counts[given] = {}
        if char not in counts[given]:
            counts[given][char] = 0
        counts[given][char] += 1

answer = {}
for given, chardict in counts.items():  # iterate over the counts, not the (still empty) answer
    total = sum(chardict.values())
    answer[given] = {}
    for char, count in chardict.items():
        answer[given][char] = count / total

Now answer contains the probabilities you are after. If you want the probability of 'a' given 'b', look at answer['b']['a'].
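A more compact variant of the dictionary approach is sketched below, keyed directly by character so that lookups of the form answer[given][char] work on letters. The function name `bigram_probabilities` and the sample string are assumptions for illustration; it takes a string rather than a file so it can be tried quickly.

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Map each 'given' character to a dict of {next_char: probability}."""
    counts = defaultdict(Counter)
    for given, char in zip(text, text[1:]):
        counts[given][char] += 1
    return {
        given: {char: n / sum(following.values()) for char, n in following.items()}
        for given, following in counts.items()
    }

answer = bigram_probabilities("abracadabra")
print(answer["a"]["b"])  # probability of 'b' given 'a'
print(answer["b"]["r"])  # probability of 'r' given 'b'
```

Pairs that never occur are simply absent from the nested dictionaries, so `answer["a"].get("z", 0.0)` is a safer lookup than indexing when a pair might be missing.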
