How do I program bigram as a table in python?

Problem description

I'm doing this homework and am stuck at this point: I can't work out how to program bigram frequencies for English text, i.e. conditional probabilities, in Python.

That is, the probability of a token given the preceding token is equal to the probability of their bigram (the co-occurrence of the two tokens) divided by the probability of the preceding token.
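
Written out as a formula, the sentence above reads as follows. This is only a restatement: a is the preceding token, b is the token that follows it, and count(.) is a notation introduced here for the number of occurrences in the text. The count-based ratio is the usual way to estimate the probabilities from data.

```latex
P(b \mid a) = \frac{P(a, b)}{P(a)} \approx \frac{\operatorname{count}(ab)}{\operatorname{count}(a)}
```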

I have a text with many letters, and I have already calculated the probability of each letter in this text; for example, the letter 'a' makes up 0.015% of the letters in the text.

The letters are from ^a-zA-Z, and what I want is:
How can I make a table whose dimensions are the size of the alphabet ((alphabet)x(alphabet)), and how do I calculate the conditional probability for every entry in it?

It's like:

[[(a|a),(b|a),(c|a),...,(z|a),...(Z|a)]
 [(a|b),(b|b),(c|b),...,(z|b),...(Z|b)]
                    ...       ...
 [(a|Z),(b|Z),(c|Z),...,(z|Z),...(Z|Z)]]

For this I need to calculate probabilities such as: what is the chance of getting the letter 'a' if the letter you have at this point is 'a', and so on.

I can't get started. I hope you can kickstart me, and I hope it's clear what I need to solve.

Solution

Assuming your file has no other punctuation (easy enough to strip out):

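import itertools
import string

def pairwise(s):
    # yield consecutive pairs: (s0, s1), (s1, s2), ...
    a,b = itertools.tee(s)
    next(b, None)  # advance the second iterator; the default avoids StopIteration on empty input
    return zip(a,b)

# map each letter to its row/column: 'a'-'z' -> 0-25, 'A'-'Z' -> 26-51
# (ord(c) - ord('a') alone would give negative indices for uppercase letters)
index = {c: i for i, c in enumerate(string.ascii_letters)}

counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet
with open('path/to/input') as infile:
    for a,b in pairwise(char for line in infile for word in line.split() for char in word):  # get pairwise characters from the text (pairs also span word boundaries)
        given = index[a]  # index (in `counts`) of the "given" character
        char = index[b]   # index of the character that follows the "given" character
        counts[given][char] += 1

# now that we have the number of occurrences, let's divide by the totals to get conditional probabilities

totals = [sum(row) for row in counts]
for given in range(52):
    if not totals[given]:
        continue  # this character never preceded another one; leave its row as zeros
    for i in range(len(counts[given])):
        counts[given][i] /= totals[given]  # true division (Python 3)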

I haven't tested this, but it should be a good start
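
Since the snippet above is untested, here is a minimal self-contained sketch of the same table idea that works on an in-memory string instead of a file, so it can be tried directly. The function name `bigram_table`, the `LETTERS`/`INDEX` helpers and the sample text are hypothetical names chosen for illustration; the sketch assumes Python 3 and skips any character outside a-zA-Z.

```python
import string

LETTERS = string.ascii_letters                    # 'a'-'z' then 'A'-'Z' (52 characters)
INDEX = {ch: i for i, ch in enumerate(LETTERS)}   # letter -> row/column number

def bigram_table(text):
    """Return a 52x52 table where table[i][j] estimates P(LETTERS[j] | LETTERS[i])."""
    chars = [ch for ch in text if ch in INDEX]    # keep letters only
    n = len(LETTERS)
    table = [[0.0] * n for _ in range(n)]
    for a, b in zip(chars, chars[1:]):            # consecutive letter pairs
        table[INDEX[a]][INDEX[b]] += 1
    for row in table:
        total = sum(row)
        if total:                                 # rows with no observations stay all zero
            for j in range(n):
                row[j] /= total
    return table

table = bigram_table("abab ac")
print(table[INDEX['a']][INDEX['b']])              # estimate of P('b' | 'a')
```

In the sample text, 'a' is followed by 'b' twice and by 'c' once, so the printed estimate is 2/3. Each row that has at least one observation sums to 1.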

Here's a dictionary version, which should be easier to read and debug:

counts = {}
with open('path/to/input') as infile:
    for a,b in pairwise(char for line in infile for word in line.split() for char in word):
        # use the characters themselves as keys, so the result can be looked up by letter
        given = a
        char = b
        if given not in counts:
            counts[given] = {}
        if char not in counts[given]:
            counts[given][char] = 0
        counts[given][char] += 1

answer = {}
for given, chardict in counts.items():  # iterate over the counts, not the (still empty) answer
    total = sum(chardict.values())
    answer[given] = {}
    for char, count in chardict.items():
        answer[given][char] = count/total

Now, answer contains the probabilities you are after. If you want the probability of 'a', given 'b', look at answer['b']['a']
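
As a rough illustration of that lookup, the sketch below builds the same kind of nested dictionary from an in-memory string using `collections.Counter` and `collections.defaultdict`. The function name `bigram_probabilities` and the sample text are made up for this example, and swapping in `Counter`/`defaultdict` is a stylistic choice rather than part of the answer above; characters outside a-zA-Z are skipped.

```python
import string
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Map each letter to {next_letter: estimated P(next_letter | letter)}."""
    letters = [ch for ch in text if ch in string.ascii_letters]
    counts = defaultdict(Counter)                 # counts[given][char] -> occurrences
    for given, char in zip(letters, letters[1:]):
        counts[given][char] += 1
    return {
        given: {char: n / sum(followers.values()) for char, n in followers.items()}
        for given, followers in counts.items()
    }

answer = bigram_probabilities("abab ac")
print(answer['b']['a'])                           # P('a' | 'b')
print(answer.get('c', {}).get('a', 0.0))          # pairs never seen are absent, so fall back to 0.0
```

Unlike the fixed 52x52 table, this dictionary only stores pairs that actually occur, so lookups for unseen pairs need a default such as the `.get(..., 0.0)` shown here.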
