How do I program bigram as a table in Python?
Problem description
I'm doing this homework and I'm stuck at this point: I can't work out how to program bigram frequencies for English text, i.e. the conditional probabilities, in Python.
That is, the probability of a token given the preceding token equals the probability of their bigram (the co-occurrence of the two tokens) divided by the probability of the preceding token: P(b | a) = P(a, b) / P(a).
I have a text with many letters, and I have already calculated the probability of each letter in it; for example, the letter 'a' accounts for 0.015% of the letters in the text.
The letters are from ^a-zA-Z, and what I want is:
How can I make a table whose dimensions are the size of the alphabet, (alphabet) x (alphabet), and how do I calculate the conditional probability for every cell?
It would look like this:
[[(a|a),(b|a),(c|a),...,(z|a),...(Z|a)]
[(a|b),(b|b),(c|b),...,(z|b),...(Z|b)]
... ...
[(a|Z),(b|Z),(c|Z),...,(z|Z),...(Z|Z)]]
For each cell I need to calculate a probability such as: what is the chance of getting the letter 'a' next if the current letter is 'a', and so on.
I can't get started. I hope you can kickstart me, and that it's clear what I need to solve.
Assuming your file has no other punctuation (easy enough to strip out):
    import itertools
    import string

    # map each letter to a row/column index: a-z -> 0-25, A-Z -> 26-51
    # (ord(c) - ord('a') would give negative indices for uppercase letters)
    index = {letter: i for i, letter in enumerate(string.ascii_letters)}

    def pairwise(s):
        a, b = itertools.tee(s)
        next(b, None)  # advance the second iterator by one element
        return zip(a, b)

    counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet
    with open('path/to/input') as infile:
        # get pairwise characters from the text
        for a, b in pairwise(char for line in infile for word in line.split() for char in word):
            given = index[a]  # index (in `counts`) of the "given" character
            char = index[b]   # index of the character that follows the "given" character
            counts[given][char] += 1

    # now that we have the number of occurrences, let's divide by the totals to get conditional probabilities
    totals = [sum(row) for row in counts]
    for given in range(52):
        if not totals[given]:
            continue  # this letter never appeared as a "given" character
        for i in range(len(counts[given])):
            counts[given][i] /= totals[given]
I haven't tested this, but it should be a good start.
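To make the table layout concrete, here is a minimal sketch of the same idea that works on an in-memory string instead of a file. The names `LETTERS`, `INDEX` and `bigram_table` are made up for this illustration, and it assumes every character outside a-zA-Z should simply be dropped, so pairs can span word boundaries, as in the code above.

```python
import string

LETTERS = string.ascii_letters  # 'a'..'z' followed by 'A'..'Z', 52 symbols
INDEX = {letter: i for i, letter in enumerate(LETTERS)}

def bigram_table(text):
    """Return a 52x52 list where table[i][j] estimates P(LETTERS[j] | LETTERS[i])."""
    chars = [c for c in text if c in INDEX]  # keep only a-zA-Z
    table = [[0.0] * len(LETTERS) for _ in LETTERS]
    # count how often each character follows each "given" character
    for given, char in zip(chars, chars[1:]):
        table[INDEX[given]][INDEX[char]] += 1
    # turn every row of counts into a row of conditional probabilities
    for row in table:
        total = sum(row)
        if total:  # rows for letters that never occur as "given" stay all zero
            row[:] = [count / total for count in row]
    return table

table = bigram_table("abab ac")
print(table[INDEX['a']][INDEX['b']])  # estimate of P('b' | 'a')
```

In "abab ac" the letter 'a' is followed by 'b' twice and by 'c' once, so the printed value is two thirds. Each row belongs to one "given" letter and sums to 1 whenever that letter has at least one successor.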
Here's a dictionary version, which should be easier to read and debug:
    counts = {}
    with open('path/to/input') as infile:
        # key the dictionaries by the characters themselves (uses pairwise() from above)
        for given, char in pairwise(char for line in infile for word in line.split() for char in word):
            if given not in counts:
                counts[given] = {}
            if char not in counts[given]:
                counts[given][char] = 0
            counts[given][char] += 1

    answer = {}
    for given, chardict in counts.items():  # iterate over the counts, not the empty answer
        total = sum(chardict.values())
        answer[given] = {}
        for char, count in chardict.items():
            answer[given][char] = count / total
Now answer contains the probabilities you are after. If you want the probability of 'a' given 'b', look at answer['b']['a'].
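As a rough, self-contained variant of the dictionary approach, the sketch below uses `collections.Counter` and `collections.defaultdict` so the "is this key present yet" checks disappear. The function name `conditional_probabilities` and the sample word are invented for the example, and it assumes that only ASCII letters count as tokens.

```python
from collections import Counter, defaultdict

def conditional_probabilities(text):
    """Return a nested dict where answer[given][char] estimates P(char | given)."""
    letters = [c for c in text if c.isascii() and c.isalpha()]  # keep a-zA-Z only
    counts = defaultdict(Counter)
    for given, char in zip(letters, letters[1:]):
        counts[given][char] += 1
    answer = {}
    for given, followers in counts.items():
        total = sum(followers.values())
        answer[given] = {char: n / total for char, n in followers.items()}
    return answer

answer = conditional_probabilities("abracadabra")
print(answer['a']['b'])  # probability of 'b' given 'a' -> 0.5
```

Pairs that never occur are simply missing from the nested dictionaries, so something like `answer.get('b', {}).get('a', 0.0)` avoids a KeyError when looking up an unseen pair.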