Python - calculate multinomial probability density functions on large dataset?

Question

I originally intended to use MATLAB to tackle this problem but the in-built function has limitations that do not suit my goal. The same limitation occurs in NumPy.

I have two tab-delimited files. The first is a file showing amino acid residue, frequency and count for an in-house database of protein structures, i.e.

A    0.25    1
S    0.25    1
T    0.25    1
P    0.25    1

The second file consists of quadruplets of amino acids and the number of times they occur, i.e.

ASTP    1

Note, there are >8,000 such quadruplets.

Based on the background frequency of occurrence of each amino acid and the count of quadruplets, I aim to calculate the multinomial probability density function for each quadruplet and subsequently use it as the expected value in a maximum likelihood calculation.

The multinomial distribution is as follows:

f(x|n, p) = n!/(x1!*x2!*...*xk!)*((p1^x1)*(p2^x2)*...*(pk^xk))

where x is the number of each of k outcomes in n trials with fixed probabilities p. n is 4 in all cases in my calculation.

I have created four functions to calculate this distribution.

# functions for multinomial distribution


def expected_quadruplets(x, y):
    expected = x*y
    return expected

# calculates the probabilities of occurrence raised to the number of occurrences

def prod_prob(p1, a, p2, b, p3, c, p4, d):
    prob_prod = (pow(p1, a))*(pow(p2, b))*(pow(p3, c))*(pow(p4, d))
    return prob_prod 


# factorial() and multinomial_coefficient() work in tandem to calculate C, the multinomial coefficient

def factorial(n):
    if n <= 1:
        return 1
    return n*factorial(n-1)


def multinomial_coefficient(a, b, c, d):
    n = 24.0
    multi_coeff =  (n/(factorial(a) * factorial(b) * factorial(c) * factorial(d)))
    return multi_coeff

The problem is how best to structure the data in order to tackle the calculation most efficiently, in a manner that I can read (you guys write some cryptic code :-)) and that will not create an overflow or runtime error.

To date my data is represented as nested lists.

amino_acids = [['A', '0.25', '1'], ['S', '0.25', '1'], ['T', '0.25', '1'], ['P', '0.25', '1']]

quadruplets = [['ASTP', '1']]

I initially intended to call these functions within a nested for loop, but this resulted in runtime errors or overflow errors. I know that I can reset the recursion limit, but I would rather do this more elegantly.

I had the following:

for i in quadruplets:
    quad = i[0].split(' ')
    for j in amino_acids:
        for k in quadruplets:
            for v in k:
                if j[0] == v:
                    multinomial_coefficient(int(j[2]), int(j[2]), int(j[2]), int(j[2]))

I haven't really gotten to how to incorporate the other functions yet. I think that my current nested list arrangement is suboptimal.

I wish to compare each letter within the string 'ASTP' with the first component of each sub list in amino_acids. Where a match exists, I wish to pass the appropriate numeric values to the functions using indices.

Is there a better way? Can I append the appropriate numbers for each amino acid and quadruplet to a temporary data structure within a loop, pass this to the functions and clear it for the next iteration?

Thanks, S :-)

Solution

This might be tangential to your original question, but I strongly advise against calculating factorials explicitly, because they overflow quickly. Instead, make use of the fact that factorial(n) = gamma(n+1) and work with the logarithm of the gamma function, so that multiplications become additions and divisions become subtractions. scipy.special contains a function named gammaln, which gives you the logarithm of the gamma function.

from numpy import array, log, exp
from scipy.special import gammaln

def log_factorial(x):
    """Returns the logarithm of x!
    Also accepts lists and NumPy arrays in place of x."""
    return gammaln(array(x)+1)

def multinomial(xs, ps):
    n = sum(xs)
    xs, ps = array(xs), array(ps)
    result = log_factorial(n) - sum(log_factorial(xs)) + sum(xs * log(ps))
    return exp(result)
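
As a quick sanity check of the log-space idea, here is a minimal sketch that uses only the standard library. It assumes `math.lgamma` as a stand-in for `gammaln`; the function names `log_multinomial` and `direct_multinomial` are made up for illustration and are not part of the answer's code.

```python
import math

def log_multinomial(xs, ps):
    """Log of the multinomial probability for counts xs and probabilities ps."""
    n = sum(xs)
    # log(n!) - sum(log(x_i!)), using factorial(n) == gamma(n + 1)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in xs)
    # Outcomes with x == 0 contribute a factor of 1, so they are skipped
    log_probs = sum(x * math.log(p) for x, p in zip(xs, ps) if x > 0)
    return log_coeff + log_probs

def direct_multinomial(xs, ps):
    """Direct evaluation with explicit factorials; only sensible for small counts."""
    coeff = math.factorial(sum(xs))
    for x in xs:
        coeff //= math.factorial(x)
    prob = 1.0
    for x, p in zip(xs, ps):
        prob *= p ** x
    return coeff * prob

small = ([1, 1, 1, 1], [0.25, 0.25, 0.25, 0.25])
print(direct_multinomial(*small))         # 4!/(1!*1!*1!*1!) * 0.25**4 = 0.09375
print(math.exp(log_multinomial(*small)))  # approximately 0.09375

# With large counts the log-space value stays finite and usable
large = ([250, 250, 250, 250], [0.25, 0.25, 0.25, 0.25])
print(log_multinomial(*large))
```

Both routes agree for the small case, and the log-space version keeps working when the counts are far too large for explicit floating-point factorials.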

If you don't want to install SciPy just for the sake of gammaln, here is a replacement in pure Python (of course it is slower and it is not vectorized like the one in SciPy):

from math import log  # needed when NumPy's log is not imported

def gammaln(n):
    """Logarithm of Euler's gamma function for discrete values."""
    if n < 1:
        return float('inf')
    if n < 3:
        return 0.0
    c = [76.18009172947146, -86.50532032941677, \
         24.01409824083091, -1.231739572450155, \
         0.001208650973866179, -0.5395239384953 * 0.00001]
    x, y = float(n), float(n)
    tm = x + 5.5
    tm -= (x + 0.5) * log(tm)
    se = 1.0000000000000190015
    for j in range(6):
        y += 1.0
        se += c[j] / y
    return -tm + log(2.5066282746310005 * se / x)

Another easy trick is to use a dict for amino_acids, indexed by the residue itself. Given your original amino_acids structure, you can do this:

amino_acid_dict = dict((amino_acid[0], amino_acid) for amino_acid in amino_acids)
print(amino_acid_dict)
# Output, assuming the frequency and count fields were already converted to numbers:
# {'A': ['A', 0.25, 1], 'S': ['S', 0.25, 1], 'T': ['T', 0.25, 1], 'P': ['P', 0.25, 1]}

You can then look up the frequencies or counts by residue more easily:

freq_A = amino_acid_dict["A"][1]
count_A = amino_acid_dict["A"][2]
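
The nested lists in the question hold the frequency and count as strings, so they need converting before any arithmetic. A minimal sketch of doing that once while building the lookup table follows; the name `amino_acid_table` and the (frequency, count) tuple layout are assumptions made for illustration, not part of the original code.

```python
amino_acids = [['A', '0.25', '1'], ['S', '0.25', '1'],
               ['T', '0.25', '1'], ['P', '0.25', '1']]

# Map residue -> (frequency, count), converting the text fields to numbers once
amino_acid_table = {
    residue: (float(freq), int(count))
    for residue, freq, count in amino_acids
}

freq_A, count_A = amino_acid_table['A']
print(freq_A, count_A)  # 0.25 1
```

Converting up front keeps the main loop free of repeated `float()` and `int()` calls.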

This saves you some time in the main loop:

for quadruplet, _ in quadruplets:  # each entry is [sequence, number of occurrences]
    probs = [amino_acid_dict[aa][1] for aa in quadruplet]
    counts = [amino_acid_dict[aa][2] for aa in quadruplet]
    print(quadruplet, multinomial(counts, probs))
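
To tie the pieces together, here is a rough end-to-end sketch under explicit assumptions. The file contents are inlined as hypothetical strings, and each x_i is taken to be the number of times a residue appears within the quadruplet, which is one reading of the question rather than something the original code confirms. `math.lgamma` stands in for `gammaln`, and all function names are illustrative.

```python
import csv
import io
import math
from collections import Counter

# Stand-ins for the two tab-delimited files (hypothetical contents)
AMINO_ACID_FILE = "A\t0.25\t1\nS\t0.25\t1\nT\t0.25\t1\nP\t0.25\t1\n"
QUADRUPLET_FILE = "ASTP\t1\nAAST\t2\n"

def read_frequencies(handle):
    """Return {residue: background frequency} from residue/frequency/count rows."""
    return {row[0]: float(row[1])
            for row in csv.reader(handle, delimiter="\t") if row}

def read_quadruplets(handle):
    """Return a list of (quadruplet, occurrences) pairs."""
    return [(row[0], int(row[1]))
            for row in csv.reader(handle, delimiter="\t") if row]

def multinomial_probability(quadruplet, frequencies):
    """Multinomial probability of the residue composition of one quadruplet.

    Assumption: x_i is how often residue i occurs inside the quadruplet.
    """
    n = len(quadruplet)
    log_p = math.lgamma(n + 1)  # log(n!)
    for residue, x in Counter(quadruplet).items():
        log_p += x * math.log(frequencies[residue]) - math.lgamma(x + 1)
    return math.exp(log_p)

frequencies = read_frequencies(io.StringIO(AMINO_ACID_FILE))
quadruplets = read_quadruplets(io.StringIO(QUADRUPLET_FILE))

results = {quad: multinomial_probability(quad, frequencies)
           for quad, _ in quadruplets}
for quad, prob in results.items():
    print(quad, prob)  # ASTP is about 0.09375, AAST about 0.046875
```

With real data, the `io.StringIO` objects would be replaced by open file handles. Each quadruplet costs one dictionary lookup per residue, so about 8,000 quadruplets should not be a performance concern.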
