Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?


Problem description

INTRODUCTION: I'm a bioinformatician. In an analysis that I run on all human genes (about 20,000), I search for a particular short sequence motif and count how many times it occurs in each gene.

Genes are 'written' as a linear sequence over four letters (A, T, G, C), for example: CGTAGGGGGTTTAC... This four-letter alphabet of the genetic code is like the secret language of each cell; it is how DNA actually stores information.

I suspect that frequent repetitions of a particular short motif (AGTGGAC) in some genes are crucial to a specific biochemical process in the cell. Because the motif is very short, it is difficult for computational tools to distinguish true functional instances in genes from instances that look similar by chance. To address this, I took the sequences of all genes, concatenated them into a single string, and shuffled it. The length of each original gene was stored. Then, for each original length, a random sequence was constructed by repeatedly picking A, T, G, or C at random from the concatenated sequence and transferring it to the random sequence. The resulting set of randomized sequences therefore has the same length distribution and the same overall A, T, G, C composition as the real genes. I then searched for the motif in these randomized sequences. I performed this procedure 1000 times and averaged the results:

  • 15000 genes that do not contain a given motif
  • 5000 genes that contain 1 motif
  • 3000 genes that contain 2 motifs
  • 1000 genes that contain 3 motifs
  • ...
  • 1 gene that contains 6 motifs
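
The randomization procedure described above could be sketched roughly as follows. This is only an illustration on three toy sequences: the helper names count_motif and randomized_counts are hypothetical, and drawing letters from the pooled sequence without replacement is implemented here as a shuffle of the pool.

```python
import random

MOTIF = "AGTGGAC"


def count_motif(seq, motif=MOTIF):
    """Count occurrences of motif in seq (overlapping matches are allowed)."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def randomized_counts(genes, rng):
    """Shuffle the pooled letters, cut the pool back into pieces with the
    original gene lengths, and count the motif in each piece."""
    pool = list("".join(genes))
    rng.shuffle(pool)
    counts = []
    pos = 0
    for gene in genes:
        piece = "".join(pool[pos:pos + len(gene)])
        counts.append(count_motif(piece))
        pos += len(gene)
    return counts


# Toy stand-in for the real gene set (hypothetical sequences).
genes = ["CGTAGGGGGTTTAC", "AGTGGACAGTGGAC", "TTTTAGTGGACGGG"]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

real_counts = [count_motif(g) for g in genes]
null_counts = randomized_counts(genes, rng)
```

Repeating randomized_counts many times (1000 in the question) and pooling the results gives the null distribution of motif counts per gene.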

So even after 1000 randomizations of the true genetic code, no gene has more than 6 motifs. But in the true genetic code, a few genes contain more than 20 occurrences of the motif, which suggests that these repetitions might be functional and that it is unlikely to find them in such abundance by pure chance.
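
One way to turn this observation into a number is an empirical p-value: the fraction of randomized genes whose count is at least as large as the observed one. The sketch below is an assumption-laden illustration, not the asker's method: null_counts is simulated stand-in data, empirical_p_value is a hypothetical helper, and the +1 in numerator and denominator is the usual correction that keeps the estimate from being exactly zero.

```python
import numpy as np


def empirical_p_value(null_counts, observed):
    """Fraction of null counts >= observed, with the +1 correction."""
    null_counts = np.asarray(null_counts)
    exceed = np.count_nonzero(null_counts >= observed)
    return (exceed + 1) / (null_counts.size + 1)


# Hypothetical stand-in for the pooled counts from the randomized genomes.
rng = np.random.default_rng(0)
null_counts = rng.poisson(0.5, size=100_000)

p_emp = empirical_p_value(null_counts, observed=20)
print(f"empirical P(count >= 20) <= {p_emp:.2e}")
```

When no randomized gene ever reaches the observed count, this estimate can only say that the probability is below roughly 1 / (number of randomized genes). Getting a smaller number requires a fitted parametric model for the tail.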

PROBLEM:

I would like to know the probability of finding a gene with, let's say, 20 occurrences of the motif under my distribution, that is, the probability of finding such a gene by chance. I would like to implement this in Python, but I don't know how.

Can I do such an analysis in Python?
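
As a rough sketch of what such an analysis might look like, the snippet below assumes that the per-gene counts from the randomized sequences are approximately Poisson distributed. That is an assumption, not something established in the question (genes of very different lengths could make the real counts overdispersed), and the counts array is simulated stand-in data. It estimates the Poisson mean, computes the tail probability P(X >= 20), and runs a chi-square goodness-of-fit test with the sparse upper tail pooled into one bin.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the per-gene motif counts of one randomized genome.
rng = np.random.default_rng(1)
counts = rng.poisson(0.6, size=20_000)

# Maximum-likelihood estimate of the Poisson mean is the sample mean.
mu = counts.mean()

# sf(k) = P(X > k), so P(X >= 20) is sf(19).
p_tail = stats.poisson.sf(19, mu)

# Chi-square goodness of fit on bins 0, 1, 2, 3 and ">= 4".
k_max = 4
observed = np.bincount(np.minimum(counts, k_max), minlength=k_max + 1)
probs = np.append(stats.poisson.pmf(np.arange(k_max), mu),
                  stats.poisson.sf(k_max - 1, mu))
expected = probs * counts.size
# ddof=1 because one parameter (mu) was estimated from the same data.
chi2, p_gof = stats.chisquare(observed, expected, ddof=1)

print(f"mu = {mu:.3f}, P(X >= 20) = {p_tail:.3e}, GOF p-value = {p_gof:.3f}")
```

A small goodness-of-fit p-value would indicate that the Poisson model is a poor description of the randomized counts, in which case the tail probability should not be trusted.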

Any help would be appreciated.

Solution

In the SciPy documentation you will find a list of all implemented continuous distribution functions. Each one has a fit() method, which returns the fitted shape parameters followed by the location and scale.
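
A minimal sketch of what fit() returns, using the gamma distribution on simulated data (the choice of distribution and the sample are assumptions made only for illustration):

```python
from scipy import stats

# Simulated sample with known parameters: shape 2, loc 0, scale 3.
sample = stats.gamma.rvs(2.0, loc=0.0, scale=3.0, size=5000, random_state=0)

# For gamma, fit() returns (shape, loc, scale).
params = stats.gamma.fit(sample)
shape, loc, scale = params

# Freeze the fitted distribution and query a tail probability from it.
fitted = stats.gamma(shape, loc=loc, scale=scale)
tail = fitted.sf(20.0)  # P(X > 20) under the fitted model

print(params, tail)
```

Distributions with more shape parameters return a longer tuple, but loc and scale are always the last two entries.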

Even if you don't know which distribution to use, you can try many distributions at once and choose the one that fits your data best, as in the code below. Note that if you have no idea about the distribution, it may be difficult to fit the sample.

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

size = 20000
# evaluate the fitted PDFs on the same integer grid as the histogram bins
x = np.arange(48)
# create the dummy sample: beta(6, 2) draws scaled and rounded to integers 0..47
y = np.round(scipy.stats.beta.rvs(6, 2, size=size) * 47).astype(int)
# draw the histogram (bin width 1, so counts are comparable to pdf * size)
h = plt.hist(y, bins=range(48))

dist_names = ['alpha', 'beta', 'arcsine',
              'weibull_min', 'weibull_max', 'rayleigh']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    # fit() returns (shape parameters..., loc, scale)
    params = dist.fit(y)
    pdf_fitted = dist.pdf(x, *params[:-2], loc=params[-2], scale=params[-1]) * size
    plt.plot(x, pdf_fitted, label=dist_name)
plt.xlim(0, 47)
plt.legend(loc='upper left')
plt.show()
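
The plot only allows a visual comparison. One possible way to rank the candidates numerically is a Kolmogorov-Smirnov test with scipy.stats.kstest, sketched below. This goes beyond the original answer and rests on some assumptions: the sample is simulated and left unrounded because the KS test is meant for continuous data, and only three of the candidate distributions are tried to keep the run short.

```python
from scipy import stats

# Simulated continuous sample (beta(6, 2) scaled to the range 0..47).
sample = stats.beta.rvs(6, 2, size=2000, random_state=0) * 47

results = []
for name in ["beta", "weibull_min", "rayleigh"]:
    dist = getattr(stats, name)
    params = dist.fit(sample)
    # Compare the sample with the fitted CDF; a smaller statistic means a closer fit.
    statistic, p_value = stats.kstest(sample, name, args=params)
    results.append((name, statistic, p_value))

# Sort so that the best-fitting candidate comes first.
results.sort(key=lambda r: r[1])
for name, statistic, p_value in results:
    print(f"{name:12s} KS statistic = {statistic:.4f}, p-value = {p_value:.3g}")
```

Because the parameters are estimated from the same sample that is being tested, these p-values tend to be too optimistic. They are better used for ranking candidates than as a strict test.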

References:

- Distribution fitting with Scipy

- Fitting empirical distribution to theoretical ones with Scipy (Python)?
