Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?


Problem description

INTRODUCTION: I'm a bioinformatician. In an analysis that I run on all human genes (about 20 000), I search for a particular short sequence motif and count how many times it occurs in each gene.

Genes are 'written' as a linear sequence of four letters (A, T, G, C), for example: CGTAGGGGGTTTAC... This four-letter alphabet is the genetic code, something like the secret language of each cell; it is how DNA actually stores information.

I suspect that frequent repetitions of a particular short motif sequence (AGTGGAC) in some genes are crucial for a specific biochemical process in the cell. Because the motif is very short, it is difficult for computational tools to distinguish true functional occurrences in genes from those that merely look similar by chance. To address this, I took the sequences of all genes, concatenated them into a single string, and shuffled it. The length of each original gene was stored. Then, for each original sequence length, a random sequence was constructed by repeatedly picking A, T, G or C at random from the concatenated sequence and appending it to the random sequence. In this way, the resulting set of randomized sequences has the same length distribution and the same overall A, T, G, C composition as the real genes. I then searched for the motif in these randomized sequences. I performed this procedure 1000 times and averaged the results:

  • 15000 genes that do not contain a given motif
  • 5000 genes that contain 1 motif
  • 3000 genes that contain 2 motifs
  • 1000 genes that contain 3 motifs
  • ...
  • 1 gene that contains 6 motifs

So even after randomizing the true genetic code 1000 times, no gene has more than 6 motifs. But in the true genetic code there are a few genes that contain more than 20 occurrences of the motif, which suggests that these repetitions might be functional and that finding them in such abundance by pure chance is unlikely.

PROBLEM:

I would like to know the probability of finding a gene with, let's say, 20 occurrences of the motif under my randomized distribution. In other words, I want to know the probability of finding such a gene by chance. I would like to implement this in Python, but I don't know how.

Can I do such an analysis in Python?

Any help would be appreciated.

Solution

The SciPy documentation lists all implemented continuous distributions. Each one has a fit() method, which returns the fitted shape parameters followed by the location (loc) and scale.
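As a minimal sketch of what fit() returns, the snippet below fits a gamma distribution to a synthetic sample. The sample, its parameters, and the variable names are assumptions made for illustration; they are not taken from the question's data.

```python
import numpy as np
import scipy.stats

# Synthetic sample (an assumption for illustration): gamma, shape 2, scale 3
rng = np.random.default_rng(0)
sample = scipy.stats.gamma.rvs(2.0, loc=0.0, scale=3.0, size=5000,
                               random_state=rng)

# fit() returns a tuple: (shape parameters..., loc, scale).
# The gamma distribution has one shape parameter, so three values come back.
params = scipy.stats.gamma.fit(sample)
shape, loc, scale = params
print(f"shape={shape:.2f}, loc={loc:.2f}, scale={scale:.2f}")
```

Distributions without shape parameters, such as norm, return only (loc, scale), which is why code that handles arbitrary distributions usually slices the tuple from the end.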

Even if you don't know which distribution to use, you can fit several candidate distributions to the same sample and choose the one that fits your data best, as in the code below. Note that if you have no idea at all about the underlying distribution, it may be difficult to fit the sample.

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

size = 20000
x = np.arange(size)
# create the dummy sample (scaled, rounded draws from a beta distribution)
y = np.round(scipy.stats.beta.rvs(6, 2, size=size) * 47).astype(int)
# plot the histogram of the sample (bins of width 1)
h = plt.hist(y, bins=range(48))

dist_names = ['alpha', 'beta', 'arcsine',
              'weibull_min', 'weibull_max', 'rayleigh']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    # fit() returns (shape parameters..., loc, scale)
    params = dist.fit(y)
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]
    # multiply the density by the sample size to match the histogram counts;
    # unpacking an empty arg tuple is fine for distributions without shapes
    pdf_fitted = dist.pdf(x, *arg, loc=loc, scale=scale) * size
    plt.plot(pdf_fitted, label=dist_name)
plt.xlim(0, 47)
plt.legend(loc='upper left')
plt.show()
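The plot only gives a visual comparison. One possible way to put a number on "fits better" is a Kolmogorov-Smirnov test against each fitted distribution, sketched below with scipy.stats.kstest. This is an illustrative addition rather than part of the original answer; the candidate list, sample size, and unrounded sample are assumptions (the KS test is meant for continuous data, so the sample is not rounded to integers here).

```python
import numpy as np
import scipy.stats

# Synthetic continuous sample (an assumption for illustration)
rng = np.random.default_rng(0)
sample = scipy.stats.beta.rvs(6, 2, size=2000, random_state=rng) * 47

candidates = ['beta', 'rayleigh', 'weibull_min']
results = []
for name in candidates:
    dist = getattr(scipy.stats, name)
    params = dist.fit(sample)
    # kstest compares the sample with the fitted CDF; args carries
    # (shape parameters..., loc, scale) in the order returned by fit()
    stat, p_value = scipy.stats.kstest(sample, name, args=params)
    results.append((name, stat, p_value))

# A smaller KS statistic means the fitted CDF is closer to the empirical CDF
results.sort(key=lambda r: r[1])
for name, stat, p_value in results:
    print(f"{name:12s} KS statistic={stat:.4f}  p-value={p_value:.4g}")
```

Because the parameters are estimated from the same sample that is being tested, the reported p-values tend to be too optimistic, so they are better used for ranking candidates than as strict significance levels.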

References:

- Distribution fitting with Scipy

- Fitting empirical distribution to theoretical ones with Scipy (Python)?
