How to calculate counts and frequencies for pairs in a list of lists?


Question

Here, "bases" refers to A, T, G and C.

sample = [['CGG','ATT'],['GCGC','TAAA']]

# Note on the data: each element is made up of only 2 of the 4 bases.
# [['CGG' ==> only C and G, 'ATT' ==> only A and T], ['GCGC' ==> only C and G, 'TAAA' ==> only T and A]]
# Elements like "ATGG" are not present in the data, as they have more than 2 different types of bases.

Consider the first pair: ['CGG','ATT']

  1. Calculate the frequency of each base in each element of the pair separately:

CGG => (C = 1/3, G = 2/3) ATT => (A = 1/3, T = 2/3)

  2. Calculate the frequency of occurrence of each combination of bases in the pair. Here, the combinations are 'CA' and 'GT'. (Note that the order of the bases matters: the combinations are not 'CA', 'AC', 'GT' and 'TG', only 'CA' and 'GT'.)

Pairs => (CA = 1/3, GT = 2/3)

Calculate float(a) = (freq of the pair) - ((freq of C in CGG) * (freq of A in ATT))

E.g. for the CA pair, float(a) = (freq of CA pairs) - ((freq of C in CGG) * (freq of A in ATT))

Output a = (1/3) - ((1/3) * (1/3)) = 0.222222

Calculate "a" for any one combination (either the CA pair or the GT pair).

NOTE: If the pair is AAAC and CCCA, the freq of C would be 1/4, i.e. the frequency of a base is taken over one element of the pair (here AAAC), not over both.

Calculate b: float(b) = (float(a)^2) / ((freq of C in CGG) * (freq of G in CGG) * (freq of A in ATT) * (freq of T in ATT))

Output b = 1

Do this for the entire list:

   Final Output a = [0.2222, -0.125]
                b = [1, 0.3333]

This code has been adapted from this answer. Please note that there are subtle differences between the two questions, and they are NOT the same in their approach to the problem.

However, I am unable to get this code to run. I get the following error: for pair, count in i: TypeError: 'int' object is not iterable

#Count individual bases.

sample4 = [['CGG','ATT'],['GCGC','TAAA']]
base_counter = Counter()
for i in enumerate(sample4):
    for pair, count in i:
        base_counter[pair[0]] += count
        base_counter[pair[1]] += count
        print base_counter

# Get the total for each base.
total_count = sum(base_counter.values())

# Convert counts to frequencies.
base_freq = {}
for base, count in base_counter.items():
    base_freq[base] = count / total_count
# Not sure how to write a code to count the number of pairs (Step 2)
# Let's say the counts have been stored in pair_counts

# Examine a pair from the two unique pairs to calculate float_a.
for i in enumerate(sample4):
    float(a) = (pair_count[pair] / sum(pair_count.values())) - (base_freq[pair[0]] * base_freq[pair[1]])

# Step 7!
for i in enumerate(sample4):
    float_b = float_a / float(base_freq[0][0] * base_freq[0][1] * base_freq[1][0] * base_freq[1][1])

Recommended answer

You are not really using Counter any differently from a plain dict. Try something like the following approach:

>>> sample = [['CGG','ATT'],['GCGC','TAAA']]
>>> from collections import Counter
>>> base_counts = [[Counter(base) for base in sub] for sub in sample]
>>> base_counts
[[Counter({'G': 2, 'C': 1}), Counter({'T': 2, 'A': 1})], [Counter({'G': 2, 'C': 2}), Counter({'A': 3, 'T': 1})]]
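
As a side note on the error in the question, here is a minimal sketch of what most likely triggers it: enumerate() yields (index, item) tuples, so the inner loop ends up trying to unpack the integer index. The names first, error_message and unpacked are only illustrative and do not come from the original code.

```python
sample4 = [['CGG', 'ATT'], ['GCGC', 'TAAA']]

# enumerate() yields (index, item) tuples, not the items themselves.
first = next(enumerate(sample4))
print(first)

# Iterating over that tuple visits the int index first; unpacking an int
# into two names (pair, count) is what raises the TypeError.
error_message = None
try:
    for pair, count in first:
        pass
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Unpacking each two-element sublist directly avoids the problem.
unpacked = [(left, right) for left, right in sample4]
print(unpacked)
```

The exact wording of the TypeError message differs between Python versions, but in each case it is the int index that cannot be unpacked.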

Now you can continue with a functional approach, using nested comprehensions to transform your data*:

>>> # Divide each base count by the total count for its own sequence.
>>> base_freqs = [[{base: n / sum(count.values()) for base, n in count.items()} for count in counts]
...               for counts in base_counts]
>>> 
>>> base_freqs
[[{'G': 0.6666666666666666, 'C': 0.3333333333333333}, {'A': 0.3333333333333333, 'T': 0.6666666666666666}], [{'G': 0.5, 'C': 0.5}, {'A': 0.75, 'T': 0.25}]]
>>> 

*Note: some people do not like big, nested comprehensions like that. I think it's fine as long as you stick to functional constructs and do not mutate data structures inside your comprehensions. I actually find it very expressive. Others disagree vehemently. You can always unfold that code into nested for-loops.
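
For readers who prefer the loop form, the base-frequency step could be unfolded roughly as follows. This is a sketch only; the helper name freqs_for_pair is hypothetical, and true division (Python 3) is assumed.

```python
from collections import Counter

sample = [['CGG', 'ATT'], ['GCGC', 'TAAA']]

base_freqs = []
for sub in sample:
    freqs_for_pair = []  # one frequency dict per sequence in the pair
    for seq in sub:
        counts = Counter(seq)
        total = len(seq)  # every character in the sequence is counted once
        freqs_for_pair.append({base: n / total for base, n in counts.items()})
    base_freqs.append(freqs_for_pair)

print(base_freqs)
```

The result has the same shape as the comprehension version: one list per pair, holding one dict of frequencies per sequence.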

Anyway, you can then do the same thing with the pairs. First:

>>> pairs = [list(zip(*bases)) for bases in sample]
>>> pairs
[[('C', 'A'), ('G', 'T'), ('G', 'T')], [('G', 'T'), ('C', 'A'), ('G', 'A'), ('C', 'A')]]
>>> pair_counts = [Counter(base_pair) for base_pair in pairs]
>>> pair_counts
[Counter({('G', 'T'): 2, ('C', 'A'): 1}), Counter({('C', 'A'): 2, ('G', 'T'): 1, ('G', 'A'): 1})]
>>> 

Now, here it is easier not to use a comprehension, so we don't have to calculate total more than once:

>>> pair_freq = []
>>> for count in pair_counts:
...   total = sum(count.values())
...   pair_freq.append({k:c/total for k,c in count.items()})
... 
>>> pair_freq
[{('C', 'A'): 0.3333333333333333, ('G', 'T'): 0.6666666666666666}, {('G', 'T'): 0.25, ('C', 'A'): 0.5, ('G', 'A'): 0.25}]
>>> 
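
From here, the remaining steps of the question (a and b) could be sketched along the following lines. This is one possible reading, not a definitive implementation: the function name a_and_b is hypothetical, and it assumes that "any one combination" can be taken to be the combination at the first position of each pair, and that each sequence contains exactly two distinct bases, as the question states.

```python
from collections import Counter

sample = [['CGG', 'ATT'], ['GCGC', 'TAAA']]

def a_and_b(first, second):
    """Return (a, b) for one pair of equal-length sequences."""
    freq_first = {base: n / len(first) for base, n in Counter(first).items()}
    freq_second = {base: n / len(second) for base, n in Counter(second).items()}
    pair_counts = Counter(zip(first, second))
    total_pairs = sum(pair_counts.values())

    # Assumption: use the combination found at the first position, e.g. ('C', 'A').
    x, y = first[0], second[0]
    a = pair_counts[(x, y)] / total_pairs - freq_first[x] * freq_second[y]

    # Denominator: product of the base frequencies of both sequences.
    denom = 1.0
    for freq in list(freq_first.values()) + list(freq_second.values()):
        denom *= freq
    b = a ** 2 / denom
    return a, b

results = [a_and_b(first, second) for first, second in sample]
a_values = [a for a, _ in results]
b_values = [b for _, b in results]
print(a_values)
print(b_values)
```

For this sample, a comes out at roughly 0.2222 and 0.125, and b at roughly 1 and 0.3333. The question lists -0.125 for the second pair; the sign of a depends on which combination is picked (GT gives 0.125, GA gives -0.125), while b is unaffected because a is squared.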
