How to calculate frequency of elements for pairwise comparisons of lists in Python?

Problem description

I have the samples stored in the following list:

 sample = ["AAAA", "CGCG", "TTTT", "AT-T", "CATC"]

To illustrate the problem, I have denoted them as "Sets" below:

Set1 AAAA
Set2 CGCG
Set3 TTTT
Set4 AT-T
Set5 CATC

  1. Eliminate all Sets in which every element is identical (e.g. AAAA, TTTT).

Output:

 Set2 CGCG
 Set4 AT-T
 Set5 CATC

  2. Perform pairwise comparison between the sets. (Set2 v Set4, Set2 v Set5, Set4 v Set5)

  3. Each pairwise comparison can have only two types of combinations; if not, that pairwise comparison is eliminated. E.g.,

    Set2    Set5
    C       C
    G       A
    C       T
    G       C

Here, there are more than two types of pairs: (CC), (GA), (CT) and (GC). So this pairwise comparison cannot occur.

Every comparison can have only 2 combinations out of (AA, GG, CC, TT, AT, TA, AC, CA, AG, GA, GC, CG, GT, TG, CT, TC), i.e. all possible combinations of ACGT where order matters.

In the given example, more than 2 such combinations are found.

Hence, Set2 and Set5; Set4 and Set5 cannot be considered. Thus the only pair that remains is:

Output
Set2 CGCG
Set4 AT-T

  4. In this pairwise comparison, remove any element with "-" and its corresponding element in the other Set.

    Output
    Set2 CGG
    Set4 ATT

  5. Calculate the frequency of elements in Set2 and Set4. Calculate the frequency of occurrence of types of pairs across the Sets (CA and GT pairs).

    Output
    Set2 (C = 1/3, G = 2/3)
    Set4 (A = 1/3, T = 2/3)
    Pairs (CA = 1/3, GT = 2/3)

  6. Calculate float(a) = (Pairs) - (Set2) * (Set4) for the corresponding elements (any one pair is sufficient).

    E.g. for CA pairs, float(a) = (freq of CA pairs) - (freq of C) * (freq of A)

NOTE: If the pair is AAAC and CCCA, the freq of C would be 1/4, i.e. it is the frequency of the base within one Set of the pair.

  7. Calculate

    float(b) = float(a) / ((freq of C in CGG) * (freq of G in CGG) * (freq of A in ATT) * (freq of T in ATT))

  8. Repeat this for all pairwise comparisons.

eg.

Set2 CGCG
Set4 AT-T
Set6 GCGC

Set2 v Set4, Set2 v Set6, Set4 v Set6
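
As a sanity check on the arithmetic, steps 4 to 7 can be worked through by hand for the one surviving pair (Set2 v Set4). The sketch below uses plain for-loops and `fractions.Fraction` so the results stay exact. It assumes step 7 means dividing float(a) by the product of all four base frequencies; the variable names are illustrative only.

```python
from fractions import Fraction

# The only surviving pair from the example, before gap removal.
set2 = "CGCG"
set4 = "AT-T"

# Step 4: drop every column in which either sequence has "-".
columns = []
for base2, base4 in zip(set2, set4):
    if base2 != "-" and base4 != "-":
        columns.append((base2, base4))
length = len(columns)  # 3 columns remain: CA, GT, GT

# Step 5: count bases per sequence and pair types across the sequences.
count2 = {}
count4 = {}
pair_counts = {}
for base2, base4 in columns:
    count2[base2] = count2.get(base2, 0) + 1
    count4[base4] = count4.get(base4, 0) + 1
    pair_counts[(base2, base4)] = pair_counts.get((base2, base4), 0) + 1

# Step 6: float(a) for the CA pair.
freq_of_ca = Fraction(pair_counts[("C", "A")], length)
freq_of_c = Fraction(count2["C"], length)
freq_of_a = Fraction(count4["A"], length)
float_a = freq_of_ca - freq_of_c * freq_of_a

# Step 7 (assumed reading): divide by the product of all base frequencies.
denominator = Fraction(1)
for count in list(count2.values()) + list(count4.values()):
    denominator *= Fraction(count, length)
float_b = float_a / denominator

print(float_a, float_b)  # 2/9 9/2
```

Using the GT pair instead gives the same float(a): 2/3 - (2/3) * (2/3) = 2/9.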

My half-baked code so far (I would prefer it if all suggested code used standard for-loops and not comprehensions):

#Step 1
for i in sample:
    for j in range(i):
        if j = j+1    #This needs to be corrected to: if all elements in i are identical to each other, i.e. if all "j's" are the same
            del i
    #insert line of code where sample1 = new sample with deletions as above

#Step 2
for i,i+1 in enumerate(sample):
    #Step 3
    for j in range(i):
        for k in range(i+1):
            #insert line of code to say only two types of pairs can be included, if yes continue else skip
            #Step 4
            if j = "-" or k = "-":
                #Delete j/k and the corresponding element in the other pair
                #Step 5
                count_dict = {}
                square_dict = {}
                for base in list(i):
                    if base in count_dict:
                        count_dict[base] += 1
                    else:
                        count_dict[base] = 1
                for allele in count_dict:
                    freq = (count_dict[allele] / len(i)) #frequencies of individual alleles
                #Calculate frequency of pairs
                #Step 6
                #No code yet

Solution

I think this is what you want:

from collections import Counter

# Remove elements where all nucleobases are the same.
for index in range(len(sample) - 1, -1, -1):
    if sample[index][:1] * len(sample[index]) == sample[index]:
        del sample[index]

for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexA <= indexB:
            continue

        # Calculate number of unique pairs
        pair_count = Counter()
        for pair in zip(setA, setB):
            if '-' not in pair:
                pair_count[pair] += 1

        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue

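        # Count individual bases separately for each sequence, so that a
        # base occurring in both sequences is not counted twice.
        base_counter_a = Counter()
        base_counter_b = Counter()
        for pair, count in pair_count.items():
            base_counter_a[pair[0]] += count
            base_counter_b[pair[1]] += count

        # Get the length of each sequence once pairs with '-' are removed.
        sequence_length = sum(pair_count.values())

        # Convert counts to frequencies.
        base_freq_a = {}
        for base, count in base_counter_a.items():
            base_freq_a[base] = count / float(sequence_length)
        base_freq_b = {}
        for base, count in base_counter_b.items():
            base_freq_b[base] = count / float(sequence_length)

        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(sequence_length)) - base_freq_a[pair[0]] * base_freq_b[pair[1]]

        # Step 7! Multiply only the frequencies of bases that are present;
        # a default of 0 for a missing base would cause a ZeroDivisionError.
        denominator = 1.0
        for freq in list(base_freq_a.values()) + list(base_freq_b.values()):
            denominator *= freq
        float_b = float_a / denominator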
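
To see which comparisons pass the two-pair-type filter before any frequencies are computed, the check can be pulled out on its own. This is a minimal sketch in for-loop style; `surviving_pairs` is a hypothetical helper name, not part of the code above, and it assumes (as the code above does) that columns containing "-" are ignored when counting pair types.

```python
from collections import Counter


def surviving_pairs(sequences):
    """Return the (first, second) sequence pairs with exactly two pair types."""
    kept = []
    for index_a in range(len(sequences)):
        # Start at index_a + 1 so each pair is visited once and never
        # compared with itself.
        for index_b in range(index_a + 1, len(sequences)):
            pair_count = Counter()
            for pair in zip(sequences[index_a], sequences[index_b]):
                if "-" not in pair:
                    pair_count[pair] += 1
            if len(pair_count) == 2:
                kept.append((sequences[index_a], sequences[index_b]))
    return kept


print(surviving_pairs(["CGCG", "AT-T", "CATC"]))
# [('CGCG', 'AT-T')]
print(surviving_pairs(["CGCG", "AT-T", "GCGC"]))
# [('CGCG', 'AT-T'), ('CGCG', 'GCGC'), ('AT-T', 'GCGC')]
```

The first call reproduces the question's example, where only Set2 v Set4 remains; in the second, all three comparisons have exactly two pair types.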

Or, more Pythonically (with the list/dict comprehensions you don't want):

from collections import Counter

BASES = 'ATCG'

# Remove elements where all nucleobases are the same.
sample = [item for item in sample if item[:1] * len(item) != item]

for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexA <= indexB:
            continue

        # Calculate number of unique pairs
        relevant_pairs = [(elA, elB) for (elA, elB) in zip(setA, setB) if elA != '-' and elB != '-']
        pair_count = Counter(relevant_pairs)

        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue

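        # Both sequences as tuples with pairs involving '-' removed.
        # New names are used so the loop variables setA and setB are not
        # overwritten for later iterations.
        seqA, seqB = zip(*relevant_pairs)

        # Get the length of the filtered sequences.
        seq_length = len(seqA)

        # Convert counts to frequencies, separately for each sequence.
        freqA = {base : count / float(seq_length) for (base, count) in Counter(seqA).items()}
        freqB = {base : count / float(seq_length) for (base, count) in Counter(seqB).items()}

        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(seq_length)) - freqA[pair[0]] * freqB[pair[1]]

        # Step 7! A base absent from a sequence contributes a neutral factor
        # of 1; a default of 0 would cause a ZeroDivisionError.
        denominator = 1
        for base in BASES:
            denominator *= freqA.get(base, 1) * freqB.get(base, 1)

        float_b = float_a / denominator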
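
The index bookkeeping in the nested loops can also be handed to `itertools.combinations`, which yields each unordered pair of sequences exactly once. The sketch below wraps the whole calculation in a function that returns its results instead of discarding them. `analyse` is a hypothetical name, base frequencies are taken per sequence as in the question's NOTE, and `Fraction` is used only to keep the output exact.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def analyse(sequences):
    """Map each surviving (first, second) pair to its (float_a, float_b)."""
    results = {}
    for first, second in combinations(sequences, 2):
        # Keep only the columns in which neither sequence has a gap.
        columns = [(a, b) for a, b in zip(first, second) if a != "-" and b != "-"]
        pair_count = Counter(columns)
        if len(pair_count) != 2:
            continue
        length = len(columns)
        # Per-sequence base frequencies.
        freq_first = {base: Fraction(n, length)
                      for base, n in Counter(a for a, _ in columns).items()}
        freq_second = {base: Fraction(n, length)
                       for base, n in Counter(b for _, b in columns).items()}
        pair = columns[0]  # any one of the two pair types is sufficient
        float_a = (Fraction(pair_count[pair], length)
                   - freq_first[pair[0]] * freq_second[pair[1]])
        denominator = Fraction(1)
        for freq in list(freq_first.values()) + list(freq_second.values()):
            denominator *= freq
        results[(first, second)] = (float_a, float_a / denominator)
    return results


for pair, (float_a, float_b) in analyse(["CGCG", "AT-T", "GCGC"]).items():
    print(pair, float_a, float_b)
# ('CGCG', 'AT-T') 2/9 9/2
# ('CGCG', 'GCGC') 1/4 4
# ('AT-T', 'GCGC') 2/9 9/2
```

Call `float()` on the returned values if ordinary floating-point numbers are needed.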
