如何按升序对我最常用的3聚体列表进行排序? [英] How to sort my list of most frequent 3-mers in ascending order?

查看：129 发布时间：2021/4/15 19:47:06 c++ c bioinformatics

本文介绍了如何按升序对我最常用的3聚体列表进行排序?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在编写代码，以找出DNA序列中最常见的3聚体.我编写了一个代码，计算一个3聚体的出现，如果它大于1，则该代码会记录字符串和出现的次数.

I am writing a code to find out most frequent 3-mers in a DNA sequence. I wrote a code that counts the occurrence of a 3-mer and if it greater than 1 then code records both string and number of occurrences.

这给了我一个本质上多余的清单.我想对列表进行排序，这样我在列表中只会看到每个3聚体一次.

This is giving me a list that is redundant in nature. I want to sort the list such that I only see each 3-mer once in the list.

下面是编写的代码:

int main()
{
    char dna[1000];
    char read[3] = {0};
    char most_freq[3];

    printf("Enter the DNA sequence\n");
    fgets(dna, sizeof(dna), stdin);
    int i, j;
    for(i=0; i<strlen(dna)-3; i++)
    {

        read[0] = dna[i];
        read[1] = dna[i+1];
        read[2] = dna[i+2];
        int count=0, maxcount=1;
        for(j = 0; j < strlen(dna); j++)
        {
            if(dna[j] == read[0] && dna[j+1] == read[1] && dna[j+2] == read[2])
            {
                count++;
            }
            else
            {
                continue;
            }
        }
        if(count > maxcount)
        {
            maxcount = count;
            
           printf("%s %d\n", read, maxcount);
        }

    }
}

这是我输入的结果:

CGCCTAAATAGCCTCGCGGAGCCTTATGTCATACTCGTCCT

CGC 2
GCC 3
CCT 4
ATA 2
AGC 2
GCC 3
CCT 4
CTC 2
TCG 2
CGC 2
AGC 2
GCC 3
CCT 4
GTC 2
ATA 2
CTC 2
TCG 2
GTC 2
CCT 4

很明显，答案是CCT，但我不希望输出出现冗余.我该如何解决?

It is clear that the answer is CCT but I don't want redundancy in output. How do I resolve this?

推荐答案

这是在 C

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

typedef struct {
    char n_base[4];
    int count;
} NMer_3;

typedef struct {
    int count;
    NMer_3 trimer[4 * 4 * 4];
} dict;

int cmp(const void* a, const void* b) {
    return strncmp((char*)a, (char*)b, 3);
}

void insertTrimer(dict* d, char c[3]) {
    NMer_3* ptr = (NMer_3*)bsearch((void*)c, (void*)d->trimer, d->count,
                                   sizeof(NMer_3), cmp);
    if (ptr == NULL) {
        int offset = d->count;
        strncpy(d->trimer[offset].n_base, c, 3);
        d->trimer[offset].count = 1;
        d->count++;
        qsort(d->trimer, d->count, sizeof(NMer_3), cmp);
    } else {
        ptr->count++;
    }
}

int main() {
    char dna[1000];
    dict d;
    printf("Enter the DNA sequence\n");
    char* res = fgets(dna, sizeof(dna), stdin);
    if (res == NULL)
        return 1;
    char* ptr = &dna[0];
    for (int i = 0; i < strlen(dna) - 2; i++)
        insertTrimer(&d, ptr++);
    for (int i = 0; i < d.count; i++)
        printf("%s : %d\n", d.trimer[i].n_base, d.trimer[i].count);
    return 0;
}

基本上，每个3-mer是较大结构中的一个条目.较大的结构进行二进制搜索，并在每次发现新的3聚体时进行q分类.否则，如果找到重复，则其条目将递增.

Basically, each 3-mer is an entry in a larger struct. The larger struct is binary-searched, and q-sorted every time a new 3-mer is found. Otherwise, if a repeat is found, their entry is incremented.

这是您的输入所使用的结果

Here is the result used with your input

AAA : 1
AAT : 1
ACT : 1
AGC : 2
ATA : 2
ATG : 1
CAT : 1
CCT : 4
CGC : 2
CGG : 1
CGT : 1
CTA : 1
CTC : 2
CTT : 1
GAG : 1
GCC : 3
GCG : 1
GGA : 1
GTC : 2
TAA : 1
TAC : 1
TAG : 1
TAT : 1
TCA : 1
TCC : 1
TCG : 2
TGT : 1
TTA : 1

提高速度的方法:

使用类似水母的程序
使用哈希图.没有用于散列表/表的标准C库.基本上，您将要做的事情与我在这里所做的非常相似.内存管理可能是一个挑战.但是，您将对序列中的每个3-mer进行O(1)搜索，而不是O(log(n))，此外，加法运算只会是O(1)而不是O(n* log(n))排序.

如果您使用C ++进行操作，您将获得很多好处，首先是很多更简单的代码:

If you do it in C++, you get a lot of benefits, the first being much simpler code:

#include <string>
#include <iostream>
#include <map>

int main() {
    std::string dna;
    printf("Enter the DNA sequence\n");
    std::getline(std::cin, dna);
    auto d = std::map<std::string,int>{};
    for (int i = 0; i < dna.size() - 2; i++){
        auto mer3 = dna.substr(i,3);
        auto itr = d.find(mer3);
        if (itr == d.end()){
            d[mer3] = 1;
        } else {
            itr->second += 1;
        }
    }
    for (auto i : d) std::cout << i.first << ':' << i.second << '\n';
    std::cout <<std::endl;
    return 0;
}

这实际上与C示例相同.

This is effectively the same as the C example.

如果将 map 替换为 unordered_map ，它将变得更快，但是不会对输出进行排序.

If you replace map with unordered_map it becomes much faster, however, the output will not be sorted.

这篇关于如何按升序对我最常用的3聚体列表进行排序?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何按升序对我最常用的3聚体列表进行排序? [英] How to sort my list of most frequent 3-mers in ascending order?

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

如何按升序对我最常用的3聚体列表进行排序? [英] How to sort my list of most frequent 3-mers in ascending order?

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭