Count subsequences in hundreds of GB of data


Question

I'm trying to process a very large file and tally the frequency of all sequences of a certain length in the file.

To illustrate what I'm doing, consider a small input file containing the sequence abcdefabcgbacbdebdbbcaebfebfebfeb

Below, the code reads the whole file in, and takes the first substring of length n (below I set this to 5, although I want to be able to change this) and counts its frequency:

abcde => 1

Next, it moves one character to the right and does the same:

bcdef => 1

It then continues for the rest of the string and prints the 5 most frequent sequences:

open my $in, '<', 'in.txt' or die $!; # 'abcdefabcgbacbdebdbbcaebfebfebfeb'

my $seq = <$in>; # read the whole file (a single line) into a string
my $len = length($seq);

my $seq_length = 5; # set k-mer length
my %data;

for (my $i = 0; $i <= $len - $seq_length; $i++) {
    my $kmer = substr($seq, $i, $seq_length);
    $data{$kmer}++;
}

# print the hash, showing only the 5 most frequent k-mers
my $count = 0;
foreach my $kmer (sort { $data{$b} <=> $data{$a} } keys %data ){
    print "$kmer $data{$kmer}\n";
    $count++;
    last if $count >= 5;
}


ebfeb 3
febfe 2
bfebf 2
bcaeb 1
abcgb 1


However, I would like to find a more efficient way of achieving this. If the input file were 10GB or 1000GB, then reading the whole thing into a string would be very memory-expensive.

I thought about reading in blocks of characters, say 100 at a time, and proceeding as above, but here, sequences that span two blocks would not be tallied correctly.
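To make the boundary problem concrete, here is a minimal runnable sketch (the 5-character block size and the toy string are assumptions for illustration) showing that non-overlapping blocks lose every k-mer that straddles a boundary:

use strict;
use warnings;

my $seq = 'abcdefabcg'; # toy input, assumed for illustration
my $k   = 5;

# Count k-mers only within non-overlapping 5-character blocks
my %within;
for my $block ( $seq =~ /(.{1,5})/g ) {
    $within{ substr $block, $_, $k }++ for 0 .. length($block) - $k;
}

# Prints only 'abcde => 1' and 'fabcg => 1'; the four k-mers that span
# the block boundary (bcdef, cdefa, defab, efabc) are never counted
print "$_ => $within{$_}\n" for sort keys %within;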

My idea, then, is to read in only n characters at a time, then move on to the next n characters and do the same, tallying their frequency in a hash as above.

  • Are there any suggestions about how I could do this? I've had a look at read using an offset, but can't get my head around how I could incorporate it here (see the sketch after this list)
  • Is substr the most memory-efficient tool for this task?
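On the first bullet: read's four-argument form takes an OFFSET, which places the newly read bytes at that position in the buffer instead of overwriting it. A minimal sketch of how that could be combined with a rolling buffer follows; the 100-byte block size and in.txt come from the question, the rest is my assumption (the answer below develops the same idea with sysread):

use strict;
use warnings;

my $k = 5;
open my $fh, '<', 'in.txt' or die $!;

my $buf = '';
while ( read $fh, $buf, 100, length $buf ) { # OFFSET appends to $buf
    # ... tally the k-mers in $buf here, as in the code above ...
    $buf = substr $buf, -( $k - 1 ) if length $buf >= $k; # keep last k-1 chars
}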

Answer

From your own code it looks like your data file has just a single line of data, not broken up by newline characters, so I've assumed that in my solution below. Even if the line happens to have a newline character at the end, the selection of the five most frequent subsequences will throw it out, as it occurs only once.

This program uses sysread to fetch an arbitrarily sized chunk of data from the file and append it to the data we already have in memory.

The body of the loop is mostly similar to your own code, but I have used the list version of for instead of the C-style one, as it is much clearer.
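For comparison, here is a small runnable contrast of the two styles, with the lengths assumed for illustration; both visit exactly the same offsets:

use strict;
use warnings;

my $len = 10;
my $k   = 5;

# C-style loop, as in the question
for ( my $i = 0; $i <= $len - $k; $i++ ) {
    print "offset $i\n";
}

# List-style loop, as in the answer: same offsets, less bookkeeping
for my $offset ( 0 .. $len - $k ) {
    print "offset $offset\n";
}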

After processing each chunk, the in-memory data is truncated to the last SEQ_LENGTH-1 bytes before the next cycle of the loop pulls in more data from the file.

I've also used constants for the K-mer size and the chunk size. They are constant, after all!

The output data was produced with CHUNK_SIZE set to 7 so that there would be many instances of cross-boundary subsequences. It matches your own required output, except for the last two entries with a count of 1. That is because of the inherent random order of Perl's hash keys; if you require a specific order among sequences with equal counts, then you must specify it so that I can change the sort.
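One possible deterministic ordering (my own suggestion, not part of the original answer) is to break count ties by comparing the k-mers themselves; the sample counts below are taken from the output further down:

use strict;
use warnings;

my %data = ( ebfeb => 3, febfe => 2, bfebf => 2, gbacb => 1, acbde => 1 );

# Sort by descending count, then alphabetically among equal counts
my @kmers = sort { $data{$b} <=> $data{$a} or $a cmp $b } keys %data;
print "$_ $data{$_}\n" for @kmers;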

use strict;
use warnings 'all';

use constant SEQ_LENGTH => 5;           # K-mer length
use constant CHUNK_SIZE => 1024 * 1024; # Chunk size - say 1MB

my $in_file = shift // 'in.txt';

open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for input: $!};

my %data;
my $chunk;
my $length = 0;

# sysread's fourth argument is an offset: new bytes are appended at
# position $length, after the tail kept from the previous chunk
while ( my $size = sysread $in_fh, $chunk, CHUNK_SIZE, $length ) {

    $length += $size;

    for my $offset ( 0 .. $length - SEQ_LENGTH ) {
         my $kmer = substr $chunk, $offset, SEQ_LENGTH;
         ++$data{$kmer};
    }

    # Keep only the last SEQ_LENGTH-1 bytes so that k-mers spanning
    # the chunk boundary are counted on the next pass
    $chunk = substr $chunk, -(SEQ_LENGTH-1);
    $length = length $chunk;
}

# Sort k-mers by descending count and print the five most frequent
my @kmers = sort { $data{$b} <=> $data{$a} } keys %data;
print "$_ $data{$_}\n" for @kmers[0..4];

Output

ebfeb 3
febfe 2
bfebf 2
gbacb 1
acbde 1

Note the line $chunk = substr $chunk, -(SEQ_LENGTH-1);, which sets $chunk as we pass through the while loop. This ensures that strings spanning two chunks are counted correctly.

The $chunk = substr $chunk, -4 statement (which is what -(SEQ_LENGTH-1) evaluates to when SEQ_LENGTH is 5) removes all but the last four characters from the current chunk, so that the next read appends CHUNK_SIZE bytes from the file to those remaining characters. This way the search continues, but starts with the last 4 characters of the previous chunk in addition to the next chunk: data doesn't fall into a "crack" between the chunks.
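A quick self-contained demonstration of substr with a negative offset, using an assumed toy string:

use strict;
use warnings;

my $chunk = 'abcdefg';
print substr( $chunk, -4 ), "\n"; # prints "defg", the last four characters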
