Reading a particular line by line number in a very large file

Question

The file is over 100GB and will not fit into memory. I want to access specific lines by line number, without counting line by line until I reach the one I want.

I have read http://docstore.mik.ua/orelly/perl/cookbook/ch08_09.htm

When I build an index using the following subroutines, lookups work up to a certain point. Once the line number gets very large, the same line is returned no matter which line I ask for. It seems to work for line numbers 1 through approximately 350000.

    # usage: build_index(*DATA_HANDLE, *INDEX_HANDLE)
    sub build_index {
        my $data_file  = shift;
        my $index_file = shift;
        my $offset     = 0;

        while (<$data_file>) {
            print $index_file pack("N", $offset);
            $offset = tell($data_file);
        }
    }

    # usage: line_with_index(*DATA_HANDLE, *INDEX_HANDLE, $LINE_NUMBER)
    # returns line or undef if LINE_NUMBER was out of range
    sub line_with_index {
        my $data_file   = shift;
        my $index_file  = shift;
        my $line_number = shift;

        my $size;               # size of an index entry
        my $i_offset;           # offset into the index of the entry
        my $entry;              # index entry
        my $d_offset;           # offset into the data file

        $size = length(pack("N", 0));
        $i_offset = $size * ($line_number-1);
        seek($index_file, $i_offset, 0) or return;
        read($index_file, $entry, $size);
        $d_offset = unpack("N", $entry);
        seek($data_file, $d_offset, 0);
        return scalar(<$data_file>);
    }



I've also tried using DB_File, but the tie seems to take a very long time. I also don't really understand what it means that the "DB_RECNO access method ties an array to a file, one line per array element." Is it correct that tie does not read the file into the array?

Recommended answer

pack N creates a 32-bit unsigned integer. The largest value it can hold is 2^32 - 1, so it can only address offsets within the first 4GB of a file. Using it to store offsets into a file that's 100GB in size won't work.
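To see the problem in isolation, here is a minimal sketch that round-trips one offset through the N format. The 5 GiB offset is only an illustrative value, and the exact number read back may differ between builds, so the sketch only checks that it no longer matches.

```perl
use strict;
use warnings;

# A byte offset 5 GiB into the file (illustrative value, not from the question).
my $offset = 5 * 2**30;

# Store it the way the original build_index does, then read it back.
my $roundtrip = unpack("N", pack("N", $offset));

printf "stored %s, read back %s\n", $offset, $roundtrip;
print "offset was lost\n" if $roundtrip != $offset;
```

Every line that starts beyond the 4GB mark gets an index entry like this, which is why lookups past a certain line number stop returning the right line.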

Some builds of Perl use 64-bit integers. On those, you could use j.

Some builds use 32-bit integers. tell returns a floating-point number on those, allowing you to index files up to 8,388,608 GB in size losslessly. On those, you should use F.
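The 8,388,608 GB figure can be checked with a little arithmetic. This sketch assumes the build's native float (what F packs) is a 64-bit IEEE 754 double, which represents every integer up to 2^53 exactly; the 100 GiB offset is an illustrative value.

```perl
use strict;
use warnings;

# Assumption: "F" packs a 64-bit IEEE 754 double with a 53-bit significand.
my $max_exact = 2**53;                   # largest range of exactly representable offsets
printf "%d GB\n", $max_exact / 2**30;    # 8388608 GB

# An offset 100 GiB into a file survives a round trip through "F" unchanged.
my $offset = 100 * 2**30;
print "exact\n" if unpack("F", pack("F", $offset)) == $offset;
```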

Portable code is as follows:

use Config qw( %Config );
my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';

...
print $index_file pack($off_t, $offset);
...






Note: I'm assuming the index file is only used by the same Perl that built it (or at least one with the same integer size, seek size and machine endianness). Let me know if that assumption doesn't hold for you.
