Split a 10 billion line file into 5,000 files by column value in Perl or Python

Question

I have a 10 billion line tab-delimited file that I want to split into 5,000 sub-files based on the value of its first column. How can I do this efficiently in Perl or Python?

This has been asked here before, but all of the approaches either open a file for each row read or hold all of the data in memory.

Recommended answer

This program will do as you ask. It expects the input file as a parameter on the command line, and writes output files whose names are taken from the first column of the input file records.

It keeps a hash %fh of file handles and a parallel hash %opened of flags that indicate whether a given file has ever been opened before. A file is opened for append if it appears in the %opened hash, or for write if it has never been opened before. If the limit on open files is hit, a (random) selection of 1,000 file handles is closed.

There is no point in keeping track of when each handle was last used and closing the most out-of-date handles. If the data in the input file is randomly ordered, every handle in the hash has the same chance of being the next to be used. If the data is already sorted, none of the file handles will ever be used again.

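use strict;
use warnings 'all';

my %fh;        # tag => currently open output file handle
my %opened;    # tag => true once that output file has been created

while ( <> ) {

    # The first whitespace-separated field names the output file
    my ($tag) = split;

    if ( not exists $fh{$tag} ) {

        # Append if this file was created earlier and its handle closed since
        my $mode = $opened{$tag} ? '>>' : '>';

        while () {

            # Open into a lexical so that a failed open leaves nothing in %fh
            if ( open my $out, $mode, $tag ) {
                $fh{$tag}     = $out;
                $opened{$tag} = 1;
                last;
            }

            # Only "too many open files" is recoverable, and only while
            # there are still handles left to close
            die qq{Unable to open "$tag" for output: $!}
                unless $!{EMFILE} and %fh;

            # Close up to 1,000 handles; hash order is effectively random
            my $n = 0;
            for my $old ( keys %fh ) {
                my $fh = delete $fh{$old};
                close $fh or die $!;
                last if ++$n >= 1_000;
            }
        }
    }

    print { $fh{$tag} } $_;
}

# Flush and close whatever is still open
close $_ or die $! for values %fh;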
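
Since the question also mentions Python, here is a minimal sketch of the same handle-caching idea in that language. It is not part of the original answer. The function name split_by_first_column and the parameters max_close and opener are hypothetical names chosen for illustration (opener exists only so the open-file limit can be simulated). The sketch assumes every line has a non-empty first field that is safe to use as a file name, and it splits on tabs rather than on any whitespace.

```python
import errno
import random


def split_by_first_column(lines, max_close=1000, opener=open):
    """Write each line to a file named after its first tab-delimited field.

    Open handles are cached. When the OS reports EMFILE (too many open
    files), up to `max_close` randomly chosen handles are closed and the
    open is retried. Files created earlier are reopened in append mode.
    """
    handles = {}     # tag -> currently open file object
    created = set()  # tags whose output file has already been created

    def open_for(tag):
        mode = "a" if tag in created else "w"
        while True:
            try:
                fh = opener(tag, mode)
            except OSError as exc:
                # Only "too many open files" is recoverable, and only
                # while there are still handles left to close.
                if exc.errno != errno.EMFILE or not handles:
                    raise
                count = min(max_close, len(handles))
                for victim in random.sample(list(handles), count):
                    handles.pop(victim).close()
            else:
                created.add(tag)
                return fh

    try:
        for line in lines:
            # First tab-delimited field; strip the newline in case the
            # line has only one column.
            tag = line.split("\t", 1)[0].rstrip("\n")
            fh = handles.get(tag)
            if fh is None:
                fh = handles[tag] = open_for(tag)
            fh.write(line)
    finally:
        # Flush and close whatever is still open.
        for fh in handles.values():
            fh.close()
```

It could be called with an open file object as lines, for example inside "with open(path) as src:", so that the input is streamed one line at a time rather than loaded into memory.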
