Merging multiple sorted files into one sorted file

Problem Description

I have to merge multiple sorted files into 1 sorted file. Presently I am reading one data entry of each file (12 bytes), writing the maximum into the output file and incrementing the pointer of the input file from which I copied the data, much like the merge step of merge sort.

while (n){
    //read one data element (12 bytes)
    //find maximum
    //write the maximum
    //increment the file pointer of the file containing the maximum
}

This process is repeated n = approx 3,000,000 times. The problem is that it is taking a large amount of time (approx 30 seconds). I would like to reduce the time taken to under 10 seconds. Any ideas on how it can be done?
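
For concreteness, here is a minimal, self-contained sketch of that approach in C, under my own assumptions: fixed 12-byte binary records, descending order ("write the maximum"), and input file names on the command line; none of the names here come from the asker's code. Each output record costs a linear scan over all k inputs, so the total work is O(n * k) comparisons; the heap used in the answer below cuts that to O(n * log k).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RECLEN 12   /* fixed record size, per the question */

typedef struct
{
    FILE *fp;
    char  rec[RECLEN];
    int   live;     /* 1 while this input still has records */
} Input;

static void refill(Input *in)
{
    in->live = (fread(in->rec, 1, RECLEN, in->fp) == RECLEN);
}

int main(int argc, char **argv)
{
    int k = argc - 1;
    Input *in = calloc(k, sizeof(*in));
    if (in == 0)
        return 1;
    for (int i = 0; i < k; i++)
    {
        if ((in[i].fp = fopen(argv[i + 1], "rb")) == 0)
            return 1;
        refill(&in[i]);
    }
    while (1)
    {
        /* Linear scan over all k inputs to find the maximum current record */
        int best = -1;
        for (int i = 0; i < k; i++)
        {
            if (in[i].live &&
                (best < 0 || memcmp(in[i].rec, in[best].rec, RECLEN) > 0))
                best = i;
        }
        if (best < 0)
            break;                              /* every input is exhausted */
        fwrite(in[best].rec, 1, RECLEN, stdout);
        refill(&in[best]);                      /* advance only the file we copied from */
    }
    return 0;
}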

Answer

Data generation

I created 10 files each with 300,000 lines, each line containing 11 characters (mainly digits) and a newline, using this Perl script:

#!/usr/bin/env perl
use strict;
use warnings;

# Generate sorted, merge-worthy data

die "Usage: num-files lines-per-file" unless scalar(@ARGV) == 2;
my $num_files = $ARGV[0];
my $num_lines = $ARGV[1];
die "Enter a number of files between 2 and 999" unless $num_files >= 2 && $num_files <= 999;
die "Enter a number of lines between 2 and 9,999,999" unless $num_files >= 2 && $num_files <= 9_999_999;

my $base = "datafile";
my $digits = length($num_files . '');
my $format = "$base.%.${digits}d";

foreach my $i (1..$num_files)
{
    my $file = sprintf $format, $i;
    open my $fh, '>', $file or die "Failed to create $file";
    foreach my $n (1..$num_lines)
    {
        printf $fh "%.7d-%.3d\n", $n + 60 * $i, $i;
    }
    close $fh;
}

It took Perl 5.18.0 about 1.5 seconds to generate the 10 files (on a MacBook Pro, 2.3 GHz Intel Core i7, 16 GiB 1333 MHz RAM, good old-fashioned spinning magnetic disk; useful, but not incredibly powerful). The data is designed so that there is overlap between the files, but also so that each line is unique and identifies the file it comes from (that's the purpose of the 60 * $i offset and of putting the sequential number before the file number). In testing 3 files, say, there is some data from file 1 alone before there is data from files 1 and 2 interleaved, and then from all 3 files; then file 1 runs out, and then file 2. That gives decent coverage of the different parts of the merge algorithm (but you could create more scenarios).
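
Concretely, given the format "%.7d-%.3d\n", $n + 60 * $i, $i, file 1's lines run from 0000061-001 to 0300060-001 and file 2's run from 0000121-002 to 0300120-002, so a two-file merge opens with 60 lines from file 1 alone before the interleaving starts:

0000061-001
    ...            (file 1 alone for 60 lines)
0000120-001
0000121-001        (files 1 and 2 interleave from here)
0000121-002
0000122-001
0000122-002
    ...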

I then created a merge program as shown in the 'Code' section below. There's nothing very fancy about it. It lifts some heap-management algorithms from Jon Bentley's 'More Programming Pearls'; the originals are quoted as comments. The only tricky bits were the sense of the comparison routine, which gave me some wrong answers to start with, and the levels of indirection.

When run on the 10 files representing ~35 MiB of data:

$ ls -l datafile.??
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.01
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.02
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.03
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.04
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.05
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.06
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.07
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.08
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.09
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 20:37 datafile.10
$ 
$ for file in datafile.??; do echo $file; sort -c $file; done
datafile.01
datafile.02
datafile.03
datafile.04
datafile.05
datafile.06
datafile.07
datafile.08
datafile.09
datafile.10
$ time ./merge datafile.?? > datafile.out

real    0m0.510s
user    0m0.314s
sys     0m0.074s
$ time ./merge datafile.?? > datafile.out

real    0m0.513s
user    0m0.317s
sys     0m0.080s
$ time sort -m datafile.?? > datafile.merge

real    0m10.061s
user    0m9.960s
sys     0m0.083s
$ time sort -m datafile.?? > datafile.merge

real    0m10.039s
user    0m9.921s
sys     0m0.082s
$ ls -l datafile.*
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.01
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.02
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.03
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.04
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.05
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.06
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.07
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.08
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.09
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 20:37 datafile.10
-rw-r--r--  1 jleffler  staff  36000000 Sep 15 20:42 datafile.merge
-rw-r--r--  1 jleffler  staff  36000000 Sep 15 20:41 datafile.out
$ cmp datafile.out datafile.merge
$ sort -c datafile.out
$ 

Interpreting these results:

  1. The first ls listing shows 10 data files.
  2. The for loop checks that the input files are individually in sorted order.
  3. The merge program I wrote runs in about 0.5s.
  4. The system sort in merge mode (-m) runs in about 10s (and I'm astonished that it takes so long when my non-optimized code handles it so much quicker).
  5. The second ls listing shows that the output files are the same size.
  6. The cmp command shows that the output files are identical.
  7. The sort -c command shows that the output files are in sorted order.

I separately and subsequently created 101 files, each with 1,000,000 records of 12 bytes each, in 52 seconds. I merged the files in 20 seconds (generating approximately 1179 MiB of output file — 1,236,000,000 bytes). The system sort merged them in 467 seconds, or 7m47s (aka 'forever'). It took about 90 seconds for sort -c to check that the output file was in sorted order. It took cmp less than 2 seconds to compare the two large files.

I'm intrigued that the system sort is so slow on merge.
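
One plausible factor, though this is an assumption on my part rather than something measured above, is locale-aware collation: sort compares lines according to the locale's collation rules, while the merge program uses a plain byte-wise strcmp(). Forcing the byte-order locale is often a large win for sort:

$ time LC_ALL=C sort -m datafile.?? > datafile.merge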


  • On my Mac, it is possible to merge ~35 MiB of data from 10 input files in about half a second.
  • The system sort can do the job in about ten seconds.
  • By inference (given that my machine is no longer a rocket, if it ever was), it should be possible to merge ~35 MiB on a Windows machine in under twenty seconds (allowing you a factor of 40 disparity in performance, which I don't think you should need).
  • So, your code taking thirty seconds is taking far too long — and the question is 'why'. You should look hard at your algorithms and data structures.

Code
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct File
{
    const char *file;
    FILE       *fp;
    char       *line;
    size_t      reserve;    /* Space allocated for line */
    size_t      length;     /* Space used by current line */
} File;

extern void err_exit(const char *fmt, ...);
extern void read_line(File *current);
extern void write_line(File *current);
extern void heapify(size_t *heap, size_t heap_size, File *inputs);
extern void siftup(size_t *heap, size_t lo, size_t hi, File *inputs);
extern void siftdown(size_t *heap, size_t lo, size_t hi, File *inputs);
extern int  compare(File *inputs, size_t i1, size_t i2);

const char *arg0;

int main(int argc, char **argv)
{
    File inputs[argc];
    arg0 = argv[0];

    /* Open files for reading - load first lines */
    for (int i = 1; i < argc; i++)
    {
        File *current = &inputs[i];
        current->file = argv[i];
        if ((current->fp = fopen(current->file, "r")) == 0)
            err_exit("failed to open file %s for reading", current->file);
        current->reserve = 128;
        if ((current->line = malloc(current->reserve)) == 0)
            err_exit("failed to allocate %zu bytes memory", current->reserve);
        current->length = 0;
        read_line(current);
    }

#if defined(CHECK_FUNDAMENTALS)
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s\n", i, inputs[i].length, inputs[i].file);
#endif

    size_t heap_size = argc - 1;
    size_t heap[argc];     // heap[0] unused

    for (int i = 1; i < argc; i++)
        heap[i] = i;

#if defined(CHECK_FUNDAMENTALS)
    printf("Heap before:\n");
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif

    heapify(heap, heap_size, inputs);

#if defined(CHECK_FUNDAMENTALS)
    printf("Heap after:\n");
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif

#if defined(CHECK_FUNDAMENTALS)
    printf("Compare two lines:\n");
    printf("1: %s\n", inputs[1].line);
    printf("2: %s\n", inputs[2].line);
    int r12 = compare(inputs, 1, 2);
    int r21 = compare(inputs, 2, 1);
    int r11 = compare(inputs, 1, 1);
    printf("1 vs 2: %d\n", r12);
    printf("2 vs 1: %d\n", r21);
    printf("1 vs 1: %d\n", r11);
#endif

    /* Main merge loop: heap[1] (the root of the min-heap) always indexes
    ** the input whose current line sorts first.
    */
    while (heap_size > 0)
    {
        File *current = &inputs[heap[1]];
        write_line(current);
        read_line(current);
        if (current->line == 0)         /* Input exhausted: move last leaf to root */
            heap[1] = heap[heap_size--];
        if (heap_size > 0)
        {
            siftdown(heap, 1, heap_size, inputs);   /* Restore the heap property */
#if defined(CHECK_FUNDAMENTALS)
            printf("Heap check:\n");
            for (int i = 1; i < argc; i++)
                printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif
        }
    }

    return 0;
}

/* Comparison for the heap: plain strcmp() on the current lines.  Note the
** sense: siftup()/siftdown() stop when this is <= 0, so the result is a
** min-heap (smallest line at the root), the reverse of Bentley's max-heap
** originals quoted below.
*/
int compare(File *inputs, size_t i1, size_t i2)
{
    return strcmp(inputs[i1].line, inputs[i2].line);
}

/* Build the heap by sifting each successive element up into place. */
void heapify(size_t *heap, size_t heap_size, File *inputs)
{
    for (size_t i = 1; i <= heap_size; i++)
        siftup(heap, 1, i, inputs);
}

/* J Bentley: More Programming Pearls
**
** Heap in array x, indexed from 1, not 0 as is normal in C.
** Implementation will allocate but not use array[0].
**  
**  function siftup(l, u,    i, p) {
**          # pre  maxheap(l, u-1)
**          # post maxheap(l, u)
**          i = u
**          while (1) {
**                  # maxheap(l, u) except between i and its parent
**                  if (i <= l) break
**                  p = int(i/2)
**                  if (x[p] >= x[i]) break
**                  swap(p, i)
**                  i = p
**          }
**  }
**  
**  function siftdown(l, u,  i, c) {
**          # pre  maxheap(l+1, u)
**          # post maxheap(l,u)
**          i = l
**          while (1) {
**                  # maxheap(l, u) except between i and its children
**                  c = 2*i
**                  if (c > u) break
**                  if (c + 1 <= u && x[c+1] > x[c]) c++
**                  if (x[i] >= x[c]) break
**                  swap(c, i)
**                  i = c
**          }
**  }
*/

void siftup(size_t *heap, size_t lo, size_t hi, File *inputs)
{
    size_t i = hi;
    while (1)
    {
        if (i <= lo)
            break;
        size_t p = i / 2;
        if (compare(inputs, heap[p], heap[i]) <= 0)
            break;
        size_t t = heap[p];
        heap[p] = heap[i];
        heap[i] = t;
        i = p;
    }
}

void siftdown(size_t *heap, size_t lo, size_t hi, File *inputs)
{
    size_t i = lo;
    while (1)
    {
        size_t c = 2 * i;
        if (c > hi)
            break;
        if (c + 1 <= hi && compare(inputs, heap[c+1], heap[c]) < 0)
            c++;
        if (compare(inputs, heap[i], heap[c]) <= 0)
            break;
        size_t t = heap[c];
        heap[c] = heap[i];
        heap[i] = t;
        i = c;
    }
}

void read_line(File *current)
{
    char buffer[4096];
    if (fgets(buffer, sizeof(buffer), current->fp) != 0)
    {
        size_t length = strlen(buffer) + 1;
        if (length > current->reserve)
        {
            void *space = realloc(current->line, length);
            if (space == 0)
                err_exit("failed to reallocate %zu bytes memory", length);
            current->line = space;
            current->reserve = length;
        }
        strcpy(current->line, buffer);
        current->length = length - 1;
    }
    else
    {
        fclose(current->fp);
        current->fp = 0;
        free(current->line);
        current->line = 0;
        current->length = 0;
        current->reserve = 0;
    }
}

void write_line(File *current)
{
    if (fwrite(current->line, sizeof(char), current->length, stdout) != current->length)
        err_exit("short write of line from %s of length %zu", current->file, current->length);
}

void err_exit(const char *fmt, ...)
{
    int errnum = errno;
    va_list args;
    fprintf(stderr, "%s: ", arg0);
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    if (errnum != 0)
        fprintf(stderr, " (%d: %s)", errnum, strerror(errnum));
    putc('\n', stderr);
    exit(EXIT_FAILURE);
}
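
For reference, a plausible way to build and run the program above (merge.c is my name for the source file, not one given in the answer; the VLAs and loop-scope declarations require C99 or later):

$ gcc -O3 -std=c99 -o merge merge.c
$ ./merge datafile.?? > datafile.out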

The code is designed to handle lines up to 4 KiB long; it would not be hard to revise it to use POSIX getline() to handle even longer lines. It assumes that all the files can be open at once (which means an upper limit of around 250 input files on most machines, without tweaking limits). If it can't open all the files, it stops, rather than doing multiple merges to intermediate files.
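
Here is a sketch of what that getline() revision of read_line() might look like; this is my own untested variant, not part of the original answer. It reuses the File fields from the program above, and on some systems needs #define _POSIX_C_SOURCE 200809L before the headers so that getline() and ssize_t are declared:

void read_line(File *current)
{
    /* getline() grows current->line as needed; reserve tracks the allocation */
    ssize_t nread = getline(&current->line, &current->reserve, current->fp);
    if (nread >= 0)
        current->length = (size_t)nread;    /* byte count includes the newline */
    else
    {
        /* EOF (or error): retire this input, as the original does */
        fclose(current->fp);
        current->fp = 0;
        free(current->line);
        current->line = 0;
        current->length = 0;
        current->reserve = 0;
    }
}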
