What is the fastest way to read several lines of data from a large file


Problem description

My application needs to read thousands of lines from a large csv file of around 300 GB with about a billion lines; each line contains several numbers. The data looks like this:

1, 34, 56, 67, 678, 23462, ...
2, 3, 6, 8, 34, 5
23,547, 648, 34657 ...
...
...

I tried reading the file line by line with fgets in C, but it took a really long time; even wc -l in Linux took quite a while just to go through all of the lines.

I also tried writing all the data into an sqlite3 database according to the application's logic. However, that data structure is different from the csv file above: it has 100 billion rows with only two numbers per row. I then created two indices on top of them, which resulted in a 2.5 TB database, whereas it was 1 TB without the indices. Since the indices are larger than the data and a query has to read the whole 1.5 TB of indices, I don't think the database approach makes sense, right?

So I would like to ask: what is the quickest way to read several lines from a large csv file with a billion lines, in C or Python? And by the way, is there a formula or rule of thumb to estimate the time needed to read the file in relation to the amount of RAM?

Environment: Linux, 200 GB RAM, C, Python

Answer

Requirements

  • Huge csv file, several hundred GB in size
  • Each line contains several numbers
  • The program must extract several thousand lines per run
  • The program works on the same file several times, extracting different lines each time

Since the lines in the csv file have variable length, you would have to read the entire file to get to the data of the required lines. Sequential reading of the entire file would still be very slow, even if the file reading were optimized as much as possible. A good indicator is in fact the runtime of wc -l, as the OP already mentioned in the question.

Instead, one should optimize at the algorithmic level. A one-time preprocessing of the data is necessary, which then allows fast access to specific lines without reading the whole file.

There are several possible approaches, for example:

  1. Use a database with an index
  2. Programmatically create an index file (associating line numbers with file offsets)
  3. Convert the csv file into a binary file with a fixed record format

The OP's test shows that approach 1 led to 1.5 TB of indices. Approach 2, a small program that maps a line number to a file offset, is certainly also a possibility (a sketch follows below). Finally, approach 3 allows the file offset of a line to be calculated without a separate index file. This approach is particularly useful if the maximum number of values per line is known; otherwise, approaches 2 and 3 are very similar.
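
For illustration, here is a minimal sketch of approach 2 (not part of the original answer; file and function names are made up): a one-time pass over the csv file records the byte offset at which each line starts into a separate index file, and a lookup then reads the stored offset for a given line number and seeks directly to it in the csv file.

/* index2.c - hypothetical sketch of approach 2: an index file that stores
 * the starting byte offset of every csv line as a uint64_t. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* One (slow, but one-time) pass over the csv: write the offset of each line start. */
static void build_index(FILE *csv, FILE *idx) {
    uint64_t offset = 0;
    int c;
    fwrite(&offset, sizeof(offset), 1, idx);            /* line 0 starts at offset 0 */
    while ((c = fgetc(csv)) != EOF) {
        offset++;
        if (c == '\n')
            fwrite(&offset, sizeof(offset), 1, idx);    /* next line starts here */
    }
}

/* Look up the offset of line_nr in the index and read that csv line into buf. */
static void read_csv_line(FILE *csv, FILE *idx, uint64_t line_nr,
                          char *buf, size_t buf_size) {
    uint64_t offset;
    fseeko(idx, (off_t) (line_nr * sizeof(offset)), SEEK_SET);
    if (fread(&offset, sizeof(offset), 1, idx) != 1) {
        fprintf(stderr, "index read error\n");
        exit(EXIT_FAILURE);
    }
    fseeko(csv, (off_t) offset, SEEK_SET);
    if (fgets(buf, (int) buf_size, csv) == NULL) {
        fprintf(stderr, "csv read error\n");
        exit(EXIT_FAILURE);
    }
}

int main(int argc, char *argv[]) {
    /* For illustration only: builds the index and then reads one line. */
    if (argc != 4) {
        fprintf(stderr, "usage: index2 <csv> <index> <line_nr>\n");
        return EXIT_FAILURE;
    }
    FILE *csv = fopen(argv[1], "rb");
    FILE *idx = fopen(argv[2], "wb+");
    if (csv == NULL || idx == NULL) { perror("fopen"); return EXIT_FAILURE; }
    build_index(csv, idx);
    char buf[4096];
    read_csv_line(csv, idx, strtoull(argv[3], NULL, 10), buf, sizeof(buf));
    printf("%s", buf);
    fclose(csv);
    fclose(idx);
    return EXIT_SUCCESS;
}

For 3,000,000,000 lines such an index would occupy about 3e9 * 8 bytes ≈ 24 GB, which is small compared to the csv file itself, but each lookup still needs two seeks (index, then csv) instead of one.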

Approach 3 is explained in more detail below. There may be additional requirements that call for slight modifications, but the following should get things started.

A one-time preprocessing step is necessary: the textual csv lines are converted into int arrays and stored in binary form, with a fixed record format, in a separate file. To read a particular line n, you can then simply calculate the file offset, e.g. with line_nr * (sizeof(int) * MAX_ELEMENTS_PER_LINE). Then fseeko(fp, offset, SEEK_SET) jumps to that offset and MAX_ELEMENTS_PER_LINE ints are read. This way you only read the data that you actually want to process.
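
As a worked example: with MAX_ELEMENTS_PER_LINE = 10 and 4-byte ints (the usual size on Linux x86-64), every record occupies 40 bytes, so line 1,000,000 starts at offset 1,000,000 * 40 = 40,000,000 bytes.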

This not only makes the program run much faster, it also requires very little main memory.

Test case

A test file with 3,000,000,000 lines was created. Each line contains up to 10 random int values, separated by commas.

In this case this resulted in a csv file of about 342 GB.
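
The answer does not show how the test file was generated; a minimal generator along these lines (a sketch under the assumption of 1 to 10 numbers per line, not the original author's code) would produce data in the described format:

/* generate.c - hypothetical sketch of a test-data generator:
 * writes 3,000,000,000 lines, each with 1..10 random ints separated by ", ".
 * Note: rand() is not seeded here, so the output is deterministic. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long long num_lines = 3000000000LL;
    for (long long i = 0; i < num_lines; i++) {
        int n = rand() % 10 + 1;               /* 1 to 10 numbers on this line */
        for (int j = 0; j < n; j++)
            printf(j ? ", %d" : "%d", rand());
        putchar('\n');
    }
    return 0;
}

Usage would be something like: ./generate > numbers.csv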

Quick test

time wc -l numbers.csv 

gives

187.14s user 74.55s system 96% cpu 4:31.48 total

This means that a purely sequential read of the file would take at least 4.5 minutes in total.
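
As a rough sanity check (assuming the file was actually read from disk and not from the page cache): 342 GB read in about 271 s corresponds to roughly 1.3 GB/s, which is already in the range of a fast SSD's sequential throughput, so there is little left to gain from tuning the sequential read itself.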

For the one-time preprocessing, a converter program reads each line and stores 10 binary ints per line. The converted file is called 'numbers_bin'. A quick test, accessing the data of 10,000 randomly selected lines:

time demo numbers_bin

gives

0.03s user 0.20s system 5% cpu 4.105 total

So instead of 4.5 minutes it takes 4.1 seconds for this specific example data, more than a factor of 65 faster.

Source code

This approach may sound more complicated than it actually is.

Let's start with the converter program. It reads the csv file and creates the binary fixed-format file.

The interesting part happens in the function pre_process: a line is read in a loop with getline, the numbers are extracted with strtok and strtol and put into an int array initialized to 0, and this array is then written to the output file with fwrite.

Errors during the conversion result in a message on stderr, and the program terminates.

convert.c

#include "data.h"
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <limits.h>

static void pre_process(FILE *in, FILE *out) {
    int *block = get_buffer();
    char *line = NULL;
    size_t line_capp = 0;

    while (getline(&line, &line_capp, in) > 0) {
        line[strcspn(line, "\n")] = '\0';
        memset(block, 0, sizeof(int) * MAX_ELEMENTS_PER_LINE);

        char *token;
        char *ptr = line;
        int i = 0;
        while ((token = strtok(ptr, ", ")) != NULL) {
            if (i >= MAX_ELEMENTS_PER_LINE) {
                fprintf(stderr, "too many elements in line");
                exit(EXIT_FAILURE);
            }
            char *end_ptr;
            errno = 0;
            long val = strtol(token, &end_ptr, 10);
            if (val > INT_MAX || val < INT_MIN || errno || *end_ptr != '\0' || end_ptr == token) {
                fprintf(stderr, "value error with '%s'\n", token);
                exit(EXIT_FAILURE);
            }
            ptr = NULL;
            block[i] = (int) val;
            i++;
        }
        fwrite(block, sizeof(int), MAX_ELEMENTS_PER_LINE, out);
    }
    free(block);
    free(line);
}


static void one_off_pre_processing(const char *csv_in, const char *bin_out) {
    FILE *in = get_file(csv_in, "rb");
    FILE *out = get_file(bin_out, "wb");
    pre_process(in, out);
    fclose(in);
    fclose(out);
}

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "usage: convert <in> <out>\n");
        exit(EXIT_FAILURE);
    }
    one_off_pre_processing(argv[1], argv[2]);
    return EXIT_SUCCESS;
}

data.h

A few auxiliary functions are used; they are more or less self-explanatory.

#ifndef DATA_H
#define DATA_H

#include <stdio.h>
#include <stdint.h>

#define NUM_LINES 3000000000LL
#define MAX_ELEMENTS_PER_LINE 10

void read_data(FILE *fp, uint64_t line_nr, int *block);
FILE *get_file(const char *const file_name, char *mode);
int *get_buffer();

#endif //DATA_H

data.c

#include "data.h"
#include <stdlib.h>

void read_data(FILE *fp, uint64_t line_nr, int *block) {
    off_t offset = line_nr * (sizeof(int) * MAX_ELEMENTS_PER_LINE);
    fseeko(fp, offset, SEEK_SET);
    if(fread(block, sizeof(int), MAX_ELEMENTS_PER_LINE, fp) != MAX_ELEMENTS_PER_LINE) {
        fprintf(stderr, "data read error for line %lld", line_nr);
        exit(EXIT_FAILURE);
    }
}

FILE *get_file(const char *const file_name, char *mode) {
    FILE *fp;
    if ((fp = fopen(file_name, mode)) == NULL) {
        perror(file_name);
        exit(EXIT_FAILURE);
    }
    return fp;
}

int *get_buffer() {
    int *block = malloc(sizeof(int) * MAX_ELEMENTS_PER_LINE);
    if(block == NULL) {
        perror("malloc failed");
        exit(EXIT_FAILURE);
    }
    return block;
}

demo.c

And finally a demo program that reads the data of 10,000 randomly chosen lines.

The function request_lines picks 10,000 random line numbers, which are sorted with qsort; the data for these lines is then read. Some lines of the code are commented out: if you uncomment them, the data that was read is printed to the console.

#include "data.h"
#include <stdlib.h>
#include <assert.h>
#include <sys/stat.h>


static int comp(const void *lhs, const void *rhs) {
    uint64_t l = *((uint64_t *) lhs);
    uint64_t r = *((uint64_t *) rhs);
    if (l > r) return 1;
    if (l < r) return -1;
    return 0;
}

static uint64_t *request_lines(uint64_t num_lines, int num_request_lines) {
    assert(num_lines < UINT32_MAX);
    uint64_t *request_lines = malloc(sizeof(*request_lines) * num_request_lines);

    for (int i = 0; i < num_request_lines; i++) {
        request_lines[i] = arc4random_uniform(num_lines); /* on Linux this needs glibc >= 2.36 or libbsd */
    }
    qsort(request_lines, num_request_lines, sizeof(*request_lines), comp);

    return request_lines;
}


#define REQUEST_LINES 10000

int main(int argc, char *argv[]) {

    if (argc != 2) {
        fprintf(stderr, "usage: demo <file>\n");
        exit(EXIT_FAILURE);
    }

    struct stat stat_buf;
    if (stat(argv[1], &stat_buf) == -1) {
        perror(argv[1]);
        exit(EXIT_FAILURE);
    }

    uint64_t num_lines = stat_buf.st_size / (MAX_ELEMENTS_PER_LINE * sizeof(int));

    FILE *bin = get_file(argv[1], "rb");
    int *block = get_buffer();

    uint64_t *requests = request_lines(num_lines, REQUEST_LINES);
    for (int i = 0; i < REQUEST_LINES; i++) {
        read_data(bin, requests[i], block);
        //do sth with the data, 
        //uncomment the following lines to output the data to the console
//        printf("%llu: ", requests[i]);
//        for (int x = 0; x < MAX_ELEMENTS_PER_LINE; x++) {
//            printf("'%d' ", block[x]);
//        }
//        printf("\n");
    }

    free(requests);
    free(block);
    fclose(bin);

    return EXIT_SUCCESS;
}
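
For reference, a possible way to build and run the three programs (assuming the files above are saved as convert.c, data.h, data.c and demo.c; compiler and flags are just a suggestion):

cc -O2 -o convert convert.c data.c
cc -O2 -o demo demo.c data.c
./convert numbers.csv numbers_bin
time ./demo numbers_bin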

Summary

This approach provides much faster results than reading through the entire file sequentially (4 seconds instead of 4.5 minutes per run for the sample data), and it requires very little main memory.

The prerequisite is a one-time preprocessing of the data into a binary format. This conversion is quite time-consuming, but afterwards the data for specific lines can be read very quickly with a query program.

