How to parallelize reading lines from an input file when lines get independently processed?


Question

I just started using OpenMP with C++. My serial C++ code looks something like this:

#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <fstream>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    std::string line;
    std::ifstream inputfile(argv[1]);

    if(inputfile.is_open()) {
        while(std::getline(inputfile, line)) {
            // Line gets processed and written into an output file
        }
    }
}

Because each line is processed pretty much independently and the input file is on the order of gigabytes, I am trying to parallelize this with OpenMP. My guess is that I first need to get the number of lines in the input file and then parallelize the code as shown below. Can someone please help me out here?

#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <fstream>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char* argv[]) {
    std::string line;
    std::ifstream inputfile(argv[1]);

    if(inputfile.is_open()) {
        //Calculate number of lines in file?
        //Set an output filename and open an ofstream
        #pragma omp parallel num_threads(8)
        {
            #pragma omp for schedule(dynamic, 1000)
            for(int i = 0; i < lines_in_file; i++) {
                 //What do I do here? I cannot just read any line because it requires random access
            }
        }
    }
}

Edit:

Important points:


  1. Each line is processed independently.

  2. The order of the results does not matter.


Recommended answer

Not a direct OpenMP answer, but what you are probably looking for is a Map/Reduce approach. Take a look at Hadoop: it's written in Java, but there is at least some C++ API.

In general, you want to process this amount of data on different machines rather than in multiple threads of the same process (virtual address space limitations, lack of physical memory, swapping, etc.). Also, the kernel will have to bring the disk file in sequentially anyway (which is what you want; otherwise the hard drive will just have to do extra seeks for each of your threads).
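
If you still want to stay on a single machine with OpenMP, one possible way to respect the sequential-read point above is to let one thread read the file in batches and have all threads process each batch. The following is only a minimal sketch under assumptions: process_line, read_batch and process_stream are hypothetical names that do not appear in the question, reversing the line merely stands in for the real per-line work, and the chunk size of 64 is an arbitrary choice. Without -fopenmp (or an equivalent flag) the pragma is ignored and the code simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>   // std::istringstream, handy for trying the sketch without a file
#include <string>
#include <vector>

// Hypothetical per-line work: reversing the line stands in for the real processing.
std::string process_line(const std::string& line) {
    return std::string(line.rbegin(), line.rend());
}

// Reads up to batch_size lines sequentially from `in` into `batch`.
// Returns false once there is nothing left to read.
bool read_batch(std::istream& in, std::size_t batch_size,
                std::vector<std::string>& batch) {
    batch.clear();
    std::string line;
    // The size check comes first so no line is consumed and then dropped.
    while (batch.size() < batch_size && std::getline(in, line)) {
        batch.push_back(line);
    }
    return !batch.empty();
}

// One thread reads sequentially; the OpenMP loop processes each batch in parallel.
// Returns the number of lines processed.
std::size_t process_stream(std::istream& in, std::ostream& out,
                           std::size_t batch_size) {
    assert(batch_size > 0);
    std::vector<std::string> batch;
    std::vector<std::string> results;
    std::size_t total = 0;

    while (read_batch(in, batch_size, batch)) {
        results.assign(batch.size(), std::string());
        const long n = static_cast<long>(batch.size());

        // Each iteration touches only its own slot, so no locking is needed.
        #pragma omp parallel for schedule(dynamic, 64)
        for (long i = 0; i < n; ++i) {
            const std::size_t k = static_cast<std::size_t>(i);
            results[k] = process_line(batch[k]);
        }

        // Output is written by a single thread, after the parallel loop.
        for (const std::string& r : results) {
            out << r << '\n';
        }
        total += batch.size();
    }
    return total;
}
```

To use it on a real file, you would pass a std::ifstream and a std::ofstream to process_stream together with a batch size large enough to keep all threads busy (for example a few tens of thousands of lines). Reading and processing do not overlap in this sketch; whether that matters depends on how expensive the per-line work is compared with the disk read.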
