Copy large data file using parallel I/O


Problem Description


I have a fairly big data set, about 141M lines, in .csv format. I want to use MPI commands with C++ to copy and manipulate a few columns, but I'm a newbie at both C++ and MPI.

So far my code looks like this:

    #include <stdio.h>
    #include "mpi.h"
    
    int main(int argc, char **argv)
    {
        int i, rank, nprocs, offset, N = 4;
        MPI_File fp, fpwrite; // file handles
        MPI_Status status;
        MPI_Offset filesize;
    
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    
        int buf[N];
        for (i = 0; i < N; i++)
            buf[i] = i;
        // each rank writes an equal share of the buffer at its own offset
        offset = rank * (N / nprocs) * sizeof(int);
    
        // the handle must be opened before it can be queried for its size
        MPI_File_open(MPI_COMM_WORLD, "new.csv", MPI_MODE_RDONLY, MPI_INFO_NULL, &fp);
        MPI_File_get_size(fp, &filesize);
    
        MPI_File_open(MPI_COMM_WORLD, "Ntest.csv", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fpwrite);
    
        // every rank reads the first N ints of the input...
        MPI_File_read(fp, buf, N, MPI_INT, &status);
    
        printf("My rank is: %d\n", rank);
        // ...and writes its share into the output at its own offset
        MPI_File_write_at(fpwrite, offset, buf, (N / nprocs), MPI_INT, &status);
    
        /* // repeat the process again
        MPI_Barrier(MPI_COMM_WORLD);
        printf("2/ My rank is: %d\n", rank); */
    
        MPI_File_close(&fp);
        MPI_File_close(&fpwrite);
        MPI_Finalize();
    }

I'm not sure where to start, and I've seen a few examples with Lustre stripes. I would like to go that direction if possible. Additional options include HDF5 and T3PIO.

Solution

You are way too early to worry about Lustre stripes, aside from the fact that Lustre stripes are by default something ridiculously small for a "parallel file system". Increase the stripe size of the directory where you will write and read these files with lfs setstripe.
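For example, an invocation along these lines sets an 8-way stripe with a 4 MiB stripe size on a directory (the count, size, and path are illustrative placeholders, not recommendations; check your site's defaults):

    lfs setstripe -c 8 -S 4m /path/to/output/dir

New files created in that directory will then be spread across 8 OSTs in 4 MiB chunks.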

Your first challenge will be how to decompose this CSV file. What does a typical row look like? If the rows are of variable length, you're going to have a bit of a headache. Here's why:

Consider a CSV file with 3 rows and 3 MPI processes.

1. One row is aa,b,c (8 bytes).
2. Another row is aaaaaaa,bbbbbbb,ccccccc (24 bytes).
3. The third row is ,,c (4 bytes).

(darnit, markdown, how do I make this list start at zero?)

Rank 0 can read from the beginning of the file, but where will ranks 1 and 2 start? If you simply divide the total size (8+24+4=36) by 3, then the decomposition is:

1. Rank 0 ends up reading aa,b,c\naaaaaa,
2. Rank 1 reads a,bbbbbbb,ccc, and
3. Rank 2 reads cccc\n,,c\n

The two approaches to unstructured text input are as follows. One option is to index your file, either after the fact or as the file is being generated. This index would store the beginning offset of every row. Rank 0 reads the offsets, then broadcasts them to everyone else (a sketch follows).
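Here is a minimal sketch of that index approach in C, assuming a hypothetical side file new.csv.idx that holds one MPI_Offset per row (the index file's name and layout are inventions for illustration; the question does not provide one):

    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"
    
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    
        long long nrows = 0;
        MPI_Offset *rowstart = NULL;
    
        // Rank 0 reads the whole index, then shares it with everyone.
        if (rank == 0) {
            FILE *f = fopen("new.csv.idx", "rb"); // assumed side file
            fseek(f, 0, SEEK_END);
            nrows = ftell(f) / (long long)sizeof(MPI_Offset);
            rewind(f);
            rowstart = malloc(nrows * sizeof(MPI_Offset));
            fread(rowstart, sizeof(MPI_Offset), nrows, f);
            fclose(f);
        }
        MPI_Bcast(&nrows, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);
        if (rank != 0)
            rowstart = malloc(nrows * sizeof(MPI_Offset));
        MPI_Bcast(rowstart, (int)nrows, MPI_OFFSET, 0, MPI_COMM_WORLD);
    
        // Each rank takes a contiguous block of rows; the index tells it
        // exactly which byte range of new.csv those rows occupy.
        long long lo = rank * nrows / nprocs;
        long long hi = (rank + 1) * nrows / nprocs; // one past our last row
    
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "new.csv", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        MPI_Offset begin = rowstart[lo];
        MPI_Offset end;
        if (hi < nrows)
            end = rowstart[hi];
        else
            MPI_File_get_size(fh, &end); // last rank reads to EOF
    
        // note: a real version must guard against (end - begin) > INT_MAX
        char *chunk = malloc(end - begin);
        MPI_File_read_at(fh, begin, chunk, (int)(end - begin), MPI_CHAR,
                         MPI_STATUS_IGNORE);
        // ... parse whole rows from chunk ...
    
        MPI_File_close(&fh);
        free(chunk);
        free(rowstart);
        MPI_Finalize();
    }

Broadcasting every offset costs each rank about 1.1 GB at 141M rows; scattering only each rank's slice of the index (plus one extra offset marking the end of its block) would be leaner.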

The second option is to do the initial decomposition by file size, then fix up the splits. In the above simple example, rank 0 would send everything after the newline to rank 1. Rank 1 would receive the new data, glue it to the beginning of its row, and send everything after its own newline to rank 2. This is extremely fiddly and I would not suggest it for someone just starting MPI-IO (a sketch follows anyway).
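A rough, purely illustrative sketch of the fix-up option. Unlike the rank-by-rank chain described above, this version ships each rank's partial first line backward to its predecessor, so every boundary is repaired in a single parallel exchange:

    #include <stdlib.h>
    #include "mpi.h"
    
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    
        MPI_File fh;
        MPI_Offset filesize;
        MPI_File_open(MPI_COMM_WORLD, "new.csv", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_get_size(fh, &filesize);
    
        // naive byte decomposition; the last rank takes the remainder
        MPI_Offset off = rank * (filesize / nprocs);
        int len = (rank == nprocs - 1) ? (int)(filesize - off)
                                       : (int)(filesize / nprocs);
        char *buf = malloc(len);
        MPI_File_read_at_all(fh, off, buf, len, MPI_CHAR, MPI_STATUS_IGNORE);
    
        // bytes up to and including our first newline belong to the
        // previous rank's last row
        int head = 0;
        while (head < len && buf[head] != '\n')
            head++;
        if (head < len)
            head++; // keep the '\n' with the fragment
    
        int prev = (rank == 0) ? MPI_PROC_NULL : rank - 1;
        int next = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;
    
        // ship our fragment left; receive our successor's fragment
        int incoming = 0;
        MPI_Sendrecv(&head, 1, MPI_INT, prev, 0,
                     &incoming, 1, MPI_INT, next, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        char *tail = malloc(incoming);
        MPI_Sendrecv(buf, head, MPI_CHAR, prev, 1,
                     tail, incoming, MPI_CHAR, next, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    
        // rank 0 owns its chunk from byte 0; everyone else starts just
        // past their first newline. Whole rows are now:
        //   buf[start..len) followed by tail[0..incoming)
        int start = (rank == 0) ? 0 : head;
        (void)start;
    
        MPI_File_close(&fh);
        free(tail);
        free(buf);
        MPI_Finalize();
    }

One edge case the sketch ignores: a row longer than an entire chunk, in which case head == len and the fragment would have to keep propagating left.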

HDF5 is a good option here! Instead of trying to write your own parallel CSV parser, have your CSV creator generate an HDF5 dataset. HDF5, among other features, will keep that index I mentioned for you, so you can set up hyperslabs and do parallel reading and writing.
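To make the hyperslab idea concrete, here is a minimal sketch using the parallel HDF5 C API; the file name data.h5, the 1-D dataset name "columns", and the double element type are all assumptions for illustration:

    #include <stdlib.h>
    #include "hdf5.h"
    #include "mpi.h"
    
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    
        // open the file through the MPI-IO driver
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, fapl); // assumed file
        hid_t dset = H5Dopen2(file, "columns", H5P_DEFAULT);   // assumed dataset
    
        // carve the dataset into one contiguous hyperslab per rank
        hid_t fspace = H5Dget_space(dset);
        hsize_t total;
        H5Sget_simple_extent_dims(fspace, &total, NULL);
        hsize_t start = rank * (total / nprocs);
        hsize_t count = (rank == nprocs - 1) ? total - start : total / nprocs;
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
        hid_t mspace = H5Screate_simple(1, &count, NULL);
    
        // collective read: all ranks participate in one coordinated I/O call
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        double *vals = malloc(count * sizeof(double)); // assumed element type
        H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, vals);
    
        // ... work on vals ...
    
        free(vals);
        H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
    }

The collective transfer property (H5FD_MPIO_COLLECTIVE) lets the MPI-IO layer combine the ranks' requests into fewer, larger file system operations.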
