Copy large data file using parallel I/O
Problem description
I have a fairly big data set, about 141M lines, in .csv format. I want to use MPI commands with C++ to copy and manipulate a few columns, but I'm new to both C++ and MPI.
So far my code looks like this:
#include <stdio.h>
#include "mpi.h"
using namespace std;
int main(int argc, char **argv)
{
    int i, rank, nprocs, bufsize, N = 4;
    MPI_File fp, fpwrite; // file handles
    MPI_Status status;
    MPI_Offset filesize, offset;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int buf[N];
    for (i = 0; i < N; i++)
        buf[i] = i;
    offset = rank * (N / nprocs) * sizeof(int);
    MPI_File_open(MPI_COMM_WORLD, "new.csv", MPI_MODE_RDONLY, MPI_INFO_NULL, &fp);
    MPI_File_open(MPI_COMM_WORLD, "Ntest.csv", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fpwrite);
    MPI_File_get_size(fp, &filesize); // must come after the file is opened
    MPI_File_read(fp, buf, N, MPI_INT, &status);
    // printf("\nrank: %d, buf[%d]: %d\n", rank, rank*bufsize, buf[0]);
    printf("My rank is: %d\n", rank);
    MPI_File_write_at(fpwrite, offset, buf, (N / nprocs), MPI_INT, &status);
    /* // repeat the process again
    MPI_Barrier(MPI_COMM_WORLD);
    printf("2/ My rank is: %d\n", rank); */
    MPI_File_close(&fp);
    MPI_File_close(&fpwrite);
    MPI_Finalize();
}
I'm not sure where to start, and I've seen a few examples with lustre stripes. I would like to go that direction if possible. Additional options include HDF5 and T3PIO.
You are way too early to worry about lustre stripes, aside from the fact that lustre stripes are by default something ridiculously small for a "parallel file system". Increase the stripe size of the directory where you will write and read these files with lfs setstripe.
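For instance, something along these lines (the directory path and the stripe count/size values are placeholders; check your site's recommendations, and the current layout with lfs getstripe, before changing anything):

```shell
# Show the current striping of the target directory
lfs getstripe /scratch/username/csvdata

# Stripe new files in this directory across 16 OSTs, 4 MiB per stripe
lfs setstripe -c 16 -S 4m /scratch/username/csvdata
```

Note that striping only applies to files created after the setting changes; existing files keep their old layout until they are copied.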
Your first challenge will be how to decompose this CSV file. What does a typical row look like? If the rows are of variable length, you're going to have a bit of a headache. Here's why:
consider a CSV file with 3 rows and 3 MPI processes.

- One row is aa,b,c (8 bytes).
- Another row is aaaaaaa,bbbbbbb,ccccccc (24 bytes).
- A third row is ,,c (4 bytes).
(darnit, markdown, how do I make this list start at zero?)
Rank 0 can read from the beginning of the file, but where will ranks 1 and 2 start? If you simply divide the total size (8+24+4=36) by 3, then the decomposition is

- rank 0 ends up reading aa,b,c\naaaaaa,
- rank 1 reads a,bbbbbbb,ccc,
- and rank 2 reads cccc\n,,c\n
The two approaches to unstructured text input are as follows. One option is to index your file, either after the fact or as the file is being generated. This index would store the beginning offset of every row. Rank 0 reads the offsets, then broadcasts them to everyone else.
The second option is to do this initial decomposition by file size, then fix up the splits. In the above simple example, rank 0 would send everything after the newline to rank 1. Rank 1 would receive the new data and glue it to the beginning of its row and send everything after its own newline to rank 2. This is extremely fiddly and I would not suggest it for someone just starting MPI-IO.
HDF5 is a good option here! Instead of trying to write your own parallel CSV parser, have your CSV creator generate an HDF5 dataset. HDF5, among other features, will keep that index I mentioned for you, so you can set up hyperslabs and do parallel reading and writing.