MPI 从文本文件中读取 [英] MPI Reading from a text file

查看:19
本文介绍了MPI 从文本文件中读取的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在学习在 MPI 中编程,我遇到了这个问题.假设我有一个包含 100,000 行/行的 .txt 文件,我如何将它们分块以供 4 个处理器处理?即我想让处理器 0 处理 0-25000 行,处理器 1 处理 25001-50000 等等.我做了一些搜索,确实遇到了 MPI_File_seek,但我不确定它是否可以在 .txt 上工作并在之后支持 fscanf.

I am learning to program in MPI and I came across this question. Lets say I have a .txt file with 100,000 rows/lines, how do I chunk them for processing by 4 processors? i.e. I want to let processor 0 take care of the processing for lines 0-25000, processor 1 to take care of 25001-50000 and so on. I did some searching and did came across MPI_File_seek but I am not sure can it work on .txt and supports fscanf afterwards.

推荐答案

文本并不是并行处理的好格式,正是因为您不知道(比如说)第 25001 行从哪里开始.所以这类问题通常会通过一些预处理步骤提前处理,要么构建索引,要么将文件划分为适当数量的块供每个进程读取.

Text isn't a great format for parallel processing exactly because you don't know ahead of time where (say) line 25001 begins. So these sorts of problems are often dealt with ahead of time through some preprocessing step, either building an index or partitioning the file into the appropriate number of chunks for each process to read.

如果你真的想通过 MPI 来做到这一点,我建议使用 MPI-IO 将文本文件的重叠块读取到各种处理器上,其中重叠比你预期的最长行要长得多,然后让每个处理器就从哪里开始达成一致;例如,您可以说进程 N 和 N+1 共享的重叠区域中的第一个(或最后一个)新行是进程 N 离开和 N+1 开始的地方.

If you really want to do it through MPI, I'd suggest using MPI-IO to read in overlapping chunks of the text file onto the various processors, where the overlap is much longer than you expect your longest line to be, and then have each processor agree on where to start; eg, you could say that the first (or last) new line in the overlap region shared by processes N and N+1 is where process N leaves off and N+1 starts.

要跟进一些代码,

#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
    
void parprocess(MPI_File *in, MPI_File *out, const int rank, const int size, const int overlap) {
    MPI_Offset globalstart;
    int mysize;
    char *chunk;
    
    /* read in relevant chunk of file into "chunk",
     * which starts at location in the file globalstart
     * and has size mysize 
     */
    {
        MPI_Offset globalend;
        MPI_Offset filesize;
    
        /* figure out who reads what */
        MPI_File_get_size(*in, &filesize);
        filesize--;  /* get rid of text file eof */
        mysize = filesize/size;
        globalstart = rank * mysize;
        globalend   = globalstart + mysize - 1;
        if (rank == size-1) globalend = filesize-1;
    
        /* add overlap to the end of everyone's chunk except last proc... */
        if (rank != size-1)
            globalend += overlap;
    
        mysize =  globalend - globalstart + 1;
    
        /* allocate memory */
        chunk = malloc( (mysize + 1)*sizeof(char));
    
        /* everyone reads in their part */
        MPI_File_read_at_all(*in, globalstart, chunk, mysize, MPI_CHAR, MPI_STATUS_IGNORE);
        chunk[mysize] = '';
    }
    
    
    /*
     * everyone calculate what their start and end *really* are by going 
     * from the first newline after start to the first newline after the
     * overlap region starts (eg, after end - overlap + 1)
     */
    
    int locstart=0, locend=mysize-1;
    if (rank != 0) {
        while(chunk[locstart] != '
') locstart++;
        locstart++;
    }
    if (rank != size-1) {
        locend-=overlap;
        while(chunk[locend] != '
') locend++;
    }
    mysize = locend-locstart+1;
    
    /* "Process" our chunk by replacing non-space characters with '1' for
     * rank 1, '2' for rank 2, etc... 
     */
    
    for (int i=locstart; i<=locend; i++) {
        char c = chunk[i];
        chunk[i] = ( isspace(c) ? c : '1' + (char)rank );
    }

    
    /* output the processed file */
    
    MPI_File_write_at_all(*out, (MPI_Offset)(globalstart+(MPI_Offset)locstart), &(chunk[locstart]), mysize, MPI_CHAR, MPI_STATUS_IGNORE);
    
    return;
}
    
int main(int argc, char **argv) {
    
    MPI_File in, out;
    int rank, size;
    int ierr;
    const int overlap = 100;
    
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    
    if (argc != 3) {
        if (rank == 0) fprintf(stderr, "Usage: %s infilename outfilename
", argv[0]);
        MPI_Finalize();
        exit(1);
    }
    
    ierr = MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, MPI_INFO_NULL, &in);
    if (ierr) {
        if (rank == 0) fprintf(stderr, "%s: Couldn't open file %s
", argv[0], argv[1]);
        MPI_Finalize();
        exit(2);
    }
    
    ierr = MPI_File_open(MPI_COMM_WORLD, argv[2], MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &out);
    if (ierr) {
        if (rank == 0) fprintf(stderr, "%s: Couldn't open output file %s
", argv[0], argv[2]);
        MPI_Finalize();
        exit(3);
    }
    
    parprocess(&in, &out, rank, size, overlap);
    
    MPI_File_close(&in);
    MPI_File_close(&out);
    
    MPI_Finalize();
    return 0;
}

在问题文本的缩小版本上运行这个,我们得到

Running this on a narrow version of the text of the question, we get

$ mpirun -n 3 ./textio foo.in foo.out
$ paste foo.in foo.out
Hi guys I am learning to            11 1111 1 11 11111111 11
program in MPI and I came           1111111 11 111 111 1 1111
across this question. Lets          111111 1111 111111111 1111
say I have a .txt file with         111 1 1111 1 1111 1111 1111
100,000 rows/lines, how do          1111111 11111111111 111 11
I chunk them for processing         1 11111 1111 111 1111111111
by 4 processors? i.e. I want        22 2 22222222222 2222 2 2222
to let processor 0 take care        22 222 222222222 2 2222 2222
of the processing for lines         22 222 2222222222 222 22222
0-25000, processor 1 to take        22222222 222222222 2 22 2222
care of 25001-50000 and so          2222 22 22222222222 222 22
on. I did some searching and        333 3 333 3333 333333333 333
did came across MPI_File_seek       333 3333 333333 3333333333333
but I am not sure can it work       333 3 33 333 3333 333 33 3333
on .txt and supports fscanf         33 3333 333 33333333 333333
afterwards.                         33333333333

这篇关于MPI 从文本文件中读取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆