C++ - How to chunk a file for simultaneous/async processing?

Question

How does one read and split/chunk a file by the number of lines?

I would like to partition a file into separate buffers, while ensuring that a line is not split up between two or more buffers. I plan on passing these buffers into their own pthreads so they can perform some type of simultaneous/asynchronous processing.

I've read the answer to "reading and writing in chunks on linux using c", but I don't think it exactly answers the question of how to make sure that a line is not split across two or more buffers.

Solution

I would choose a chunk size in bytes. Then I would seek to the appropriate location in the file and read some smallish number of bytes at a time until I got a newline.

The first chunk's last character is the newline. The second chunk's first character is the character after the newline.
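To make the boundary rule concrete, here is a minimal in-memory sketch. It is not the file-based code from the example program: `chunk_boundaries` is a hypothetical helper of my own, and it assumes the whole input is already in a `std::string`. It picks a target chunk size in bytes and then pushes each boundary forward to just past the next newline.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original answer): compute line-aligned
// chunk boundaries for data that is already in memory.
// Returns offsets b such that chunk i is the half-open range [b[i], b[i+1]).
// Every chunk except possibly the last ends with '\n'. The result may hold
// fewer chunks than requested; an empty input yields no chunks at all.
std::vector<std::size_t> chunk_boundaries(const std::string &data,
                                          std::size_t nchunks)
{
   std::vector<std::size_t> bounds{0};
   // Target chunk size in bytes, rounded up so it is never zero for real data.
   const std::size_t target =
      nchunks ? (data.size() + nchunks - 1) / nchunks : data.size();

   while (bounds.back() < data.size()) {
      const std::size_t want = bounds.back() + target;
      if (want >= data.size()) {
         bounds.push_back(data.size());   // last chunk takes the remainder
         break;
      }
      // Look for a newline starting at the last byte of the tentative chunk,
      // so a chunk that already ends on a newline is not extended.
      const std::size_t nl = data.find('\n', want - 1);
      if (nl == std::string::npos) {
         bounds.push_back(data.size());   // no more newlines: one final chunk
         break;
      }
      bounds.push_back(nl + 1);           // next chunk starts after the newline
   }
   return bounds;
}
```

For the input `"aa\nbbbb\nc\ndd\n"` and two requested chunks, the boundaries come out as 0, 8 and 13, so the first chunk is `"aa\nbbbb\n"` and the second is `"c\ndd\n"`.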

Always seek to a page-size boundary and read one page at a time when searching for your newline (the page size is available from sysconf(_SC_PAGESIZE)). This tends to ensure that you pull only the minimum necessary from disk to find your boundaries. You could try reading something like 128 bytes at a time instead, but then you risk making more system calls.
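A page-aligned search could look roughly like the sketch below. This is my own illustration rather than code from the example program: `align_down` and `next_linestart_aligned` are hypothetical names, and the sketch assumes a regular file on a POSIX system. Because the first read starts at the page boundary at or before the desired offset, bytes before that offset have to be skipped when searching.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <system_error>
#include <vector>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: round an offset down to a multiple of the page size.
inline off_t align_down(off_t offset, off_t page)
{
   return offset - (offset % page);
}

// Hypothetical variant of the search: return the offset just past the first
// newline at or after `desired`, reading page-sized blocks that start on
// page boundaries. If no newline is found, returns the offset at which end
// of file was reached (or `desired` itself if that lies beyond the end).
off_t next_linestart_aligned(int fd, off_t desired)
{
   const off_t page = ::sysconf(_SC_PAGESIZE);
   std::vector<char> buf(static_cast<std::size_t>(page));
   off_t block = align_down(desired, page);

   for (;;) {
      const ssize_t got = ::pread(fd, buf.data(), buf.size(), block);
      if (got < 0) {
         throw std::system_error(errno, std::system_category(),
                                 "pread failed while searching for newline");
      }
      if (got == 0) {
         return std::max(block, desired);   // end of file
      }
      // Only the first block can contain bytes before `desired`; skip them.
      const off_t skip =
         std::min<off_t>(std::max<off_t>(desired - block, 0), got);
      const char *const first = buf.data() + skip;
      const char *const last = buf.data() + got;
      const char *const nl = std::find(first, last, '\n');
      if (nl != last) {
         return block + (nl - buf.data()) + 1;
      }
      block += got;   // stays page-aligned as long as reads are not short
   }
}
```

Whether the alignment actually pays off depends on the file system and the page cache, so treat it as something to measure rather than a guarantee.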

I wrote an example program, threaded_file_split.cpp, that does this for letter frequency counting. Splitting that task into threads is, of course, largely pointless, as it's almost certainly I/O bound. It also doesn't matter where the newlines are, because the task isn't line oriented. But it's just an example. It also relies heavily on you having a reasonably complete C++11 implementation.

The key function is this:

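#include <algorithm>     // std::find
#include <cerrno>        // errno
#include <cstddef>       // std::size_t
#include <system_error>  // std::system_error, std::system_category
#include <sys/types.h>   // off_t, ssize_t
#include <unistd.h>      // pread

// Given a desired offset, return the offset just past the next newline at or
// after it, i.e. the start of the next line. If end of file is reached
// first, return the offset at which reading stopped.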
off_t next_linestart(int fd, off_t start)
{
   using ::std::size_t;
   using ::ssize_t;
   using ::pread;

   const size_t bufsize = 4096;
   char buf[bufsize];

   for (bool found = false; !found;) {
      const ssize_t result = pread(fd, buf, bufsize, start);
      if (result < 0) {
         throw ::std::system_error(errno, ::std::system_category(),
                                   "Read failure trying to find newline.");
      } else if (result == 0) {
         // End of file
         found = true;
      } else {
         const char * const nl_loc = ::std::find(buf, buf + result, '\n');
         if (nl_loc != (buf + result)) {
            start += ((nl_loc - buf) + 1);
            found = true;
         } else {
            start += result;
         }
      }
   }
   return start;
}

Also notice that I use pread. This is absolutely essential when you have multiple threads reading from different parts of the file.

The file descriptor is a shared resource between your threads. When one thread reads from the file using ordinary functions it alters a detail about this shared resource, the file pointer. The file pointer is the position in the file at which the next read will occur.

Simply calling lseek before each read does not help, because it introduces a race condition: another thread can move the file pointer between your lseek and your read.

The pread function allows you to read a bunch of bytes from a specific location within the file. It is like combining an lseek and a read in the same call, except that it doesn't alter the file pointer at all.
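As a rough sketch of how several threads can share one descriptor this way, the code below starts one thread per chunk and has each thread count the newlines in its own byte range using only pread. The names `count_lines_in_range` and `count_lines_per_chunk` are hypothetical, the chunk boundaries are assumed to have been computed already, and counting lines just stands in for whatever processing you actually need.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <system_error>
#include <thread>
#include <vector>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical worker: count '\n' bytes in [begin, end) using only pread,
// so the shared file pointer of `fd` is never touched.
std::size_t count_lines_in_range(int fd, off_t begin, off_t end)
{
   char buf[4096];
   std::size_t lines = 0;
   while (begin < end) {
      const std::size_t want = static_cast<std::size_t>(
         std::min<off_t>(sizeof buf, end - begin));
      const ssize_t got = ::pread(fd, buf, want, begin);
      if (got < 0) {
         throw std::system_error(errno, std::system_category(),
                                 "pread failed");
      }
      if (got == 0) {
         break;   // file is shorter than the requested range
      }
      lines += static_cast<std::size_t>(std::count(buf, buf + got, '\n'));
      begin += got;
   }
   return lines;
}

// Hypothetical driver: `bounds` holds chunk boundaries, so chunk i is
// [bounds[i], bounds[i + 1]). One thread is started per chunk, and each
// thread writes only its own slot in `counts`.
// Note: an exception escaping a thread terminates the program; a real
// program would catch errors in the worker and report them to the caller.
std::vector<std::size_t> count_lines_per_chunk(int fd,
                                               const std::vector<off_t> &bounds)
{
   const std::size_t nchunks = bounds.size() < 2 ? 0 : bounds.size() - 1;
   std::vector<std::size_t> counts(nchunks, 0);
   std::vector<std::thread> workers;
   for (std::size_t i = 0; i < nchunks; ++i) {
      workers.emplace_back([&counts, &bounds, fd, i] {
         counts[i] = count_lines_in_range(fd, bounds[i], bounds[i + 1]);
      });
   }
   for (auto &w : workers) {
      w.join();
   }
   return counts;
}
```

On some toolchains this needs to be built with -pthread. Since no thread ever calls lseek or read, the file pointer of the shared descriptor is the same after the call as it was before.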
