Multi threaded reading from a file in c++?


Question


My application stores its data in a text file. I was testing for the fastest way of reading it by multi-threading the operation. I used the following 2 techniques:

1. Use as many streams as the NUMBER_OF_PROCESSORS environment variable indicates. Each stream is on a different thread. Divide the total number of lines in the file equally among the streams. Parse the text.

    2. Only one stream parses the entire file and loads the data in memory. Create threads (= NUMBER_OF_PROCESSORS - 1) to parse data from memory.

The test was run on various file sizes from 100 kB to 800 MB. Data in file:

    100.23123 -42343.342555 ...(and so on)
    4928340 -93240.2 349 ...
    ...
    

The data is stored in a 2D array of double.

    Result: Both methods take approximately the same time for parsing the file.

    Question: Which method should I choose?

Method 1 is bad for the hard disk, as multiple read accesses are performed at random locations simultaneously.

Method 2 is bad because the memory required is proportional to the file size. This can be partially overcome by limiting the container to a fixed size, deleting the parsed content and filling it again from the reader. But this increases the processing time.

Solution

Method 2 has a sequential bottleneck (the single-threaded reading and handing out of the work items). This will not scale indefinitely, according to Amdahl's Law. It is a very fair and reliable method, though.

Method 1 has no bottleneck and will scale. Be sure not to cause random IO on the disk. I'd use a mutex so that only one thread reads at a time. Read in big sequential blocks of maybe 4-16 MB. In the time the disk does a single head seek, it could have read about 1 MB of data.

    If parsing the lines takes a considerable amount of time, you can't use method 2 because of the big sequential part. It would not scale. If parsing is fast, though, use method 2 because it is easier to get right.

To illustrate the concept of a bottleneck: imagine 1,000,000 computation threads asking one reader thread to give them lines. That one reader thread would not be able to keep up handing out lines as quickly as they are demanded. You would not get 1e6 times the throughput. This would not scale. But if 1e6 threads read independently from a very fast IO device, you would get 1e6 times the throughput, because there is no bottleneck. (I have used extreme numbers to make the point. The same idea applies in the small.)
