Access File through multiple threads

Question

I want to read a large file (anywhere from 30 MB to 1 GB) with 10 threads, process each line, and write the results to another file, again with 10 threads. If only one thread does the I/O, the other threads are blocked. Processing a line takes roughly as long as reading it from the file system. There is one more constraint: the data in the output file must be in the same order as in the input file.

I'd like your thoughts on the design of this system. Is there an existing API that supports concurrent access to files?

Also, writing to the same file from several threads may lead to deadlock.

Please suggest how to achieve this, given that I am concerned about processing time.

Recommended answer


  • You should abstract from the file reading. Create a class that reads the file and dispatches its content to a variable number of threads.

    The class shouldn't dispatch plain strings; it should wrap them in a Line class that carries meta information, e.g. the line number, since you want to keep the original sequence.


      • You need a processing class that does the actual work on the collected data. In your case there is no work to do; the class just stores the information, but you can extend it later to do additional things (e.g. reverse the string, append other strings, ...).

      Then you need a merger class that performs a kind of multiway merge sort over the output of the processing threads and collects all references to the Line instances in sequence.

      The merger class could also write the data back to a file, but to keep the code clean...


      • I'd recommend creating an output class that again abstracts from all the file handling.
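The classes described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the answer's actual code: the names `Line`, `MergePipeline`, and `run` are hypothetical, the input is an in-memory list standing in for the reader class, and lines are dealt to workers round-robin so that each worker's results are already sorted by line number before the k-way merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

/** A line of text tagged with its position in the input (hypothetical shape). */
final class Line {
    final long number;
    final String text;
    Line(long number, String text) { this.number = number; this.text = text; }
}

/** Hypothetical sketch: reader, processing threads, and merger in one method. */
final class MergePipeline {
    static List<String> run(List<String> input, int workers, UnaryOperator<String> work) {
        // Reader role: wrap each string in a Line and deal the lines out round-robin.
        List<List<Line>> shares = new ArrayList<>();
        for (int w = 0; w < workers; w++) shares.add(new ArrayList<>());
        for (int i = 0; i < input.size(); i++) {
            shares.get(i % workers).add(new Line(i, input.get(i)));
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Processing role: each worker handles its share in order,
            // so its result list stays sorted by line number.
            List<Future<List<Line>>> results = new ArrayList<>();
            for (List<Line> share : shares) {
                results.add(pool.submit(() -> {
                    List<Line> done = new ArrayList<>(share.size());
                    for (Line l : share) done.add(new Line(l.number, work.apply(l.text)));
                    return done;
                }));
            }
            List<List<Line>> sorted = new ArrayList<>();
            for (Future<List<Line>> f : results) sorted.add(f.get());

            // Merger role: k-way merge. Heap entries are {worker index, position},
            // ordered by the line number they point at.
            PriorityQueue<int[]> heads = new PriorityQueue<>(
                    Comparator.comparingLong((int[] h) -> sorted.get(h[0]).get(h[1]).number));
            for (int w = 0; w < sorted.size(); w++) {
                if (!sorted.get(w).isEmpty()) heads.add(new int[] {w, 0});
            }
            List<String> output = new ArrayList<>(input.size());
            while (!heads.isEmpty()) {
                int[] h = heads.poll();
                List<Line> list = sorted.get(h[0]);
                output.add(list.get(h[1]).text);
                if (h[1] + 1 < list.size()) heads.add(new int[] {h[0], h[1] + 1});
            }
            return output;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `MergePipeline.run(lines, 10, String::toUpperCase)` would return the processed lines in input order; an output class would then write that list to the target file from a single thread.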

      Of course, this approach needs a lot of memory. If you are short on main memory, you'd need a stream-based approach that works more or less in place to keep the memory overhead small.

      UPDATE: Stream-based approach

      Everything stays the same, except:

      The reader thread pumps the data it reads into a Balloon. This balloon can hold a fixed number of Line instances (the bigger the number, the more main memory you consume).

      The processing threads take Lines from the balloon, and the reader pumps more lines into the balloon as it empties.

      The merger class takes the lines from the processing threads as above, and the writer writes the data back to a file.
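One way to picture the balloon is as a bounded blocking queue. The sketch below is an assumption-laden illustration rather than the answer's implementation: `BalloonPipeline`, `run`, and the `END` sentinel are hypothetical names, the balloon is an `ArrayBlockingQueue`, and the multiway merge is swapped for a simpler reorder buffer that releases a line as soon as all earlier line numbers have been written.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

/** A line of text tagged with its position in the input (hypothetical shape). */
final class Line {
    final long number;
    final String text;
    Line(long number, String text) { this.number = number; this.text = text; }
}

final class BalloonPipeline {
    private static final Line END = new Line(-1, null); // sentinel: no more input

    static void run(Iterator<String> source, Consumer<String> sink,
                    int workers, int capacity, UnaryOperator<String> work) {
        BlockingQueue<Line> balloon = new ArrayBlockingQueue<>(capacity);
        BlockingQueue<Line> processed = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers + 1);
        try {
            // Reader thread: put() blocks while the balloon is full.
            pool.submit(() -> {
                long n = 0;
                while (source.hasNext()) balloon.put(new Line(n++, source.next()));
                for (int w = 0; w < workers; w++) balloon.put(END); // one per worker
                return null;
            });
            // Processing threads: drain the balloon until the sentinel arrives.
            for (int w = 0; w < workers; w++) {
                pool.submit(() -> {
                    try {
                        for (Line l = balloon.take(); l != END; l = balloon.take()) {
                            processed.put(new Line(l.number, work.apply(l.text)));
                        }
                    } finally {
                        processed.put(END); // always signal completion, even on failure
                    }
                    return null;
                });
            }
            // Merger and writer (calling thread): hold back lines that finished early.
            Map<Long, String> early = new HashMap<>();
            long next = 0;
            int finished = 0;
            while (finished < workers) {
                Line l = processed.take();
                if (l == END) { finished++; continue; }
                early.put(l.number, l.text);
                while (early.containsKey(next)) {
                    sink.accept(early.remove(next));
                    next++;
                }
            }
            if (!early.isEmpty()) {
                throw new IllegalStateException("a worker failed before line " + next);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The `capacity` argument bounds only the unprocessed lines. The reorder buffer can still grow if one line takes much longer than its successors, so a stricter memory bound would need extra back-pressure on the merger side.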

      Maybe you should use FileChannel in the I/O threads, since it is better suited to reading big files and probably consumes less memory while handling the file (but that's just an educated guess).
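As a rough illustration of the FileChannel suggestion, the sketch below streams lines from a channel without loading the whole file. The class and method names (`ChannelLineReader`, `forEachLine`) are hypothetical, UTF-8 input is assumed, and whether this actually beats a plain `BufferedReader` on a given system would need measuring.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Consumer;

final class ChannelLineReader {
    /** Passes each line of the file to the consumer, one at a time. */
    static void forEachLine(Path file, Consumer<String> consumer) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
             BufferedReader reader = new BufferedReader(
                     Channels.newReader(channel, StandardCharsets.UTF_8.name()))) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                consumer.accept(line);
            }
        } catch (IOException e) {
            // Wrapped so callers such as a reader thread need no checked exception.
            throw new UncheckedIOException(e);
        }
    }
}
```

The consumer here is where a reader thread would wrap each string in a Line and hand it to the processing threads.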
