How to access a very large text file directly and efficiently?

Problem description

I have a very large text file (+10 GB) which I want to read for some data mining techniques. To do that, I use parallel processing with MPI, so that many processes can access the same file together.
In fact, I want each process to read N lines. Since the file is not structured (each line has the same number of fields, but each field can contain a different number of characters), I am forced to parse the file, and that is not parallel and takes a lot of time. Is there any way to jump directly to a specific line number without parsing and counting the lines? Thank you for your help.

Recommended answer

If your file isn't otherwise indexed, there is no direct way.

Indexing it might be worth it (scan it once to find all the line endings, and store the offsets of each line or chunk of lines). If you need to process the file multiple times, and it does not change, the cost of indexing it could be offset by the ease of using the index for further runs.
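
A minimal sketch of that one-time indexing pass in C (the file names and the raw-uint64 index format are illustrative assumptions, not anything the answer prescribes):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    FILE *in  = fopen("big.txt", "rb");     /* assumed input file */
    FILE *idx = fopen("big.txt.idx", "wb"); /* assumed index file */
    if (!in || !idx) { perror("fopen"); return 1; }

    uint64_t offset = 0;                    /* line 0 starts at byte 0 */
    fwrite(&offset, sizeof offset, 1, idx);

    char buf[1 << 16];                      /* 64 KiB read buffer */
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (size_t i = 0; i < n; i++) {
            offset++;
            if (buf[i] == '\n')             /* the next line starts here */
                fwrite(&offset, sizeof offset, 1, idx);
        }
    }

    fclose(in);
    fclose(idx);
    return 0;
}
```

With the index on disk, reading line N from any process becomes two seeks: one into the index at N * sizeof(uint64_t) to fetch the stored offset, and one into the data file itself.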

Otherwise, if you don't need all the jobs to have exactly the same number of lines/items, you could just fudge it.
Seek to a given offset (say 1G), and look for the closest line separator. Repeat at offset 2G, etc. until you've found enough break points.
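
A sketch of that seek-and-scan in C, where the file name and the number of chunks are assumptions; the helper returns the first line boundary at or after the requested offset:

```c
#define _FILE_OFFSET_BITS 64   /* 64-bit offsets, since the file is > 2 GB */
#include <stdio.h>
#include <stdint.h>

/* Return the first line-start offset at or after 'approx' in 'f'. */
static int64_t next_line_start(FILE *f, int64_t approx)
{
    if (approx <= 0) return 0;              /* the file starts on a line */
    fseeko(f, (off_t)approx, SEEK_SET);
    int c;
    while ((c = fgetc(f)) != EOF && c != '\n')
        ;                                   /* skip the partial line */
    return (int64_t)ftello(f);              /* at EOF this is the file size */
}

int main(void)
{
    FILE *f = fopen("big.txt", "rb");       /* assumed input file */
    if (!f) { perror("fopen"); return 1; }

    fseeko(f, 0, SEEK_END);
    int64_t size = (int64_t)ftello(f);

    int nchunks = 8;                        /* e.g. one chunk per process */
    for (int i = 0; i <= nchunks; i++) {
        int64_t bp = next_line_start(f, size * i / nchunks);
        printf("break point %d at byte %lld\n", i, (long long)bp);
    }

    fclose(f);
    return 0;
}
```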

You can then fire off your parallel tasks on each of the chunks you've identified.
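
The question mentions MPI, so here is one way the hand-off could look; this is only a sketch. Every rank computes the same break points independently (each one costs a seek plus a short scan, so no communication is needed), then reads only its own byte range. process_line and the file name are placeholders, and lines are assumed to fit in the buffer:

```c
#define _FILE_OFFSET_BITS 64   /* 64-bit offsets, since the file is > 2 GB */
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

/* Return the first line-start offset at or after 'approx' in 'f'. */
static int64_t next_line_start(FILE *f, int64_t approx)
{
    if (approx <= 0) return 0;              /* the file starts on a line */
    fseeko(f, (off_t)approx, SEEK_SET);
    int c;
    while ((c = fgetc(f)) != EOF && c != '\n')
        ;                                   /* skip the partial line */
    return (int64_t)ftello(f);
}

/* Placeholder for the actual data mining step (an assumption here). */
static void process_line(const char *line) { (void)line; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    FILE *f = fopen("big.txt", "rb");       /* assumed input file */
    if (!f) { perror("fopen"); MPI_Abort(MPI_COMM_WORLD, 1); }

    fseeko(f, 0, SEEK_END);
    int64_t size = (int64_t)ftello(f);

    /* This rank's byte range, aligned to line boundaries. */
    int64_t begin = next_line_start(f, size * rank / nranks);
    int64_t end   = next_line_start(f, size * (rank + 1) / nranks);

    /* Process exactly the lines that start inside [begin, end). */
    char line[1 << 16];
    fseeko(f, (off_t)begin, SEEK_SET);
    while ((int64_t)ftello(f) < end && fgets(line, sizeof line, f))
        process_line(line);

    fclose(f);
    MPI_Finalize();
    return 0;
}
```

Splitting by bytes rather than by line count means the ranks end up with slightly different numbers of lines, which is exactly the trade-off the answer accepts.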
