Instant access to line from a large file without loading the file


Problem description


In one of my recent projects I need to perform this simple task, but I'm not sure what the most efficient way to do so is.

I have several large text files (>5GB) and I need to continuously extract random lines from those files. The requirements are: I can't load the files into memory, I need to do this very efficiently (>>1000 lines per second), and preferably with as little pre-processing as possible.

The files consist of many short lines (~20 million). The "raw" files have varying line lengths, but with some short pre-processing I can make all lines the same length (though the perfect solution would not require pre-processing).
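With the padded layout described above, random access reduces to arithmetic: line `i` starts at byte `i * line_len`, so a single `seek()` reaches it without reading anything else. A minimal sketch (the function name and parameters are illustrative, not from the question):

```python
import os
import random

def read_random_lines(path, line_len, n):
    """Read n random fixed-length lines via seek(), without loading the file.

    Assumes every line, including its trailing newline, is exactly
    line_len bytes -- the padded layout described above.
    """
    file_size = os.path.getsize(path)
    num_lines = file_size // line_len
    lines = []
    with open(path, "rb") as f:
        # Sorting the sampled indices keeps the disk reads sequential,
        # which helps when thousands of lines are fetched per batch.
        for i in sorted(random.sample(range(num_lines), n)):
            f.seek(i * line_len)           # jump straight to line i
            lines.append(f.read(line_len).rstrip(b"\n"))
    return lines
```

Because nothing but the requested bytes is read, throughput is limited mainly by disk seeks, which easily clears the >>1000 lines/second requirement on any modern drive.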

I already tried the default Python solutions mentioned here, but they were too slow (and the linecache solution loads the file into memory, so it is not usable here).

The next solution I thought of is to create some kind of index. I found this solution, but it's very outdated, so it needs some work to get running, and even then I'm not sure whether the overhead created while processing the index file won't slow the process down to the time-scale of the solutions above.
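The index idea can be sketched without any external package: one sequential pass over the file records the byte offset of each line, after which any line is reachable with a single `seek()`. This is an illustrative sketch, not the outdated solution the question links to:

```python
def build_offset_index(path):
    """One pass over the file, recording the byte offset of each line.

    For ~20 million lines this is a list of ints, far smaller than the
    5GB file itself; it could also be pickled once and reused.
    """
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def fetch_lines(path, offsets, indices):
    """Fetch arbitrary line numbers using the prebuilt index.

    Sorting the requested indices keeps the reads sequential on disk,
    which matters when thousands of lines are pulled per batch.
    """
    out = []
    with open(path, "rb") as f:
        for i in sorted(indices):
            f.seek(offsets[i])
            out.append(f.readline().rstrip(b"\n"))
    return out
```

Unlike the fixed-length approach, this works on the raw files with varying line lengths; the only pre-processing is the single indexing pass.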

Another solution is converting the file into a binary file and then getting instant access to lines that way. For this solution I couldn't find any Python package that supports binary-text work, and I feel that building a robust parser this way could take a very long time and could create many hard-to-diagnose errors down the line because of small miscalculations/mistakes.

The final solution I thought of is using some kind of database (SQLite in my case), which would require transferring the lines into a database and loading them that way.
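The SQLite route needs no schema design at all: one row per line, with SQLite's implicit `rowid` serving as the line number, and a single `IN (...)` query serving a whole batch of random lines. A hedged sketch using only the standard-library `sqlite3` module (table and function names are illustrative):

```python
import sqlite3

def load_into_db(db_path, text_path):
    """One-time import: one row per line; rowid doubles as the line number
    (rowid is 1-based and follows insertion order)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS lines(txt TEXT)")
    with open(text_path, "r") as f:
        # A generator keeps memory flat even for a 5GB input file.
        con.executemany("INSERT INTO lines(txt) VALUES (?)",
                        ((line.rstrip("\n"),) for line in f))
    con.commit()
    return con

def random_lines(con, line_numbers):
    """Batch lookup by rowid -- one query serves a whole group of lines,
    which suits the 'thousands of random lines per call' access pattern."""
    qmarks = ",".join("?" * len(line_numbers))
    rows = con.execute(
        "SELECT txt FROM lines WHERE rowid IN (%s)" % qmarks, line_numbers)
    return [r[0] for r in rows]
```

Rowid lookups hit SQLite's clustered B-tree directly, so per-batch cost stays low; the trade-off is the one-time import pass and roughly doubling the disk footprint.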

Note: I will also be loading thousands of (random) lines each time, so solutions that work better for groups of lines will have an advantage.

Thanks in advance,

Art.

Solution

As said in the comments, I believe using HDF5 would be a good option. This answer shows how to read that kind of file.
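To make the suggestion concrete: an HDF5 dataset of fixed-width byte strings supports fancy indexing straight from disk, so a whole batch of random lines is one indexed read. A minimal sketch with `h5py` and NumPy (names are illustrative; this simple version holds all lines in memory during the one-time conversion, whereas a production version would append in chunks):

```python
import h5py
import numpy as np

def convert_to_hdf5(text_path, h5_path):
    """One-time conversion: store lines as a fixed-width byte-string dataset.

    NOTE: for simplicity this reads the whole file once; a real 5GB
    conversion would stream and append chunk by chunk instead.
    """
    with open(text_path, "rb") as f:
        lines = [line.rstrip(b"\n") for line in f]
    width = max(len(l) for l in lines)
    with h5py.File(h5_path, "w") as h5:
        h5.create_dataset("lines", data=np.array(lines, dtype="S%d" % width))

def random_lines_h5(h5_path, indices):
    """h5py reads only the requested rows from disk, not the whole dataset.

    h5py's fancy indexing requires the indices in increasing order,
    hence the sorted() call.
    """
    with h5py.File(h5_path, "r") as h5:
        return list(h5["lines"][sorted(indices)])
```

This matches the batch access pattern in the question: thousands of random line numbers become a single fancy-indexed read against the on-disk dataset.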

