Reading large lz4 compressed JSON data set in Python 2.7


Problem Description

I need to analyze a large data set that is distributed as a lz4 compressed JSON file.

The compressed file is almost 1TB. I'd prefer not to uncompress it to disk due to cost. Each "record" in the dataset is very small, but it is obviously not feasible to read the entire data set into memory.

Any advice on how to iterate through records in this large lz4 compressed JSON file in Python 2.7?

Answer

As of version 0.19.1 of the python lz4 bindings, buffered IO is fully supported. So, you should be able to do something like:

import lz4.frame

chunk_size = 128 * 1024 * 1024  # read roughly 128 MB at a time
with lz4.frame.open('mybigfile.lz4', mode='r') as file:
    while True:
        chunk = file.read(size=chunk_size)
        if not chunk:  # end of file reached
            break
        # Do stuff with this chunk of data.

which will read data from the file in chunks of roughly 128 MB at a time.
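Since each record is small, the remaining work is splitting those chunks into individual records. If the data set is newline-delimited JSON (one record per line — an assumption, since the question doesn't say), you need to handle records that straddle a chunk boundary. A minimal sketch of that, using only the standard library (`iter_records` and the sample data are illustrative, not part of the lz4 package):

```python
import json

def iter_records(chunks):
    """Yield parsed JSON records from an iterable of binary chunks,
    assuming newline-delimited JSON. A record may be split across
    two chunks, so an incomplete tail is carried over to the next one."""
    buf = b''
    for chunk in chunks:
        buf += chunk
        lines = buf.split(b'\n')
        buf = lines.pop()  # last piece may be an incomplete record
        for line in lines:
            if line.strip():
                yield json.loads(line)
    if buf.strip():  # final record without a trailing newline
        yield json.loads(buf)

# Simulated chunk stream that cuts the second record mid-way:
data = b'{"id": 1}\n{"id": 2}\n{"id": 3}\n'
chunks = [data[:12], data[12:]]
records = list(iter_records(chunks))
```

In the real case you would feed this generator the chunks produced by the `file.read(size=chunk_size)` loop above, instead of the simulated list.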

Aside: I am the maintainer of the python lz4 package - please do file issues on the project page if you have problems with the package, or if something is not clear in the documentation.
