Iterate a large .xz file line by line in Python

Views: 328 | Published: 2020/6/29 21:19:04 | python lzma xz


Problem description

I have a large .xz file (a few gigabytes) full of plain text. I want to process the text to create a custom dataset. Because the file is too big to load at once, I want to read it line by line. Does anyone have an idea how to do this?

I already tried the approach from "How to open and read LZMA file in-memory", but it does not work.

I get this error: 'ascii' codec can't decode byte 0xfd in position 0: ordinal not in range(128)


My code (using Python 3.5):

with open(filename) as compressed:
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)

Recommended answer

I faced the same problem a few weeks ago. This snippet worked for me:

import lzma

with lzma.open('filename.xz', mode='rt') as file:
    for line in file:
        print(line)

This assumes that the text in the compressed file is encoded in UTF-8 (which was the case for my data). lzma.open() takes an encoding argument that lets you specify a different encoding if needed.
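To make the encoding choice explicit rather than relying on a platform default, the snippet can be wrapped in a small generator. This is only a sketch: the helper name iter_xz_lines and its default arguments are hypothetical and not part of the original answer, while lzma.open() and its mode, encoding and errors parameters come from the standard library.

```python
import lzma


def iter_xz_lines(path, encoding="utf-8", errors="strict"):
    """Yield decoded lines from an .xz file one at a time.

    Hypothetical helper: the name and defaults are illustrative only.
    The file is decompressed as a stream, so it is never loaded whole.
    """
    with lzma.open(path, mode="rt", encoding=encoding, errors=errors) as handle:
        for line in handle:
            # Strip the trailing newline so callers get the bare text.
            yield line.rstrip("\n")
```

A caller would loop over iter_xz_lines('filename.xz') and could pass, for example, encoding='latin-1' if the data is not UTF-8, or errors='replace' to tolerate a few undecodable bytes.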

EDIT (after your own edit): try forcing encoding='utf-8' in lzma.open().
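A note on the error in the question: 0xfd is the first byte of the .xz file header, so the message suggests that the compressed file itself was being decoded as text, probably because open(filename) defaults to text mode. If you prefer to keep the two-step open() plus lzma.LZMAFile structure from the question, one possible variant is to open the file in binary mode and decode each line yourself. The function name iter_lines_binary and the UTF-8 default are assumptions made for illustration.

```python
import lzma


def iter_lines_binary(path, encoding="utf-8"):
    """Yield decoded lines using open() + lzma.LZMAFile, as in the question.

    Hypothetical helper; the encoding default is an assumption.
    """
    # 'rb' is the key change: the compressed bytes must not be decoded as text.
    with open(path, "rb") as compressed:
        with lzma.LZMAFile(compressed) as uncompressed:
            for raw_line in uncompressed:
                # LZMAFile yields bytes, so decode each line explicitly.
                yield raw_line.decode(encoding).rstrip("\r\n")
```

Both variants stream the file, so memory use stays small regardless of the archive size. The lzma.open() form is shorter because it handles decoding for you.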
