Python3 working with csv files in tar files

Question

I am trying to work with csv files contained in a tar.gz file and I am having issues passing the correct data/object through to the csv module.

Say I have a tar.gz file with a number of csv files formatted as follows.

1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38

I want to be able to access each csv file in memory without extracting each file from the tar file and writing them to disk. For example:

import tarfile
import csv

tar = tarfile.open("tar-file.tar.gz")

for member in tar.getmembers():
    f = tar.extractfile(member).read()
    content = csv.reader(f)
    for row in content:
        print(row)
tar.close()

This produces the following error.

    for row in content:
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)

I have also tried parsing f as a string as described in the csv module documentation.

content = csv.reader([f])

The above produces the same error.

I have tried parsing the file object f as ascii.

f = tar.extractfile(member).read().decode('ascii')

but this iterates each csv element instead of iterating rows containing lists of elements.

['1']
['0']
['7']
['9']
['', '']
['S']
['A']
['M']
['P']
['L']
['E']
['_']
['A']
['', '']
['G']
['R']

snip...

['2']
['0']
['1']
['7']
['/']
['0']
['2']
['/']
['1']
['5']
[' ']
['2']
['2']
[':']
['5']
['7']
[':']
['3']
['8']
[]
[]

Trying to both parse f as ascii and read it as a string

f = tar.extractfile(member).read().decode('ascii')
content = csv.reader([f])

produces the following output

    for row in content:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

To demonstrate the different outputs I used the following code.

import tarfile
import csv

tar = tarfile.open("tar-file.tar.gz")

for member in tar.getmembers():
    f = tar.extractfile(member).read()
    print(member.name)
    print('Raw :', type(f))
    print(f)
    print()
    f = f.decode('ascii')
    print('ASCII:', type(f))
    print(f)
tar.close()

This produces the following output. (each csv contains the same data for this example).

./raw_data/csv-file1.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'

ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38


./raw_data/csv-file2.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'

ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38


./raw_data/csv-file3.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'

ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38

How can I get the csv module to correctly read a file in memory provided by the tar module? Thanks.

Recommended answer

You just need to use io.StringIO() to produce a file-like object for the csv library to use. For example:

import tarfile
import csv
import io

with tarfile.open('tar-file.tar.gz') as tar:
    for member in tar:
        if member.isreg():      # Is it a regular file?
            print("{} - {} bytes".format(member.name, member.size))
            csv_file = io.StringIO(tar.extractfile(member).read().decode('ascii'))

            for row in csv.reader(csv_file):
                print(row)
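
As a possible alternative to reading and decoding each member in full, the binary file object returned by tar.extractfile() can be wrapped in io.TextIOWrapper, so the csv module decodes the data line by line. The sketch below is only an illustration under a few assumptions: the helper names build_sample_archive and read_csv_members are hypothetical, the archive is built in memory from two of the sample rows in the question so the example runs on its own, and the data is assumed to be ASCII as in the question.

```python
import csv
import io
import tarfile


def build_sample_archive():
    """Build a small tar.gz in memory (hypothetical stand-in for tar-file.tar.gz)."""
    data = (
        b"1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n"
        b"1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n"
    )
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="raw_data/csv-file1.csv")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def read_csv_members(fileobj, encoding="ascii"):
    """Return {member name: list of rows} for every regular file in the archive."""
    result = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if not member.isreg():  # skip directories, links, etc.
                continue
            binary = tar.extractfile(member)  # binary file-like object
            # TextIOWrapper decodes bytes to str lazily; newline="" leaves
            # line-ending handling to the csv module, as its docs recommend.
            with io.TextIOWrapper(binary, encoding=encoding, newline="") as text:
                result[member.name] = list(csv.reader(text))
    return result


rows_by_name = read_csv_members(build_sample_archive())
for name, rows in rows_by_name.items():
    print(name)
    for row in rows:
        print(row)
```

For a file on disk, passing a filename to tarfile.open() as in the answer above works the same way; the in-memory archive is used here only to keep the example self-contained.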
