Can't read CSV data with Pandas from a gzip-compressed file that stores the name of the archived file
Question
I am trying to read CSV data from a gzip archive file that also stores the name of the archived data file. The problem is that pandas.read_csv() picks up the name of the archived file and returns it as the very first data entry in the returned DataFrame. How can I skip the name of the archived file? I looked at all the available options of pandas.read_csv() and could not find one that would allow me to do this.
Here is how I create my gzip archive file in Python:
import pandas as pn
import numpy as np
import tarfile
a = np.ones((10, 8))
np.savetxt('ones.dat', a)
fh = tarfile.open('ones.tar.gz', 'w:gz')
fh.add('ones.dat', arcname='numpy_ones.dat')
fh.close()
f = pn.read_csv('ones.tar.gz', compression='gzip', sep=r'\s+', header=None)
In [32]: f
Out[32]:
                          0    1    2    3    4    5    6    7    8
0            numpy_ones.dat    1    1    1    1    1    1    1    1
1  1.000000000000000000e+00    1    1    1    1    1    1    1  NaN
2  1.000000000000000000e+00    1    1    1    1    1    1    1  NaN
3  1.000000000000000000e+00    1    1    1    1    1    1    1  NaN
4  1.000000000000000000e+00    1    1    1    1    1    1    1  NaN
5  1.000000000000000000e+00    1    1    1    1    1    1    1  NaN
6  1.000000000000000000e+00    1    1    1    1    1    1    1  NaN
7  1.000000000000000000e+00    1    1    1    1    1    1    1  NaN
8  1.000000000000000000e+00    1    1    1    1    1    1    1  NaN
9                       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
I am using Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03), Numpy 1.9.2, and Pandas 0.16.2.
Many thanks,
Masha
Recommended answer
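The file is a tar archive compressed with gzip, and compression='gzip' only undoes the gzip layer. What read_csv then parses is the raw tar stream, whose header contains the member name. Use tarfile again to open the archive and pass the extracted member to read_csv: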
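import tarfile
import pandas as pd
# Open the tar.gz archive and read the named member as a file object.
with tarfile.open('ones.tar.gz', 'r:gz') as fh:
    f = fh.extractfile('numpy_ones.dat')
    df = pd.read_csv(f, sep=r'\s+', header=None)  # same effect as delim_whitespace=True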
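If the member name is not known in advance, one option is to take the first regular file found in the archive. The following is a minimal sketch under that assumption; the helper name read_first_member is hypothetical, and the temporary directory is only there so the example recreates the archive from the question without touching the working directory.

```python
import os
import tarfile
import tempfile

import numpy as np
import pandas as pd


def read_first_member(archive_path):
    """Read the first regular file in a .tar.gz archive into a DataFrame.

    Hypothetical helper for illustration; it assumes the archive holds
    whitespace-delimited numeric text, as in the question.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        # Skip directories and other non-file entries.
        member = next(m for m in tar.getmembers() if m.isfile())
        with tar.extractfile(member) as fobj:
            return pd.read_csv(fobj, sep=r"\s+", header=None)


# Rebuild the archive from the question inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "ones.dat")
    archive_path = os.path.join(tmp, "ones.tar.gz")
    np.savetxt(data_path, np.ones((10, 8)))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(data_path, arcname="numpy_ones.dat")

    df = read_first_member(archive_path)

print(df.shape)
```

Read this way, the DataFrame should contain only the 10 rows and 8 columns of data, with no entry for the file name and no trailing NaN row or column.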