Appending Column to Frame of HDF File in Pandas

Problem Description

I am working with a large dataset in CSV format. I am trying to process the data column-by-column, then append the data to a frame in an HDF file. All of this is done using Pandas. My motivation is that, while the entire dataset is much bigger than my physical memory, the column size is manageable. At a later stage I will be performing feature-wise logistic regression by loading the columns back into memory one by one and operating on them.

I am able to create a new HDF file and make a new frame with the first column:

import pandas

# Open (or create) the store and append the first CSV column as a frame.
hdf_file = pandas.HDFStore('train_data.hdf')
feature_column = pandas.read_csv('data.csv', usecols=[0])
hdf_file.append('features', feature_column)

But after that, I get a ValueError when trying to append a new column to the frame:

feature_column = pandas.read_csv('data.csv', usecols=[1])
hdf_file.append('features', feature_column)

Stack trace and error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes
    raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
ValueError: cannot match existing table structure for [srch_id] on appending data

I am new to working with large datasets and limited memory, so I am open to suggestions for alternate ways to work with this data.

Recommended Answer

The complete docs are here, and some cookbook strategies are here.

PyTables is row-oriented, so you can only append rows. Read the CSV chunk-by-chunk, then append the entire frame as you go, something like this:

import pandas as pd

# Write the CSV to the store in row chunks; each chunk is a full-width frame.
store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()
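
Since the stated goal is to later operate on one feature at a time: a table written this way can still be read back column-by-column with HDFStore.select. A minimal sketch, assuming the file.h5/df store written above; 'srch_id' is a placeholder for whichever column you want:

import pandas as pd

# Read a single column back from the table created above.
# 'srch_id' is a hypothetical column name -- substitute one of yours.
store = pd.HDFStore('file.h5', mode='r')
feature_column = store.select('df', columns=['srch_id'])
store.close()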

You must be a tad careful, as it is possible for the resultant frame, when read chunk-by-chunk, to have different dtypes; e.g. you have an integer-like column that doesn't have missing values until, say, the 2nd chunk. The first chunk would have that column as an int64, while the second as float64. You may need to force dtypes with the dtype keyword to read_csv; see here.
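
For instance, a minimal sketch of forcing dtypes up front so every chunk agrees (the column names and dtypes here are hypothetical placeholders; note that a column that can contain missing values has to be read as float64, since int64 cannot hold NaN):

import pandas as pd

# Hypothetical column names/dtypes -- substitute your own.
# Columns that may contain NaNs must be floats; int64 cannot represent NaN.
store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000,
                         dtype={'srch_id': 'int64', 'price': 'float64'}):
    store.append('df', chunk)
store.close()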

Here is a similar question as well.
