Column missing when trying to open hdf created by pandas in h5py
Question
This is what my dataframe looks like. The first column is a single int. The second column is a single list of 512 ints.
IndexID Ids
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131 [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716 [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098 [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...
I saved it to hdf and tried opening it using
df.to_hdf('test.h5', key='df', data_columns=True)
h3 = h5py.File('test.h5', 'r')
I see 4 keys when I list the keys
h3['df'].keys()
KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']
Axis1 seems to contain the values for the first column
h3['df']['axis1'][0:5]
array([ 1899317, 22861131, 2163410, 15760716, 12244098,
However, there doesn't seem to be any data from the second column. There is another dataset containing other data
h3['df']['block0_values'][0][0:5]
But that doesn't seem to correspond to any of the data in the second column
array([128, 4, 149, 1, 0], dtype=uint8)
Purpose
I am eventually trying to create a memory-mapped datastore that retrieves data using particular indices.
Something like
h3['df']['workingIndex'][22861131, 15760716]
would retrieve
[0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...],
[0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
Answer
The problem is that you're trying to serialize a Pandas Series of Python lists, and it is not rectangular (it is jagged).
Pandas and HDF5 are largely used for rectangular (cube, hypercube, etc) data, not for jagged lists-of-lists.
Did you see this warning when you called to_hdf()?
PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['Ids']]
What it's trying to tell you is that lists-of-lists are not supported in an intuitive, high-performance way. And if you run an HDF5 visualization tool like h5dump on your output file, you'll see what's wrong. The index (which is well-behaved) looks like this:
DATASET "axis1" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 5 ) / ( 5 ) }
DATA {
(0): 1899317, 22861131, 2163410, 15760716, 12244098
}
ATTRIBUTE "CLASS" {
DATA {
(0): "ARRAY"
}
}
But the values (lists of lists) look like this:
DATASET "block0_values" {
DATATYPE H5T_VLEN { H5T_STD_U8LE }
DATASPACE SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }
DATA {
(0): (128, 5, 149, 164, ...)
}
ATTRIBUTE "CLASS" {
DATA {
(0): "VLARRAY"
}
}
ATTRIBUTE "PSEUDOATOM" {
DATA {
(0): "object"
}
}
What's happening is exactly what the PerformanceWarning warned you about:
> PyTables will pickle object types that it cannot map directly to c-types
Your list-of-lists is being pickled and stored as H5T_VLEN, which is just a blob of bytes.
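As a quick sanity check (a minimal sketch; the exact bytes depend on the pickle protocol pandas happened to use), the leading bytes 128, 4, 149 you saw in block0_values are exactly the pickle protocol-4 stream header \x80\x04\x95:

```python
import pickle

# Pickle any Python object with protocol 4: the stream starts with
# PROTO (0x80), the protocol number (0x04), and a FRAME opcode (0x95),
# i.e. the bytes 128, 4, 149 seen at the start of block0_values.
blob = pickle.dumps([0, 48156, 154, 6304], protocol=4)
print(list(blob[:3]))   # -> [128, 4, 149]
```

So the "column" is there, just serialized as an opaque pickle blob rather than as a real HDF5 array.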
Here are some ways you could fix this:
- Store each row under a separate key in HDF5. That is, each list will be stored as an array, and they can all have different lengths. This is no problem with HDF5, because it supports any number of keys in one file.
- Change your data to be rectangular, e.g. by padding the shorter lists with zeros. See: Pandas split column of lists into multiple columns
- Use h5py to write the data in whatever format you like. It's much more flexible and creates simpler (and yet more powerful) HDF5 files than Pandas/PyTables. Here's one example (which shows h5py can actually store jagged arrays, though it's not pretty): Storing multidimensional variable length array with h5py
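For the second option, a minimal sketch of zero-padding in plain Python (row values shortened from the question's data); once every row has the same length, pandas/PyTables can store the column as a normal rectangular block:

```python
# Hypothetical shortened rows from the question's Ids column
rows = [[0, 48156, 154, 6304], [0, 40883]]

# Pad every list with zeros up to the length of the longest row
width = max(len(r) for r in rows)
padded = [r + [0] * (width - len(r)) for r in rows]
print(padded)   # -> [[0, 48156, 154, 6304], [0, 40883, 0, 0]]
```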
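And for the third option, a sketch of writing the jagged column directly with h5py's variable-length dtype (the file and dataset names here are made up, and the rows are shortened): each row keeps its own length, and you can look rows up by IndexID much as the "Purpose" section asks.

```python
import h5py
import numpy as np

# Hypothetical shortened versions of the question's two columns
index_ids = np.array([1899317, 22861131], dtype='int64')
ids_lists = [[0, 47715, 1757, 9], [0, 48156, 154, 6304, 43611]]

vlen_int = h5py.vlen_dtype(np.dtype('int64'))
with h5py.File('jagged.h5', 'w') as f:
    f.create_dataset('IndexID', data=index_ids)
    ds = f.create_dataset('Ids', shape=(len(ids_lists),), dtype=vlen_int)
    for i, row in enumerate(ids_lists):
        ds[i] = row          # each row may have a different length

with h5py.File('jagged.h5', 'r') as f:
    # Map IndexID -> row position, then fetch a row by its ID
    pos = {idx: i for i, idx in enumerate(f['IndexID'][:])}
    print(f['Ids'][pos[22861131]])
```

Note this still reads the whole IndexID dataset to build the lookup; for very large stores you'd want to keep the index sorted and use np.searchsorted instead.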