Column missing when trying to open hdf created by pandas in h5py

Problem description

This is what my dataframe looks like. The first column is a single int. The second column is a single list of 512 ints.

IndexID     Ids
1899317     [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131    [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410     [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716    [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098    [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...

I saved it to HDF and tried opening it with h5py:

import h5py

df.to_hdf('test.h5', key='df', data_columns=True)
h3 = h5py.File('test.h5', 'r')

I see 4 keys when I list the keys:

h3['df'].keys()

KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']

axis1 seems to contain the values for the first column:

h3['df']['axis1'][0:5]

array([ 1899317, 22861131, 2163410, 15760716, 12244098,

However, there doesn't seem to be any data from the second column. There is another dataset with other data:

h3['df']['block0_values'][0][0:5]

But that doesn't seem to correspond to any of the data in the second column:

array([128, 4, 149, 1, 0], dtype=uint8)

Purpose

I am eventually trying to create a memory-mapped datastore that retrieves data using particular indices.

Something like

h3['df']['workingIndex'][22861131, 15760716]

would retrieve

[0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...],
[0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...

Recommended answer

The problem is that you're trying to serialize a Pandas Series of Python lists. As far as Pandas can tell, that column is not rectangular (it could be jagged).

Pandas and HDF5 are largely used for rectangular (cube, hypercube, etc.) data, not for jagged lists-of-lists.
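One quick way to see the difference is to compare dtypes. The sketch below uses a small hypothetical stand-in for the question's dataframe (two rows of three ints instead of 512) and only inspects it; it does not write any HDF5 file.

```python
import pandas as pd

# Hypothetical stand-in for the question's dataframe (short rows for brevity).
df = pd.DataFrame(
    {"Ids": [[0, 47715, 1757], [0, 48156, 154]]},
    index=pd.Index([1899317, 22861131], name="IndexID"),
)

# Each cell of "Ids" is a Python list, so the column has object dtype.
ids_dtype = df["Ids"].dtype

# The index holds plain integers, which map directly to a C type.
index_dtype = df.index.dtype

print(ids_dtype, index_dtype)
```

An object column gives the storage layer no fixed element type or row width to work with, which is what the warning quoted next is about.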

Did you see this warning when you called to_hdf()?

PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['Ids']]

What it's trying to tell you is that lists-of-lists are not supported in an intuitive, high-performance way. If you run an HDF5 inspection tool like h5dump on your output file, you'll see what's wrong. The index (which is well-behaved) looks like this:

  DATASET "axis1" {
     DATATYPE  H5T_STD_I64LE
     DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
     DATA {
     (0): 1899317, 22861131, 2163410, 15760716, 12244098
     }
     ATTRIBUTE "CLASS" {
        DATA {
        (0): "ARRAY"
        }
     }

But the values (lists of lists) look like this:

  DATASET "block0_values" {
     DATATYPE  H5T_VLEN { H5T_STD_U8LE}
     DATASPACE  SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }
     DATA {
     (0): (128, 5, 149, 164, ...)
     }
     ATTRIBUTE "CLASS" {
        DATA {
        (0): "VLARRAY"
        }
     }
     ATTRIBUTE "PSEUDOATOM" {
        DATA {
        (0): "object"
        }
     }

What's happening is exactly what the PerformanceWarning warned you about:

> PyTables will pickle object types that it cannot map directly to c-types

Your list-of-lists is being pickled and stored as H5T_VLEN, which is just a blob of bytes. That is why block0_values holds uint8 values that match nothing in the Ids column.
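The leading bytes in block0_values are recognizable as a pickle header. The sketch below, using only the standard library, pickles a short stand-in list (a plain list, not the NumPy object array that Pandas actually hands to PyTables) and looks at the first three bytes. The second byte is the pickle protocol number, which is presumably why the question shows 128, 4 while the dump above shows 128, 5.

```python
import pickle

# Short stand-in rows; the question's lists hold 512 ints each.
ids = [[0, 47715, 1757], [0, 48156, 154]]

# Protocol 4 is chosen explicitly so the header matches the question's bytes.
blob = pickle.dumps(ids, protocol=4)

# 128 (0x80) is the PROTO opcode, 4 is the protocol number,
# and 149 (0x95) is the FRAME opcode.
header = list(blob[:3])

# The bytes only become usable data again after unpickling.
restored = pickle.loads(blob)

print(header, restored == ids)
```

Reading the data back therefore means unpickling the whole blob, so h5py cannot slice individual rows out of it.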

Here are some ways you could fix this:

  1. Store each row under a separate key in HDF5. That is, each list will be stored as an array, and they can all have different lengths. This is no problem for HDF5, because it supports any number of keys in one file.
  2. Change your data to be rectangular, e.g. by padding the shorter lists with zeros. See: Pandas split column of lists into multiple columns
  3. Use h5py to write the data in whatever format you like. It's much more flexible and creates simpler (and yet more powerful) HDF5 files than Pandas/PyTables. Here's one example (which shows h5py can actually store jagged arrays, though it's not pretty): Storing multidimensional variable length array with h5py
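As a rough illustration of options 2 and 3 combined, the sketch below pads a few stand-in lists into a rectangular array, writes it with h5py next to a dataset of index values, and then reads back only the rows for chosen IndexIDs. The dataset names index_id and ids, the row_of lookup, and the sample rows are all assumptions made for this example, not part of the original question. With rows that are all 512 long, as in the question, the padding step changes nothing.

```python
import numpy as np
import h5py

# Hypothetical stand-in data; the question's rows hold 512 ints each.
index_ids = [1899317, 22861131, 2163410]
id_lists = [[0, 47715, 1757, 9], [0, 48156, 154], [0, 26039]]

# Option 2: pad shorter lists with zeros to get a rectangular int64 array.
width = max(len(row) for row in id_lists)
padded = np.zeros((len(id_lists), width), dtype=np.int64)
for i, row in enumerate(id_lists):
    padded[i, :len(row)] = row

# Option 3: write plain datasets with h5py. driver="core" with
# backing_store=False keeps this demo file in memory only.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("index_id", data=np.asarray(index_ids, dtype=np.int64))
    f.create_dataset("ids", data=padded)

    # Map each IndexID to its row number once, then read only those rows.
    row_of = {int(k): i for i, k in enumerate(f["index_id"][:])}
    # h5py expects index lists in increasing order, so sort the row numbers.
    rows = sorted(row_of[k] for k in (22861131, 1899317))
    selected = f["ids"][rows]

print(selected)
```

Because the row numbers are sorted before reading, the result comes back in file order, not in the order the IndexIDs were requested. For a real file on disk, drop the driver and backing_store arguments.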
