Can I store my own class object into hdf5?


Question

I have a class like this:

class C:
     def __init__(self, id, user_id, photo):
         self.id = id
         self.user_id = user_id
         self.photo = photo

I need to create millions of these objects. id is an integer, as is user_id, but photo is a bool array of size 64. My boss wants me to store all of them inside hdf5 files. I also need to be able to query them by their user_id attribute to get all of the photos that have the same user_id. Firstly, how do I store them? Or can I even do that? And secondly, once I have stored them (if I can), how do I query them? Thank you.

Recommended answer

Although you can store the whole data structure in a single HDF5 table, it is probably much easier to store the described class as three separate variables - two 1D arrays of integers and a data structure for storing your 'photo' attribute.

If you care about file size and speed and do not care about human-readability of your files, you can model your 64 bool values either as 8 1D arrays of UINT8 or as a 2D array N x 8 of UINT8 (or CHARs). Then, you can implement a simple interface that packs your bool values into the bits of UINT8 values and back (see, e.g., "How to convert a boolean array to an int array").
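The packing interface described above could look roughly like the following. This is a minimal sketch, assuming NumPy is available; the helper names `pack_photo` and `unpack_photo` are hypothetical and not part of the original answer.

```python
import numpy as np


def pack_photo(photo):
    """Pack a 64-element boolean sequence into 8 uint8 bytes (hypothetical helper)."""
    bits = np.asarray(photo, dtype=bool)
    if bits.shape != (64,):
        raise ValueError("expected exactly 64 boolean values")
    # np.packbits stores the first element in the most significant bit.
    return np.packbits(bits)


def unpack_photo(packed):
    """Inverse of pack_photo: 8 uint8 bytes back to 64 booleans."""
    return np.unpackbits(np.asarray(packed, dtype=np.uint8)).astype(bool)


photo = [True, False, True, True, False, True, True, True] * 8
packed = pack_photo(photo)      # shape (8,), dtype uint8
restored = unpack_photo(packed)  # shape (64,), dtype bool
```

Stacking the packed rows of N records gives the N x 8 UINT8 array mentioned above, which uses 8 bytes per photo instead of 64.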

As far as I know, there are no built-in search functions in HDF5, but you can read in the variable containing the user_ids and then simply use Python to find the indexes of all elements matching your user_id.
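As a rough illustration of that lookup, the sketch below searches an in-memory array of user ids. The array here is made up and stands in for the column read from the file (with h5py that would be something like `f["user_id"][:]`); the helper name `find_indexes` is an assumption for illustration.

```python
import numpy as np


def find_indexes(user_ids, wanted_user_id):
    """Return the positions of all records whose user_id matches (hypothetical helper)."""
    return np.flatnonzero(np.asarray(user_ids) == wanted_user_id)


# Stand-in for the user_id column loaded from the HDF5 file.
user_ids = np.array([3, 53, 3, 7, 53, 3])
idx = find_indexes(user_ids, 3)
```

For a few million integers the whole column fits comfortably in memory, so a single vectorized comparison like this is usually sufficient.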

Once you have the indexes, you can read in the relevant slices of your other variables. HDF5 natively supports efficient slicing, but it works on ranges, so you might want to think about how to store records with the same user_id in contiguous chunks; see the discussion here:

h5py: Correct way to slice array datasets
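One possible way to get contiguous blocks per user is to sort all records by user_id before writing, then locate each user's range with a binary search. The sketch below works on in-memory NumPy arrays only; the helper names `sort_by_user` and `user_range` and the sample data are assumptions, not part of the original answer.

```python
import numpy as np


def sort_by_user(ids, user_ids, photos):
    """Reorder all three columns so equal user_ids are adjacent (hypothetical helper)."""
    order = np.argsort(user_ids, kind="stable")
    return ids[order], user_ids[order], photos[order]


def user_range(sorted_user_ids, wanted_user_id):
    """Return (start, stop) of the contiguous block for one user in a sorted column."""
    start = int(np.searchsorted(sorted_user_ids, wanted_user_id, side="left"))
    stop = int(np.searchsorted(sorted_user_ids, wanted_user_id, side="right"))
    return start, stop


# Made-up sample data: 5 records with packed 8-byte photos.
ids = np.array([10, 11, 12, 13, 14])
user_ids = np.array([53, 3, 53, 7, 3])
photos = np.arange(5 * 8, dtype=np.uint8).reshape(5, 8)

s_ids, s_users, s_photos = sort_by_user(ids, user_ids, photos)
start, stop = user_range(s_users, 53)
block = s_photos[start:stop]  # all photos of user 53 in one range read
```

If the sorted columns are written to the file in this order, the same `start:stop` range can be applied to the datasets directly (for example `f["photo"][start:stop]` in h5py), which matches the range-based slicing that HDF5 handles efficiently.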

You might also want to look into pytables - a Python interface built on top of hdf5 for storing data in table-like structures.

import numpy as np
import h5py


class C:
    def __init__(self, id, user_id, photo):
        self.id = id
        self.user_id = user_id
        self.photo = photo


def write_records(records, file_out):
    # Pack each 64-element bool photo into 8 bytes, giving an N x 8 uint8 array.
    packed = np.array(
        [np.packbits(np.asarray(r.photo, dtype=bool)) for r in records],
        dtype='uint8').reshape(-1, 8)
    # Size the datasets to the data so no zero-filled padding rows are stored.
    with h5py.File(file_out, "w") as f:
        f.create_dataset("id", data=np.array([r.id for r in records], dtype='i8'))
        f.create_dataset("user_id", data=np.array([r.user_id for r in records], dtype='i8'))
        # 'u1' is uint8; the original 'u8' would have meant 8-byte integers.
        f.create_dataset("photo", data=packed, dtype='u1')


def read_records_by_id(file_in, record_id):
    res = []
    with h5py.File(file_in, "r") as f:
        # Read the whole id column (not just the first two entries) and search it.
        ids = f["id"][:]
        for idx in np.where(ids == record_id)[0].tolist():
            photo = np.unpackbits(f["photo"][idx]).astype(bool)
            res.append(C(int(f["id"][idx]), int(f["user_id"][idx]), photo))
    return res


m = [True, False, True, True, False, True, True, True]
m = m * 8  # 64 bool values
records = [C(1, 3, m), C(34, 53, m)]

# Write records to file
write_records(records, "mytestfile.h5")

# Read record from file
res = read_records_by_id("mytestfile.h5", 34)

print(res[0].id)
print(res[0].user_id)
print(res[0].photo)
