Creating large number of datasets with h5py - Unable to register datatype atom (Can't insert duplicate key)


Problem description

I am attempting to store a large number of numpy structured arrays as datasets in an hdf5 file.
For example,
f['tree1'] = structured_array1
.
.
f['tree60000'] = structured_array60000 (there are ~60000 trees),

About 70% of the way into reading the file, I get the error RuntimeError: Unable to register datatype atom (Can't insert duplicate key)

This problem occurs only for an ascii file that is very large (10e7 lines, 5 GB). It does not occur if the file is smaller (10e6 lines, 500 MB). It also does not occur if I take out the datatype and just store the data as a numpy array of strings.

I can fix this problem if I stop reading halfway through the file, close my terminal, open it again, and continue reading the file from the halfway point to the end (I save the line number I ended on). I tried opening and closing the hdf5 file in the python function itself, but this did not work.
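The save-the-line-number-and-resume workaround described above can be sketched roughly as follows; the checkpoint file name and the helper function are my own assumptions, not part of the original post:

```python
import os

def read_lines_resumable(fname, checkpoint="progress.txt"):
    """Yield (line_number, line) pairs, skipping lines that a previous
    run already reached according to a saved checkpoint file."""
    start = 0
    if os.path.exists(checkpoint):
        with open(checkpoint) as cp:
            start = int(cp.read().strip() or 0)
    with open(fname) as f:
        for i, line in enumerate(f):
            if i < start:
                continue  # already handled in a previous run
            # Record progress before handing the line out; if processing
            # of this line crashes, it will be skipped on resume.
            with open(checkpoint, "w") as cp:
                cp.write(str(i + 1))
            yield i, line
```

Restarting the process (rather than just reopening the file) is what appears to reset the registered-datatype count, so a wrapper like this only helps when driven from a fresh interpreter each time.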

dt = [
('scale', 'f4'), 
('haloid', 'i8'), 
('scale_desc', 'f4'), 
('haloid_desc', 'i8'), 
('num_prog', 'i4'), 
('pid', 'i8'), 
('upid', 'i8'), 
('pid_desc', 'i8'), 
('phantom', 'i4'), 
('mvir_sam', 'f4'), 
('mvir', 'f4'), 
('rvir', 'f4'), 
('rs', 'f4'), 
('vrms', 'f4'), 
('mmp', 'i4'), 
('scale_lastmm', 'f4'), 
('vmax', 'f4'), 
('x', 'f4'), 
('y', 'f4'), 
('z', 'f4'), 
('vx', 'f4'), 
('vy', 'f4'), 
('vz', 'f4'), 
('jx', 'f4'), 
('jy', 'f4'), 
('jz', 'f4'), 
('spin', 'f4'), 
('haloid_breadth_first', 'i8'), 
('haloid_depth_first', 'i8'), 
('haloid_tree_root', 'i8'), 
('haloid_orig', 'i8'), 
('snap_num', 'i4'), 
('haloid_next_coprog_depthfirst', 'i8'), 
('haloid_last_prog_depthfirst', 'i8'), 
('haloid_last_mainleaf_depthfirst', 'i8'), 
('rs_klypin', 'f4'), 
('mvir_all', 'f4'), 
('m200b', 'f4'), 
('m200c', 'f4'), 
('m500c', 'f4'), 
('m2500c', 'f4'), 
('xoff', 'f4'), 
('voff', 'f4'), 
('spin_bullock', 'f4'), 
('b_to_a', 'f4'), 
('c_to_a', 'f4'), 
('axisA_x', 'f4'), 
('axisA_y', 'f4'), 
('axisA_z', 'f4'), 
('b_to_a_500c', 'f4'), 
('c_to_a_500c', 'f4'), 
('axisA_x_500c', 'f4'), 
('axisA_y_500c', 'f4'), 
('axisA_z_500c', 'f4'), 
('t_by_u', 'f4'), 
('mass_pe_behroozi', 'f4'), 
('mass_pe_diemer', 'f4')
]

def read_in_trees(self):
    """Store each tree as an hdf5 dataset."""
    with open(self.fname) as ascii_file:
        with h5py.File(self.hdf5_name, "r+") as f:
            tree_id = ""
            current_tree = []
            for line in ascii_file:
                if line[0] == '#':  # header line: start of a new tree
                    if current_tree:  # flush the previous tree, if any
                        arr = np.array(current_tree, dtype=dt)
                        f[tree_id] = arr
                        current_tree = []
                    tree_id = line[6:].strip('\n')
                else:  # read in next tree element
                    current_tree.append(tuple(line.split()))
            if current_tree:  # flush the last tree
                f[tree_id] = np.array(current_tree, dtype=dt)
    return
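The `tuple(line.split())` → `np.array(..., dtype=dt)` step above relies on numpy coercing each whitespace-separated string token to the matching field type of the compound dtype. A minimal sketch with a cut-down version of the dtype (the sample values are invented for illustration):

```python
import numpy as np

# Cut-down version of the full 57-field compound dtype from the question
dt_small = [('scale', 'f4'), ('haloid', 'i8'), ('mvir', 'f4')]

lines = [
    "0.9923 123456789 1.2e12",
    "1.0000 123456790 1.3e12",
]

# Each line becomes a tuple of string tokens; numpy converts each
# token to the corresponding field type when building the array.
arr = np.array([tuple(ln.split()) for ln in lines], dtype=dt_small)
```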

Error:

/Volumes/My Passport for Mac/raw_trees/bolshoi/rockstar/asciiReaderOne.py in read_in_trees(self)
    129                             arr = np.array(current_tree, dtype = dt)
    130                             # depth_sort =  arr['haloid_depth_first'].argsort()
--> 131                             f[tree_id] = arr
    132                             current_tree = []
    133                         first_line = False

/Library/Python/2.7/site-packages/h5py/_objects.so in h5py._objects.with_phil.wrapper (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/_objects.c:2458)()

/Library/Python/2.7/site-packages/h5py/_objects.so in h5py._objects.with_phil.wrapper (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/_objects.c:2415)()

/Library/Python/2.7/site-packages/h5py/_hl/group.pyc in __setitem__(self, name, obj)
    281 
    282         else:
--> 283             ds = self.create_dataset(None, data=obj, dtype=base.guess_dtype(obj))
    284             h5o.link(ds.id, self.id, name, lcpl=lcpl)
    285 

/Library/Python/2.7/site-packages/h5py/_hl/group.pyc in create_dataset(self, name, shape, dtype, data, **kwds)
    101         """
    102         with phil:
--> 103             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    104             dset = dataset.Dataset(dsid)
    105             if name is not None:

/Library/Python/2.7/site-packages/h5py/_hl/dataset.pyc in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
    124 
    125     if data is not None:
--> 126         dset_id.write(h5s.ALL, h5s.ALL, data)
    127 
    128     return dset_id

/Library/Python/2.7/site-packages/h5py/_objects.so in h5py._objects.with_phil.wrapper (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/_objects.c:2458)()

/Library/Python/2.7/site-packages/h5py/_objects.so in h5py._objects.with_phil.wrapper (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/_objects.c:2415)()

/Library/Python/2.7/site-packages/h5py/h5d.so in h5py.h5d.DatasetID.write (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/h5d.c:3260)()

/Library/Python/2.7/site-packages/h5py/h5t.so in h5py.h5t.py_create (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/h5t.c:15314)()

/Library/Python/2.7/site-packages/h5py/h5t.so in h5py.h5t.py_create (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/h5t.c:14903)()

/Library/Python/2.7/site-packages/h5py/h5t.so in h5py.h5t._c_compound (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/h5t.c:14192)()

/Library/Python/2.7/site-packages/h5py/h5t.so in h5py.h5t.py_create (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/h5t.c:15314)()

/Library/Python/2.7/site-packages/h5py/h5t.so in h5py.h5t.py_create (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/h5t.c:14749)()

/Library/Python/2.7/site-packages/h5py/h5t.so in h5py.h5t._c_float (/Users/travis/build/MacPython/h5py-wheels/h5py/h5py/h5t.c:12379)()

RuntimeError: Unable to register datatype atom (Can't insert duplicate key)

Answer

Do you get an error stack? An indication of where in the code the error is produced?

You report: RuntimeError: Unable to register datatype atom (Can't insert duplicate key)

In /usr/lib/python3/dist-packages/h5py/_hl/datatype.py

class Datatype(HLObject):
    # Represents an HDF5 named datatype stored in a file.
    # >>> MyGroup["name"] = numpy.dtype("f")
    def __init__(self, bind):
        """ Create a new Datatype object by binding to a low-level TypeID.

I'm throwing out a guess here. Your dt has 57 terms. I suspect that each time you add a tree to the file, it registers each field as a new datatype.

In [71]: (57*10e7*.7)/(2**32)
Out[71]: 0.9289942681789397

70% of 57 * 10e7 is close to 2**32. If Python/numpy uses an int32 as the datatype id, then you could be hitting this limit.

We'd have to dig around more in either the h5py or numpy code to find out who emits this error message.

By adding an array to the file with:

f[tree_id] = arr

you are putting each array in a Dataset in a new Group. If each Dataset registers a dtype, or a datatype for each field of the array, you could easily approach 2**32 datatypes.

If on the other hand you could store multiple arr to one Group or Dataset, you might avoid this registration of thousands of datatypes. I'm not familiar enough with h5py to suggest how you do that.
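One way to sketch the "multiple trees in one dataset" idea: concatenate all trees into a single structured array carrying an extra tree-id column, so writing the file needs only one compound datatype. The `tree_index` field and the helper below are my own assumptions, not from the original post:

```python
import numpy as np

# Hypothetical cut-down dtype; the real one has 57 fields.
dt_small = [('scale', 'f4'), ('haloid', 'i8')]
# Same fields plus a tree index, so one dataset can hold every tree.
dt_tagged = [('tree_index', 'i4')] + dt_small

def tag_and_concat(trees):
    """Merge per-tree structured arrays into one array with a tree_index
    column; writing this single array to hdf5 would register one compound
    datatype instead of one per tree."""
    tagged = []
    for idx, tree in enumerate(trees):
        out = np.empty(len(tree), dtype=dt_tagged)
        out['tree_index'] = idx
        for name, _ in dt_small:
            out[name] = tree[name]
        tagged.append(out)
    return np.concatenate(tagged)
```

Individual trees can then be recovered by masking on `tree_index`, at the cost of one extra column per row.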

I wonder if this sequence works to reuse the datatype for multiple datasets:

dt1 = np.dtype(dt)
gg = f.create_group('testgroup')
gg['xdtype'] = dt1      # commit dt as a named datatype; see h5py.Datatype doc
xdtype = gg['xdtype']   # an h5py.Datatype bound to the committed type
x = np.zeros((10,), dtype=xdtype)
gg['tree1'] = x
x = np.ones((10,), dtype=xdtype)
gg['tree2'] = x

Following the Datatype doc, I am trying to register a named datatype and use it for each of the datasets added to the group.

In [117]: isinstance(xdtype, h5py.Datatype)
Out[117]: True
In [118]: xdtype.id
Out[118]: <h5py.h5t.TypeCompoundID at 0xb46e0d4c>

So if I am reading def make_new_dset correctly, this bypasses the py_create call.
