Storing string datasets in HDF5 with Unicode
Question
I am trying to store variable-length strings read from a file that contains special characters such as ø, æ, and å. Here is my code:
import h5py as h5
file = h5.File('deleteme.hdf5','a')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(1,),dtype=dt)
dset.attrs[str(1)] = "some text with ø, æ, å"
However, the text does not appear to be stored properly. The stored data contains the text:
"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"
How can I store the special characters properly? I have tried to follow the guide in the h5py documentation: Strings in HDF5 - Variable-length UTF-8
The output above is from h5dump. The answer below verifies that the characters are in fact stored correctly as UTF-8.
Recommended answer
Using:
import h5py as h5
# Write a dataset of three variable-length strings plus a string attribute.
file = h5.File('deleteme.hdf5','w')
dt = h5.special_dtype(vlen=str)  # variable-length string dtype
dset = file.create_dataset("text",(3,),dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()
# Reopen the file read-only and print what was stored.
file = h5.File('deleteme.hdf5','r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()
I get:
$ python3 stack44661467.py
['ø' 'æ' 'å']
some text with ø, æ, å
That is, h5py does see and interpret the strings as Unicode, both when writing and when reading.
Using the dump utility:
$ h5dump deleteme.hdf5
HDF5 "deleteme.hdf5" {
GROUP "/" {
   DATASET "text" {
      DATATYPE H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): "\37777777703\37777777670", "\37777777703\37777777646",
      (2): "\37777777703\37777777645"
      }
      ATTRIBUTE "1" {
         DATATYPE H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE SCALAR
         DATA {
         (0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
         }
      }
   }
}
}
Note that in both cases (the dataset and the attribute) the datatype is marked as UTF-8:
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
That is what the docs say:
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.
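As a hedged illustration of that statement, the round trip can also be written with the newer string API. This is a minimal sketch, assuming h5py 3.0 or later, where h5py.string_dtype() replaces special_dtype(vlen=str) and string datasets are read back as bytes unless asstr() is used. The file name and the temporary directory are placeholders, not part of the original question.

```python
import os
import tempfile

import h5py

# Variable-length UTF-8 string dtype (assumes h5py >= 2.10 for string_dtype).
utf8 = h5py.string_dtype(encoding="utf-8")

# Hypothetical file location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "unicode_demo.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("text", (3,), dtype=utf8)
    dset[:] = ["ø", "æ", "å"]
    dset.attrs["1"] = "some text with ø, æ, å"

with h5py.File(path, "r") as f:
    # asstr() (assumes h5py >= 3.0) decodes the stored UTF-8 bytes to str.
    values = list(f["text"].asstr()[:])
    # String attributes are returned as str directly.
    note = f["text"].attrs["1"]

print(values)
print(note)
```

The context managers close the file even if an exception is raised, which avoids leaving deleteme.hdf5-style files open between the write and the read.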
Let h5py (or another reader) worry about interpreting \37777777703\37777777670 as the proper Unicode character.
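For readers who want to check the dump by hand: the escapes are consistent with each UTF-8 byte being printed as a sign-extended 32-bit value in octal, so \37777777703 is 0xFFFFFFC3 and its low byte is 0xC3. The sketch below rests on that assumption about this particular h5dump output; decode_h5dump_escapes is a hypothetical helper, not part of h5py or the HDF5 tools.

```python
import re

# One escape as it appears in the dump above: a backslash followed by up to
# 11 octal digits. Limitation: a literal digit 0-7 directly after a shorter
# escape would be read as part of that escape.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,11})")


def decode_h5dump_escapes(text: str) -> str:
    """Rebuild the UTF-8 text from h5dump-style octal escapes (hypothetical helper)."""
    out = bytearray()
    pos = 0
    for match in _OCTAL_ESCAPE.finditer(text):
        # Copy the plain text between escapes unchanged.
        out += text[pos:match.start()].encode("utf-8")
        # Keep only the low 8 bits, discarding the sign extension.
        out.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return out.decode("utf-8")


dumped = (
    r"some text with \37777777703\37777777670, "
    r"\37777777703\37777777646, \37777777703\37777777645"
)
decoded = decode_h5dump_escapes(dumped)
print(decoded)
```

Each pair of escapes yields the two bytes of one character: 0xC3 0xB8 for ø, 0xC3 0xA6 for æ, and 0xC3 0xA5 for å.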