Convert large hdf5 dataset written via pandas/pytables to vaex

Question

I have a very large dataset I write to hdf5 in chunks via append like so:

with pd.HDFStore(self.train_store_path) as train_store:
    for filepath in tqdm(filepaths):
        with open(filepath, 'rb') as file:
            frame = pickle.load(file)

        if frame.empty:
            os.remove(filepath)
            continue

        try:
            train_store.append(
                key='dataset', value=frame,
                min_itemsize=itemsize_dict)
            os.remove(filepath)
        except KeyError as e:
            print(e)
        except ValueError as e:
            print(frame)
            print(e)
        except Exception as e:
            print(e) 

The data is far too large to load into one DataFrame, so I would like to try out vaex for further processing. There are a few things I don't really understand, though.

Since vaex uses a different representation in hdf5 than pandas/pytables (VOTable), I'm wondering how to go about converting between those two formats. I tried loading the data in chunks into pandas, converting it to a vaex DataFrame and then storing it, but there seems to be no way to append data to an existing vaex hdf5 file, at least none that I could find.

Is there really no way to create a large hdf5 dataset from within vaex? Is the only option to convert an existing dataset to vaex' representation (constructing the file via a python script or TOPCAT)?

Related to my previous question: if I work with a large dataset in vaex out-of-core, is it possible to then persist the results of any transformations I apply in vaex into the hdf5 file?

Answer

The problem with this storage format is that it is not column-based, which does not play well with datasets that have a large number of rows: if you only work with one column, for instance, the OS will probably also read large portions of the other columns, and the CPU cache gets polluted with them. It would be better to store the data in a column-based format such as vaex's hdf5 format, or Arrow.
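The row-based vs column-based difference can be illustrated with plain numpy, independent of vaex: in a record (row) layout the values of a single column are strided across memory, while a columnar layout keeps each column contiguous. A toy sketch, not vaex internals:

```python
import numpy as np

# Row-based layout: every record stores all columns together,
# similar to how a pytables table lays out rows.
rows = np.zeros(4, dtype=[('x', 'f8'), ('y', 'f8')])

# Column-based layout: each column is its own contiguous array,
# similar to vaex's hdf5 format or Arrow.
cols = {'x': np.zeros(4), 'y': np.zeros(4)}

# Selecting just 'x' from the row layout yields a strided view that
# skips over the interleaved 'y' bytes on every element:
print(rows['x'].flags['C_CONTIGUOUS'])   # False
print(cols['x'].flags['C_CONTIGUOUS'])   # True
```

Reading the strided view still pulls the interleaved `y` bytes through the memory hierarchy, which is the cache-pollution effect described above.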

Converting to a vaex dataframe can be done using:

import vaex
vaex_df = vaex.from_pandas(pandas_df, copy_index=False)

You can do this for each dataframe, and store them on disk as hdf5 or arrow:

vaex_df.export('batch_1.hdf5')  # or 'batch_1.arrow'
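Putting the two steps together, the original pytables store can be streamed into per-chunk vaex files without ever loading everything at once. A minimal sketch, assuming the store was written with `append` (so it is a table-format store that supports chunked reads); the path `train.h5`, the key `'dataset'`, and the helper names are hypothetical:

```python
import pandas as pd

def batch_filename(prefix, i):
    """Name of the i-th exported batch file, e.g. 'batch_0.hdf5'."""
    return f'{prefix}_{i}.hdf5'

def convert_store_to_vaex(store_path, key='dataset',
                          chunksize=1_000_000, prefix='batch'):
    """Stream a pytables store into per-chunk vaex hdf5 files."""
    import vaex  # imported here so the helper above works without vaex installed
    paths = []
    with pd.HDFStore(store_path, mode='r') as store:
        # select() with chunksize yields DataFrames of at most
        # `chunksize` rows, so memory use stays bounded.
        for i, chunk in enumerate(store.select(key, chunksize=chunksize)):
            vdf = vaex.from_pandas(chunk, copy_index=False)
            path = batch_filename(prefix, i)
            vdf.export(path)
            paths.append(path)
    return paths
```

The resulting `batch_*.hdf5` files can then be opened together with `vaex.open('batch*.hdf5')` as shown next.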

If you do this for many files, you can lazily (i.e. no memory copies will be made) concatenate them, or use the vaex.open function:

df1 = vaex.open('batch_1.hdf5')
df2 = vaex.open('batch_2.hdf5')
df = vaex.concat([df1, df2])  # seen as one dataframe, without a memory copy
df_alternative = vaex.open('batch*.hdf5')  # same effect, but only needs 1 line

Regarding your question about transformations:

If you do transformations to a dataframe, you can write out the computed values, or get the 'state', which includes the transformations:

import vaex
df = vaex.example()
df['difference'] = df.x - df.y
# df.export('materialized.hdf5', column_names=['difference'])  # do this if IO is fast, and memory abundant
# state = df.state_get()  # get state in memory
df.state_write('mystate.json') # or write as json


import vaex
df = vaex.example()
# df.join(vaex.open('materialized.hdf5'))  # join on row number (super fast, 0 memory use!)
# df.state_set(state)  # or apply the state from memory
df.state_load('mystate.json')  # or from disk
df
