Storing a DataFrame with Array Entries
Question
I have a Pandas DataFrame with the following structure, which contains both numbers and numpy arrays of fixed shape:
import pandas as pd
import numpy as np
df = pd.DataFrame({"num": (23, 42), "list": (np.arange(3), np.arange(1, 4))})
Assuming I have large (more than 1 GB) amounts of this data that I would like to store and retrieve quickly, how should I go about storing it? If I use HDF5, the Numpy array gets pickled which will affect the ability to retrieve the data quickly. Is there some way to tell HDF5 how to store Numpy arrays? Alternatively, should I not be using HDF5 at all?
The following GitHub thread seems to suggest the following:
- Create a function that gets the desired Numpy array, which is stored in some other format [1]
- Create a class to inform HDF5 [2]
Both of these solutions seem oddly specific for how common I imagine this problem to be. Are there more general approaches? Am I just using the wrong tool?
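Recommended answer
I mean something like this: expand the array column into one numeric column per element, so that every column holds plain numbers: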
# Stack the arrays into a 2-D block and place it next to the scalar column
df_x = pd.concat([df.num, pd.DataFrame(np.vstack(df.list))],
                 keys=["key", "arr"], axis=1)
Resulting DataFrame:
  key arr
  num   0  1  2
0  23   0  1  2
1  42   1  2  3
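One way to see why this layout should avoid pickling is to compare column dtypes: the original "list" column holds generic Python objects, while every column of the flattened frame is numeric. The following is only a minimal sketch that rebuilds the frames from above; the variable `all_numeric` is introduced here for illustration, and `.tolist()` is used in place of attribute access purely as a defensive choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": (23, 42), "list": (np.arange(3), np.arange(1, 4))})

# The array column is stored as generic Python objects.
print(df["list"].dtype)

# Same flattening as in the answer: one numeric column per array element.
df_x = pd.concat(
    [df["num"], pd.DataFrame(np.vstack(df["list"].tolist()))],
    keys=["key", "arr"],
    axis=1,
)

# After flattening, no object columns remain.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df_x.dtypes)
print(all_numeric)
```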
Convert back with:
pd.concat([df_x.key, pd.Series(tuple(df_x.arr.values), name='list')], axis=1)
   num       list
0   23  [0, 1, 2]
1   42  [1, 2, 3]
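Both directions can be wrapped into a pair of small helpers. This is a sketch under the assumption that all arrays in the column have the same length; the function names `flatten_arrays` and `restore_arrays` and the `array_col` parameter are hypothetical and not part of the answer or of pandas.

```python
import numpy as np
import pandas as pd


def flatten_arrays(df, array_col="list"):
    """Expand a column of equal-length arrays into one numeric column per element."""
    scalars = df.drop(columns=[array_col])
    arrays = pd.DataFrame(np.vstack(df[array_col].tolist()), index=df.index)
    # Two column groups: "key" for the scalar columns, "arr" for the array elements.
    return pd.concat([scalars, arrays], keys=["key", "arr"], axis=1)


def restore_arrays(df_x, array_col="list"):
    """Rebuild the array column from the flattened "arr" block."""
    rows = list(df_x["arr"].to_numpy())  # one 1-D array per row
    arrays = pd.Series(rows, index=df_x.index, name=array_col)
    return pd.concat([df_x["key"], arrays], axis=1)


df = pd.DataFrame({"num": (23, 42), "list": (np.arange(3), np.arange(1, 4))})
df_x = flatten_arrays(df)      # numeric-only frame, suitable for storage
df_back = restore_arrays(df_x)  # original layout with an array per row
```

The flattened frame mixes string and integer column labels ("num" next to 0, 1, 2); whether those need renaming before writing depends on the storage backend, so that step is left out here.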