Storing a DataFrame with Array Entries

Question

I have a Pandas DataFrame with the following structure, which contains both numbers and numpy arrays of fixed shape:

import pandas as pd
import numpy as np

df = pd.DataFrame({"num": (23, 42), "list": (np.arange(3), np.arange(1, 4))})

Assuming I have large amounts (more than 1 GB) of this data that I would like to store and retrieve quickly, how should I go about storing it? If I use HDF5, the NumPy arrays get pickled, which will affect the ability to retrieve the data quickly. Is there some way to tell HDF5 how to store NumPy arrays? Alternatively, should I not be using HDF5 at all?

A GitHub thread seems to suggest two options:

  1. Create a function that gets the desired Numpy array, which is stored in some other format [1]
  2. Create a class to inform HDF5 [2]

Both of these solutions seem oddly specific for how common I imagine this problem to be. Are there more general approaches? Am I just using the wrong tool?

Recommended answer

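I mean something like this: expand the fixed-shape arrays into ordinary numeric columns, grouped under a second column level: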

df_x = pd.concat([df.num, pd.DataFrame(np.vstack(df.list))], 
                 keys=["key", "arr"], axis=1)

The resulting DataFrame:

  key arr      
  num   0  1  2
0  23   0  1  2
1  42   1  2  3


Convert back with:

pd.concat([df_x.key, pd.Series(tuple(df_x.arr.values), name='list')], axis=1)

   num       list
0   23  [0, 1, 2]
1   42  [1, 2, 3]
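
The two steps above can be packaged into a pair of helpers and checked as a round trip. This is only a minimal sketch: the function names flatten_arrays and restore_arrays are hypothetical (they do not come from pandas or from the answer), and it assumes every array in the column has the same length. It does not write to HDF5; it only shows that the flattened frame holds plain numeric columns and that the original layout can be recovered.

```python
import numpy as np
import pandas as pd


def flatten_arrays(df, array_col, scalar_cols):
    """Expand a column of equal-length arrays into one numeric column per element.

    Hypothetical helper: assumes all arrays in `array_col` share the same shape.
    """
    # Stack the per-row arrays into a 2D block, one row per original row.
    arrays = pd.DataFrame(np.vstack(list(df[array_col])), index=df.index)
    # Group scalar columns under "key" and array elements under "arr".
    return pd.concat([df[scalar_cols], arrays], keys=["key", "arr"], axis=1)


def restore_arrays(df_x, array_col):
    """Rebuild the original layout: scalar columns plus one array per row."""
    # Each row of the 2D block becomes one array entry again.
    rows = list(df_x["arr"].to_numpy())
    arrays = pd.Series(rows, index=df_x.index, name=array_col)
    return pd.concat([df_x["key"], arrays], axis=1)


df = pd.DataFrame({"num": [23, 42], "list": [np.arange(3), np.arange(1, 4)]})

df_x = flatten_arrays(df, "list", ["num"])   # numeric columns only
df_back = restore_arrays(df_x, "list")       # same layout as df

print(df_x.dtypes)
print(df_back)
```

Because df_x contains no object-dtype column, it avoids the pickling the question is worried about; whether a particular HDF5 writer accepts the two-level column labels as they are is something to verify for your own setup.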
