Pandas: saving Series of dictionaries to disk


Problem description

I have a python pandas Series of dictionaries:

id           dicts
1            {'5': 1, '8': 20, '1800': 2}
2            {'2': 2, '8': 1, '1000': 25, '1651': 1}
...          ...
...          ...
...          ...
20000000     {'2': 1, '10': 20}

The (key, value) pairs in the dictionaries represent ('feature', count). About 2000 unique features exist.

The Series' memory usage in pandas is about 500MB. What would be the best way to write this object to disk (having ideally low disk space usage, and being fast to write and fast to read back in afterwards)?

Options considered (and tried for the first two):
- to_csv (but treats the dictionaries as strings, so conversion back to dictionaries afterwards is very slow)
- cPickle (but ran out of memory during execution)
- conversion to a scipy sparse matrix structure

Answer

I'm curious as to how your Series only takes up 500MB. If you are using the .memory_usage method, this only returns the total memory used by each Python object reference, which is all your Series is storing. That doesn't account for the actual memory of the dictionaries. A rough calculation of 20,000,000 * 288 bytes = 5.76GB should be your memory usage. That 288 bytes is a conservative estimate of the memory required by each dictionary.
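If you want to check this yourself, here is a minimal sketch (assuming the Series is named dict_series, as in the code below; note that even .memory_usage(deep=True) sizes only the dict objects themselves, not the keys and values they reference):

import sys

# Shallow: only the 8-byte object references the Series stores (64-bit Python).
print(dict_series.memory_usage())
# Deep: adds sys.getsizeof of each dict, still excluding its keys/values.
print(dict_series.memory_usage(deep=True))
# Per-row estimate that also counts keys and values, from one sample dict:
d = dict_series.iloc[0]
print(len(dict_series) * (sys.getsizeof(d)
      + sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items())))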

Anyway, try the following approach to convert your data into a sparse-matrix representation:

import numpy as np, pandas as pd
from sklearn.feature_extraction import DictVectorizer
from scipy.sparse import csr_matrix
import pickle

I would use ints rather than strings as keys, as this will keep the right order later on. So, assuming your series is named dict_series:

dict_series = dict_series.apply(lambda d: {int(k): v for k, v in d.items()})

This might be memory intensive, and you may be better off simply creating your Series of dicts using ints as keys from the start. Or you can simply skip this step. Now, to construct your sparse matrix:

dv = DictVectorizer(dtype=np.int32)
sparse = dv.fit_transform(dict_series)
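As a quick sanity check on a toy series (hypothetical values), each unique feature becomes one column of the matrix, and int keys come back in numeric order rather than lexical order:

toy = pd.Series([{5: 1, 8: 20}, {2: 2, 8: 1}])
toy_dv = DictVectorizer(dtype=np.int32)
toy_sparse = toy_dv.fit_transform(toy)
print(toy_sparse.shape)       # (2, 3): one column per unique feature
print(toy_dv.feature_names_)  # [2, 5, 8]: sorted numerically, not lexically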


Saving to disk

Now, essentially, your sparse matrix can be reconstructed from 3 fields: sparse.data, sparse.indices, sparse.indptr, and, optionally, sparse.shape. The fastest and most memory-efficient way to save and load the arrays sparse.data, sparse.indices and sparse.indptr is to use the np.ndarray tofile method, which saves the arrays as raw bytes. From the documentation:

This is a convenience function for quick storage of array data. Information on endianness and precision is lost, so this method is not a good choice for files intended to archive data or transport data between machines with different endianness.

So this method loses any dtype information and endianness. The former issue can be dealt with simply by making a note of the datatype beforehand; you'll be using np.int32 anyway. The latter issue isn't a problem if you are working locally, but if portability is important, you will need to look into alternate ways of storing the information.

# to save
sparse.data.tofile('data.dat')
sparse.indices.tofile('indices.dat')
sparse.indptr.tofile('indptr.dat')
# don't forget your dict vectorizer!
with open('dv.pickle', 'wb') as f:
    pickle.dump(dv,f) # pickle your dv to be able to recover your original data!
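If portability across machines does matter, a hedged alternative (assuming scipy >= 0.19) is scipy.sparse.save_npz, which stores data, indices, indptr, shape and dtype together in one portable .npz file, at some cost in speed:

from scipy.sparse import save_npz, load_npz

save_npz('matrix.npz', sparse)   # single portable file
# ... later:
sparse = load_npz('matrix.npz')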

To recover everything:

with open('dv.pickle', 'rb') as f:
    dv = pickle.load(f)

# the shape is inferred from the arrays here; pass shape=... as a keyword
# argument if you also saved sparse.shape
sparse = csr_matrix((np.fromfile('data.dat', dtype=np.int32),
                     np.fromfile('indices.dat', dtype=np.int32),
                     np.fromfile('indptr.dat', dtype=np.int32)))

original = pd.Series(dv.inverse_transform(sparse))
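As a final, hypothetical round-trip check (the recovered keys are the int feature names; note that any zero-valued counts in the original dicts would have been dropped by the sparse representation):

print(original.iloc[0] == dict_series.iloc[0])  # True, barring zero counts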
