Why does loading a pickle file into memory take much more space?


Problem Description


I have a folder containing 7603 files saved by pickle.dump. The average file size is 6.5MB, so the total disk space the files take up is about 48GB.

Each file is obtained by pickling a list object; the list has the following structure (a rough sketch that builds one such list is shown below the structure):

[A * 50] 
 A = [str, int, [92 floats], B * 3] 
                             B = [C * about 6] 
                                  C = [str, int, [92 floats]]
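
For concreteness, one such list could be built roughly like this (a sketch with dummy names and values; the exact nesting of the three B entries inside A is my reading of the description above):

import pickle
from random import random

def make_C():
    # C = [str, int, [92 floats]]
    return ["c_label", 0, [random() for _ in range(92)]]

def make_B():
    # B = [C * about 6]
    return [make_C() for _ in range(6)]

def make_A():
    # A = [str, int, [92 floats], B * 3]
    return ["a_label", 0, [random() for _ in range(92)]] + [make_B() for _ in range(3)]

record = [make_A() for _ in range(50)]   # the whole file: [A * 50]

with open("example.pkl", "wb") as f:     # hypothetical file name
    pickle.dump(record, f)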

The memory of the computer I'm using is 128GB.

However, I cannot load all the files in the folder into memory with this script:

import pickle
import multiprocessing as mp
import sys
from os.path import join
from os import listdir
import os

def one_loader(the_arg):
    # Unpickle one file, log the current memory usage, and return (file name, object).
    with open(the_arg, 'rb') as source:
        temp_fp = pickle.load(source)
    the_hash = the_arg.split('/')[-1]
    os.system('top -bn 1 | grep buff >> memory_log')
    return (the_hash, temp_fp)

def process_parallel(the_func, the_args):
    # Load all files with a pool of 25 worker processes and collect them into a dict.
    pool = mp.Pool(25)
    result = dict(pool.map(the_func, the_args))
    pool.close()
    return result

node_list = sys.argv[-1]
db_path = node_list   # folder containing the pickle files
the_hashes = listdir(db_path)
the_files = [join(db_path, item) for item in the_hashes]
fp_dict = process_parallel(one_loader, the_files)

I log the memory usage from within the script (the top call above) and have plotted it:

[memory usage plot]

There are several things about this plot that confuse me:

1. 4000 files take 25GB of disk space, but why do they take more than 100GB of memory?

2. After the sudden drop in memory usage, I received no error, and I could see with the top command that the script was still running. But I have no idea what the system was doing, or where the rest of the memory went.

Solution

That is simply because serialized data takes less space than the memory needed to manage the same objects at runtime: every Python object carries its own header (reference count, type pointer, and so on), and containers hold pointers to individually boxed elements rather than raw values.

Example with a string:

import pickle

with open("foo","wb") as f:
    pickle.dump("toto",f)

foo is 14 bytes on disk (including the pickle header and so on), but in memory it is much bigger:

>>> import sys
>>> sys.getsizeof('toto')
53

For a dictionary it is even worse, because of the hash table (and other overhead):

import pickle,os,sys

d = {"foo":"bar"}
with open("foo","wb") as f:
    pickle.dump(d,f)
print(os.path.getsize("foo"))
print(sys.getsizeof(d))

result:

27
288

So that is roughly a 1-to-10 ratio (the exact numbers vary with the Python version).
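
The same arithmetic explains the roughly four-fold blow-up in the question, where the data is mostly lists of 92 floats: in memory each float is a 24-byte object plus an 8-byte pointer inside its list, while the pickle stores it in about 9 bytes. Note also that sys.getsizeof is shallow (above it does not even count the "foo" and "bar" strings inside the dict), so the true footprint is larger still. A minimal sketch, using a hand-rolled deep_size helper (not a standard library function):

import pickle
import sys
from random import random

def deep_size(obj):
    # Rough recursive size: the list object itself plus all of its elements.
    size = sys.getsizeof(obj)
    if isinstance(obj, list):
        size += sum(deep_size(item) for item in obj)
    return size

floats = [random() for _ in range(92)]   # like the inner [92 floats] lists
on_disk = len(pickle.dumps(floats))      # roughly what the pickle costs on disk
resident = deep_size(floats)             # list object + 92 boxed float objects

print(on_disk, resident, resident / on_disk)   # something like a 3-4x ratio on 64-bit CPython

Multiplied across thousands of files, that per-list overhead is consistent with 25GB of pickles turning into well over 100GB of live objects in the interpreter.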
