Store large dictionary to file in Python

Problem description

I have a dictionary with many entries and a huge vector as each value. These vectors can be 60,000 dimensions long, and I have about 60,000 entries in the dictionary. To save time, I want to store this after computation. However, using pickle produced a huge file. I have tried storing to JSON, but the file remains extremely large (around 10.5 MB on a sample of 50 entries with fewer dimensions). I have also read about sparse matrices. As most entries will be 0, this is a possibility. Will this reduce the file size? Is there any other way to store this information? Or am I just unlucky?

Update:

Thank you all for the replies. I want to store this data because these are word counts. For example, when given sentences, I store the number of times word 0 (at location 0 in the array) appears in the sentence. There are obviously more words across all sentences than appear in any one sentence, hence the many zeros. Then, I want to use this array to train at least three, maybe six, classifiers. It seemed easier to create the arrays with word counts and then run the classifiers overnight to train and test. I use sklearn for this. This format was chosen to be consistent with other feature vector formats, which is why I am approaching the problem this way. If this is not the way to go, please let me know. I am very much aware that I have much to learn about coding efficiently!
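
Since these are per-sentence word counts destined for sklearn, one option is to let sklearn build the sparse count matrix directly instead of constructing dense arrays first. A minimal sketch, assuming scikit-learn's CountVectorizer (the sentences and variable names are illustrative, not from the original question):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative sentences; in practice these would be the real corpus.
sentences = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat chased the dog",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)  # scipy.sparse CSR matrix of word counts

print(X.shape)  # (3, vocabulary_size)
print(X.nnz)    # number of stored (nonzero) counts -- the zeros cost nothing
```

The resulting CSR matrix can be passed straight to most sklearn classifiers, so the dense 60,000-dimension arrays never need to exist in memory or on disk.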

I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).
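
A common reason a "sparse" file comes out larger is that the matrix gets serialized in a dense or generic-pickle form. A sketch of a compact round trip, assuming scipy's save_npz/load_npz (available in scipy 0.19+; the shapes and file name are illustrative):

```python
import numpy as np
from scipy import sparse

# Illustrative mostly-zero count matrix: 300 sentences x 60,000 words.
rng = np.random.default_rng(0)
dense = np.zeros((300, 60_000), dtype=np.int64)
rows = rng.integers(0, 300, size=3_000)
cols = rng.integers(0, 60_000, size=3_000)
dense[rows, cols] = rng.integers(1, 6, size=3_000)

counts = sparse.csr_matrix(dense)
sparse.save_npz("counts.npz", counts)      # stores only the nonzeros, compressed
restored = sparse.load_npz("counts.npz")   # round-trips as a CSR matrix
assert (restored != counts).nnz == 0
```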

Update 2: Thank you all for the tips. John Mee was right that I did not need to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!

Answer

See my answer to a very closely related question (https://stackoverflow.com/a/25244747/2379433) if you are OK with pickling to several files instead of a single file.

Also see https://stackoverflow.com/a/21948720/2379433 for other potential improvements, as well as https://stackoverflow.com/a/24471659/2379433.

If you are using numpy arrays, storage can be very efficient, as both klepto and joblib understand how to use a minimal state representation for an array. If most elements of your arrays are indeed zeros, then by all means convert to sparse matrices... and you will see huge savings in the storage size of the array.
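
As an illustration of combining the two ideas, a sparse conversion plus joblib's built-in compression looks roughly like this; a sketch, assuming joblib's dump/load with its compress option (the array contents and file name are illustrative):

```python
import numpy as np
import joblib
from scipy import sparse

# Illustrative array that is almost entirely zeros.
arr = np.zeros((50, 60_000), dtype=np.int64)
arr[0, :10] = 1

# compress=3 applies zlib compression level 3 to the pickled payload.
joblib.dump(sparse.csr_matrix(arr), "counts.pkl", compress=3)
restored = joblib.load("counts.pkl").toarray()
assert np.array_equal(arr, restored)
```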

As the links above discuss, you could use klepto -- which gives you the ability to easily store dictionaries to disk or to a database, using a common API. klepto also lets you pick a storage format (pickle, json, etc.) -- with HDF5 coming soon. It can utilize both specialized pickle formats (like numpy's) and compression (if you care about size rather than speed).
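
A short sketch of the kind of klepto usage described, assuming its dir_archive backend (one serialized file per key on disk; the archive name and keys are illustrative):

```python
from klepto.archives import dir_archive

# 'wordcounts' becomes a directory on disk, one pickled file per key.
db = dir_archive("wordcounts", {}, serialized=True, cached=True)
db["sentence_0"] = [0, 2, 0, 0, 1]   # e.g. a count vector
db["sentence_1"] = [1, 0, 0, 3, 0]
db.dump()    # flush the in-memory cache to the archive on disk
del db

# Reopen the archive later and pull everything back into memory.
db = dir_archive("wordcounts", {}, serialized=True, cached=True)
db.load()
print(db["sentence_0"])
```

With cached=False, reads and writes go directly against the on-disk archive instead of an in-memory cache, which keeps memory flat for dictionaries that do not fit in RAM.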

klepto gives you the option to store the dictionary as an "all-in-one" file or as one file per entry, and it can also leverage multiprocessing or multithreading -- meaning you can save and load dictionary items to/from the backend in parallel.
