Integrating multiple dictionaries in Python (big data)


Problem description


I am working on a research project in big data mining. I have written code to organize my data into a dictionary. However, the amount of data is so huge that my computer runs out of memory while building the dictionary. I need to periodically write the dictionary to disk and create multiple dictionaries this way. I then need to compare the resulting dictionaries, update the keys and values accordingly, and store the whole thing in one big dictionary on disk. Any idea how I can do this in Python? I need an API that can quickly write a dict to disk, then compare two dicts and update keys. I can actually write the code to compare two dicts; that's not the problem. I just need to do it without running out of memory.


My dict looks like this: "orange": ["It is a fruit", "It is very tasty", ...]
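One standard-library option for the spill-and-merge step described above is the shelve module, which keeps a dict-like mapping on disk. Below is a minimal sketch, not a definitive implementation: the sample pairs, the shelf file name merged_dict, and the batch threshold are all illustrative assumptions.

import shelve

# Illustrative sample; in practice this would be the mining output stream.
pairs = [
    ("orange", "It is a fruit"),
    ("orange", "It is very tasty"),
    ("apple", "It is a fruit"),
]

BATCH_LIMIT = 100_000  # assumed threshold; tune to available memory

def spill(batch, path="merged_dict"):
    """Merge an in-memory batch into the on-disk shelf,
    extending existing value lists instead of overwriting them."""
    with shelve.open(path) as db:
        for key, values in batch.items():
            db[key] = db.get(key, []) + values

batch = {}
for key, sentence in pairs:
    batch.setdefault(key, []).append(sentence)
    if len(batch) >= BATCH_LIMIT:
        spill(batch)   # write the partial dictionary to disk
        batch.clear()  # free memory before continuing
spill(batch)           # flush whatever is left

Only one batch ever lives in memory at a time; the merge logic (extending the value lists) runs against the on-disk shelf, which is the part that would otherwise exhaust RAM.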

Recommended answer


Agree with Hoffman: go for a relational database. Data processing is a somewhat unusual task for a relational engine, but believe me, it is a good compromise between ease of use/deployment and speed on large datasets.


I customarily use sqlite3, which ships with Python, although I more often use it through apsw. The advantage of a relational engine like sqlite3 is that you can instruct it to do a lot of processing with your data through joins and updates, and it will handle all the memory/disk swapping of the required data in quite a sensible manner. You can also use in-memory databases to hold small data that needs to interact with your big data, and link them through "ATTACH" statements. I have processed gigabytes this way.
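A minimal sketch of that pattern with the built-in sqlite3 module: an on-disk table holds the big key/value data, and a small lookup table lives in an attached in-memory database so SQLite can join across the two. The file name big_data.db, the table names, and the sample rows are illustrative assumptions.

import sqlite3

# On-disk database holds the big key -> sentence data.
con = sqlite3.connect("big_data.db")
con.execute("CREATE TABLE IF NOT EXISTS facts (key TEXT, value TEXT)")
con.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("orange", "It is a fruit"), ("orange", "It is very tasty")],
)

# Attach a separate in-memory database for small working data.
con.execute("ATTACH DATABASE ':memory:' AS mem")
con.execute("CREATE TABLE mem.wanted (key TEXT PRIMARY KEY)")
con.execute("INSERT INTO mem.wanted VALUES ('orange')")

# SQLite joins across the attached databases and takes care of
# paging data to and from disk as needed.
for key, value in con.execute(
    "SELECT f.key, f.value FROM facts f JOIN mem.wanted w ON f.key = w.key"
):
    print(key, value)

con.commit()
con.close()

The point of the design is that the merge/update logic moves into SQL, so the engine, not your Python process, decides how much of the dataset is held in memory at any moment.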

