Storing a 7-million-key Python dictionary in a database

Question
I have to handle a dictionary with 7 million keys (the number of keys can eventually grow to ~50 million). Since I barely have enough RAM to keep it in memory, I've decided to store it on disk.
My dictionary looks like this:
dictionary = {(int1, int2): int3, ...}
First I tried to store it in an SQLite database using sqlite3.
The amount of time required to store it is perfectly OK (around 70 secs), measured with timeit:
>>> import sqlite3
>>> conn = sqlite3.connect('test_sqlite.sqlite')
>>> c = conn.cursor()
>>> c.execute('create table test (int1 int, int2 int, int3 int)')
>>> conn.commit()
>>> conn.close()
>>> import timeit
>>> timeit.timeit('c.executemany("insert into test values (?,?,?)", ((key[0], key[1], dictionary[key]) for key in dictionary.iterkeys()))', setup='import sqlite3; conn = sqlite3.connect("test_sqlite.sqlite"); c = conn.cursor(); dictionary = {(i, i+1): i+2 for i in xrange(7000000)}', number=1)
70.7033872604
But then I need to use this stored dictionary to retrieve certain values, and each SELECT seems to take approximately 1.5 secs. Since I need to access around one million values, this is discouraging:
>>> timeit.timeit('c.execute("select int1 from test where int2 == {}".format(value)).fetchone()[0]', setup='import sqlite3; conn = sqlite3.connect("test_sqlite.sqlite"); c = conn.cursor(); value = 5555', number=1)
1.5300869941711426
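(A quick check with EXPLAIN QUERY PLAN, sketched here against the table created above, confirms the cause: with no index on int2, every such SELECT scans all 7 million rows. The exact wording of the plan output varies with the SQLite version.)

>>> # with no index on int2 this reports a full table scan, e.g. 'SCAN TABLE test'
>>> c.execute('explain query plan select int1 from test where int2 = ?', (5555,)).fetchall()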
Then I tried updating my dictionary in a shelf instead. Now the amount of time needed to get a value from my shelved dictionary is fairly good:
>>> timeit.timeit('a = f[key]', setup='import shelve; f = shelve.open("test_timeit", "r"); key = "1000"', number=10000)
0.320019006729126
So even though I make several million requests like this one, the total time should be around a hundred secs (0.32 secs for 10,000 lookups is roughly 32 µs per lookup).
But a new problem arose: the time required to store my dictionary in a shelf doesn't satisfy me.
>>> timeit.timeit('f.update(dictio)', setup='import shelve; f = shelve.open("test_timeit", "c"); dictio = {"({},{})".format(i, i+1): i+2 for i in xrange(7000000)}', number=1)
504.728841782
One must add to this the extra time required to convert the former keys (which are tuples) to strings. Using repr:
>>> timeit.timeit('repr.repr((1,2))', setup='import repr', number=7000000)
61.6035461426
Which makes a total of 566.332387924 secs to update my dictionary into a shelf...
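(Note that whichever conversion is used when storing must be reproduced exactly at lookup time: repr((1000, 1001)) yields '(1000, 1001)' with a space, while the "({},{})".format(...) keys above have none. A minimal lookup sketch against the shelf built above:)

>>> import shelve
>>> f = shelve.open('test_timeit', 'r')
>>> f['({},{})'.format(1000, 1001)]   # same string the key was stored under
1002
>>> f.close()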
I don't want to pickle my dictionary, since that would mean loading the whole dictionary into memory whenever I want to use it later.
Is there any way I can improve one of these two methods in order to get better access/loading times?
Thanks for your help!
For queries on large tables like this to return quickly, you need to index the relevant columns. In your case I would add this as the primary key.
create table test (
    int1 integer,
    int2 integer,
    int3 integer,
    primary key (int1, int2)
)
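With (int1, int2) as the primary key, a lookup on the full key becomes an index seek rather than a table scan. A minimal sketch of rebuilding the table and querying it with a parameterized statement (the data mirrors the dictionary from the question; timings will vary by machine):

>>> import sqlite3
>>> conn = sqlite3.connect('test_sqlite.sqlite')
>>> c = conn.cursor()
>>> c.execute('drop table if exists test')
>>> c.execute('create table test (int1 integer, int2 integer, int3 integer, primary key (int1, int2))')
>>> c.executemany('insert into test values (?,?,?)',
...               ((i, i + 1, i + 2) for i in xrange(7000000)))
>>> conn.commit()
>>> # the composite primary key lets SQLite seek straight to the row
>>> c.execute('select int3 from test where int1 = ? and int2 = ?', (5555, 5556)).fetchone()
(5557,)

Note that a query constraining only int2 would still scan the table, since the index is ordered by int1 first; if lookups on int2 alone are also needed, a separate index on int2 would be required.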