mongodb 4x slower than sqlite, 2x slower than csv?


Problem description

I am comparing the performance of the two databases, plus CSV. The data is 1 million rows by 5 columns of floats, bulk-inserted into sqlite/mongodb/csv from Python.

import csv
import sqlite3
import pymongo
import numpy as np
from time import time

N, M = 1000000, 5
data = np.random.rand(N, M)
docs = [{str(j): data[i, j] for j in range(len(data[i]))} for i in range(N)]

Writing to csv takes 6.7 seconds:

%%time
with open('test.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    for i in range(N):
        writer.writerow(data[i])

Writing to sqlite3 takes 3.6 seconds:

%%time
con = sqlite3.connect('test.db')
con.execute('create table five(a, b, c, d, e)')
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)

Writing to mongo takes 14.2 seconds:

%%time
with pymongo.MongoClient() as client:
    start_w = time()
    client['warmup']['warmup'].insert_many(docs)   # warmup write, as @D. SM suggested
    end_w = time()
    db = client['test']
    coll = db['test']
    start = time()
    coll.insert_many(docs)                         # the timed 'real' insert
    end = time()

I am still new to this, but is it expected that MongoDB could be 4x slower than sqlite, and 2x slower than CSV, in scenarios like this? This is with MongoDB v4.4 (WiredTiger engine) and Python 3.8.

I know MongoDB excels when there is no fixed schema, but when every document has exactly the same key:value pairs, as in the example above, are there ways to speed up the bulk insert?

I tested adding a warmup before the 'real' write, as @D. SM suggested. It helps, but overall MongoDB is still the slowest of the pack. What I mean is: the total wall time is 23.9s (warmup 14.2 + real insert 9.6). What's interesting is that the CPU times total 18.1s, meaning 23.9 - 18.1 = 5.8s was spent inside the .insert_many() method waiting on TCP/IO? That sounds like a lot.
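
For reference, a minimal sketch of how the wall/CPU split can be measured around the insert (perf_counter and process_time are standard-library calls; this assumes the docs list built above and a local mongod):

from time import perf_counter, process_time
import pymongo

with pymongo.MongoClient() as client:
    coll = client['test']['test']
    wall_start = perf_counter()    # wall-clock time
    cpu_start = process_time()     # CPU time consumed by this process only
    coll.insert_many(docs)
    wall = perf_counter() - wall_start
    cpu = process_time() - cpu_start
    # the difference is roughly the time spent blocked on the server / network
    print('wall %.1fs, cpu %.1fs, waiting ~%.1fs' % (wall, cpu, wall - cpu))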

In any case, even if I use the warmup and disregard the IO wait time, the remaining time for the actual write is still likely larger than the CSV write, which is a million write() calls! Apparently the csv writer does a much better job of buffering/caching. Did I get something seriously wrong here?

Another, somewhat related question: the size of the collection file (/var/lib/mongodb/collection-xxx) does not seem to grow linearly. Starting from the first batch, each additional million inserts grows it by 57MB, 15MB, 75MB, 38MB, 45MB, 68MB. I understand that the size of compressed random data can vary, but the variation seems quite large. Is this expected?
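
One way to check the sizes from the server side, instead of reading the collection files directly, is the collStats command (a sketch, assuming the test/test collection from above):

import pymongo

with pymongo.MongoClient() as client:
    stats = client['test'].command('collStats', 'test')
    # size is the uncompressed BSON size, storageSize the compressed on-disk size
    print(stats['count'], stats['size'], stats['storageSize'])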

Answer

MongoDB clients connect to the servers in the background. If you want to benchmark inserts, a more accurate test would be something like this:

with pymongo.MongoClient() as client:
  client['warmup']['warmup'].insert_many(docs)
  db = client['test']
  coll = db['test']
  start = time()
  coll.insert_many(docs)
  end = time()

Keep in mind that insert_many performs a bulk write, and there are limits on bulk write sizes; in particular, there can be only 1000 commands per bulk write. If you are sending 1 million inserts, you could be looking at 2000 splits per bulk write, all of which involve data copies. Test inserting 1000 documents at a time vs. other batch sizes.
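
A minimal sketch of how different batch sizes could be compared by chunking the inserts by hand (the collection names and batch sizes below are just placeholders, and docs is the list from the question):

import time
import pymongo

def timed_insert(coll, docs, batch_size):
    # send the documents in explicit chunks instead of one huge insert_many
    start = time.time()
    for i in range(0, len(docs), batch_size):
        coll.insert_many(docs[i:i + batch_size])
    return time.time() - start

with pymongo.MongoClient() as client:
    for batch_size in (1000, 10000, 100000):
        coll = client['test']['batch_%d' % batch_size]
        coll.delete_many({})
        print(batch_size, timed_insert(coll, docs, batch_size))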

Working test:

import csv
import sqlite3
import pymongo, time

N, M = 1000000, 5
# build N documents with identical keys; give each its own dict and a unique _id
docs = [{'_id': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}] * N
for i in range(len(docs)):
    docs[i] = dict(docs[i])
    docs[i]['_id'] = i
data = [tuple(doc.values()) for doc in docs]

with open('test.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    start = time.time()
    for i in range(N):
        writer.writerow(data[i])
    end = time.time()
    print('%f' %( end-start))


con = sqlite3.connect('test.db')
con.execute('drop table if exists five')
con.execute('create table five(a, b, c, d, e)')
start = time.time()
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
end = time.time()
print('%f' %( end-start))

with pymongo.MongoClient() as client:
  client['warmup']['warmup'].delete_many({})
  client['test']['test'].delete_many({})
  client['warmup']['warmup'].insert_many(docs)
  db = client['test']
  coll = db['test']
  start = time.time()
  coll.insert_many(docs)
  end = time.time()
print('%f' %( end-start))

Results:

risque% python3 test.py
0.001464
0.002031
0.022351

risque% python3 test.py
0.013875
0.019704
0.153323

risque% python3 test.py
0.147391
0.236540
1.631367

risque% python3 test.py
1.492073
2.063393
16.289790

MongoDB is about 8x the sqlite time.

Is this expected? Perhaps. The comparison between sqlite and mongodb doesn't reveal much besides that sqlite is markedly faster. But, naturally, this is expected, since mongodb uses a client/server architecture while sqlite is an in-process database, which means (a rough sketch of the round-trip cost follows this list):

  • The client must serialize the data to send it to the server
  • The server must deserialize that data
  • The server must then parse the request and figure out what to do
  • The server needs to write the data in a scalable/concurrent way (sqlite simply errors out with concurrent write errors, as far as I remember)
  • The server needs to compose a response back to the client, serialize that response, and write it to the network
  • The client needs to read the response, deserialize it, and check it for success
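
For a rough feel of that per-round-trip overhead, here is a sketch comparing a loop of insert_one calls with a single insert_many over a small slice of the docs list from the test above (the collection name is a placeholder; numbers will vary):

import time
import pymongo

with pymongo.MongoClient() as client:
    coll = client['test']['roundtrip']
    sample = docs[:10000]              # small slice so the one-by-one loop stays quick

    coll.delete_many({})
    start = time.time()
    for doc in sample:
        coll.insert_one(doc)           # one request/response round trip per document
    one_by_one = time.time() - start

    coll.delete_many({})
    start = time.time()
    coll.insert_many(sample)           # the same documents in a handful of round trips
    batched = time.time() - start

print('insert_one loop %.2fs, insert_many %.2fs' % (one_by_one, batched))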

5.8s was spent inside the .insert_many() method waiting on TCP/IO? That sounds like a lot.

Compared to what - an in-process database that does not do any network I/O?

the remaining time for the actual write is still likely larger than the CSV write, which is a million write() calls

The physical write calls are a small part of what goes into data storage by a modern database.

Besides which, neither case involves a million of them. When you write to a file, the writes are buffered by Python's standard library before they are even sent to the kernel - you would have to call flush() after each line to actually produce a million write() calls. In a database, the writes are similarly performed on a page-by-page basis, not per individual document.
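
A quick way to see what that buffering is worth (a sketch; flushing after every row is deliberately pessimal):

import csv, time
import numpy as np

N, M = 100000, 5                        # smaller N so the flushed variant finishes quickly
data = np.random.rand(N, M)

def write_csv(path, flush_every_row):
    start = time.time()
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for row in data:
            writer.writerow(row)
            if flush_every_row:
                f.flush()               # force a separate write() to the kernel per row
    return time.time() - start

print('buffered %.2fs' % write_csv('buffered.csv', False))
print('flushed  %.2fs' % write_csv('flushed.csv', True))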
