Mongoengine is very slow on large documents compared to native pymongo usage


Question

I have the following mongoengine model:

class MyModel(Document):
    date        = DateTimeField(required = True)
    data_dict_1 = DictField(required = False)
    data_dict_2 = DictField(required = True)

In some cases the document in the DB can be very large (around 5-10MB), and the data_dict fields contain complex nested documents (dict of lists of dicts, etc...).

I have encountered two (possibly related) issues:

  1. When I run native pymongo find_one() query, it returns within a second. When I run MyModel.objects.first() it takes 5-10 seconds.
  2. When I query a single large document from the DB, and then access its field, it takes 10-20 seconds just to do the following:

m = MyModel.objects.first()
val = m.data_dict_1.get(some_key)

The data in the object does not contain any references to any other objects, so it is not an issue of objects dereferencing.
I suspect it is related to some inefficiency of the internal data representation of mongoengine, which affects the document object construction as well as fields access. Is there anything I can do to improve this ?

Answer

TL;DR: mongoengine is spending ages converting all the returned arrays to dicts

To test this out I built a collection with a document with a DictField with a large nested dict. The doc being roughly in your 5-10MB range.

We can then use timeit.timeit to confirm the difference in reads using pymongo and mongoengine.

We can then use pycallgraph and GraphViz to see what is taking mongoengine so damn long.

The full test code:

import datetime
import itertools
import random
import timeit
from collections import defaultdict

import mongoengine as db
from pycallgraph.output.graphviz import GraphvizOutput
from pycallgraph.pycallgraph import PyCallGraph

db.connect("test-dicts")


class MyModel(db.Document):
    date = db.DateTimeField(required=True, default=datetime.date.today)
    data_dict_1 = db.DictField(required=False)


MyModel.drop_collection()

data_1 = ['foo', 'bar']
data_2 = ['spam', 'eggs', 'ham']
data_3 = ["subf{}".format(f) for f in range(5)]

m = MyModel()
tree = lambda: defaultdict(tree)  # http://stackoverflow.com/a/19189366/3271558
data = tree()
for _d1, _d2, _d3 in itertools.product(data_1, data_2, data_3):
    data[_d1][_d2][_d3] = list(random.sample(range(50000), 20000))
m.data_dict_1 = data
m.save()


def pymongo_doc():
    return db.connection.get_connection()["test-dicts"]['my_model'].find_one()


def mongoengine_doc():
    return MyModel.objects.first()


if __name__ == '__main__':
    print("pymongo took {:2.2f}s".format(timeit.timeit(pymongo_doc, number=10)))
    print("mongoengine took", timeit.timeit(mongoengine_doc, number=10))
    with PyCallGraph(output=GraphvizOutput()):
        mongoengine_doc()

And the output proves that mongoengine is being very slow compared to pymongo:

pymongo took 0.87s
mongoengine took 25.81118331072267

The resulting call graph illustrates pretty clearly where the bottleneck is:

Essentially mongoengine will call the to_python method on every DictField that it gets back from the db. to_python is pretty slow and in our example it's being called an insane number of times.
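One way to sidestep that conversion cost while staying inside mongoengine is the queryset's `as_pymongo()` method, which returns the raw driver dicts instead of Document instances. A minimal sketch (the helper name is mine, and it assumes the `MyModel` class from the benchmark above):

```python
def get_raw_value(model_cls, key):
    # as_pymongo() makes the queryset yield plain dicts straight from
    # pymongo, so to_python is never called on the DictField contents.
    raw = model_cls.objects.as_pymongo().first()
    return (raw or {}).get("data_dict_1", {}).get(key)
```

You lose the Document wrapper (no `save()`, no field validation), but for read-heavy access to large dicts that is usually the point.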

Mongoengine is used to elegantly map your document structure to python objects. If you have very large unstructured documents (which mongodb is great for) then mongoengine isn't really the right tool and you should just use pymongo.
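If you do go the pymongo route and only ever need a few keys, a projection also avoids transferring the full 5-10MB document over the wire. A sketch, assuming the database and collection names from the benchmark above (the helper function is hypothetical):

```python
def sub_key_projection(field, key):
    # pymongo projection document: fetch only <field>.<key>, drop _id
    return {"{}.{}".format(field, key): 1, "_id": 0}

# Usage against a live MongoDB (names as in the benchmark above):
# from pymongo import MongoClient
# coll = MongoClient()["test-dicts"]["my_model"]
# doc = coll.find_one({}, sub_key_projection("data_dict_1", "some_key"))
```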

However, if you know the structure you can use EmbeddedDocument fields to get slightly better performance from mongoengine. I've run similar but not equivalent test code in this gist, and the output is:

pymongo with dict took 0.12s
pymongo with embed took 0.12s
mongoengine with dict took 4.3059175412661075
mongoengine with embed took 1.1639373211854682

So you can make mongoengine faster but pymongo is much faster still.

Update

A good shortcut to the pymongo interface here is to use the aggregation framework:

def mongoengine_agg_doc():
    return list(MyModel.objects.aggregate({"$limit":1}))[0]
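The same trick combines well with a `$project` stage so that only the needed sub-key ever leaves the server. The pipeline-builder helper below is mine, and the `aggregate` call signature varies slightly across mongoengine versions:

```python
def limit_and_project(key, limit=1):
    # Aggregation pipeline: cap at `limit` documents and keep only
    # data_dict_1.<key>, dropping everything else including _id.
    return [
        {"$limit": limit},
        {"$project": {"data_dict_1." + key: 1, "_id": 0}},
    ]

# Usage (assumes the MyModel class and connection from above):
# doc = list(MyModel.objects.aggregate(*limit_and_project("some_key")))[0]
```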
