Memory efficient (constant) and speed optimized iteration over a large table in Django


Problem description


I have a very large table. It's currently in a MySQL database. I use django.

I need to iterate over each element of the table to pre-compute some particular data (maybe if I was better I could do otherwise but that's not the point).

I'd like to keep the iteration as fast as possible with a constant usage of memory.

As already clearly explained in Limiting Memory Use in a *Large* Django QuerySet and Why is iterating through a large Django QuerySet consuming massive amounts of memory?, a simple iteration over all objects in Django will kill the machine, as it will retrieve ALL objects from the database.

Towards a solution

First of all, to reduce your memory consumption you should make sure DEBUG is False (or monkey-patch the cursor: turn off SQL logging while keeping settings.DEBUG?), so that Django isn't storing queries on the connection for debugging.
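(If turning DEBUG off globally is not an option, a minimal workaround sketch is to clear Django's per-connection query log periodically with reset_queries(); process() below is just a hypothetical placeholder.)

from django import db

for i, obj in enumerate(queryset.iterator()):
    process(obj)            # hypothetical per-row work
    if i % 1000 == 0:
        db.reset_queries()  # drop the SQL log that accumulates when DEBUG is True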

But even with that,

for model in Model.objects.all()

is a no go.

Not even with the slightly improved form:

for model in Model.objects.all().iterator()

Using iterator() will save you some memory by not storing the results in the QuerySet's internal cache (though not necessarily on PostgreSQL!); but it will apparently still retrieve the whole objects from the database.

A naive solution

The solution in the first question is to slice the results into chunks of chunk_size based on a counter. There are several ways to write it, but basically they all come down to an OFFSET + LIMIT query in SQL.

something like:

qs = Model.objects.all()
chunk_size = 1000  # rows fetched per query
counter = 0
count = qs.count()
while counter < count:
    for model in qs[counter:counter + chunk_size].iterator():
        yield model
    counter += chunk_size

While this is memory efficient (constant memory usage proportional to chunk_size), it's really poor in terms of speed: as OFFSET grows, both MySQL and PostgreSQL (and likely most DBs) will start choking and slowing down.
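You can see the growing OFFSET by printing the SQL Django generates for each slice (illustrative only):

chunk = qs[counter:counter + chunk_size]
print(chunk.query)  # ... LIMIT <chunk_size> OFFSET <counter>, and the OFFSET keeps growing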

A better solution

A better solution is available in this post by Thierry Schellenbach. It filters on the PK, which is way faster than offsetting (how much faster probably depends on the DB).

import gc

pk = 0
last_pk = qs.order_by('-pk')[0].pk
queryset = qs.order_by('pk')
while pk < last_pk:
    for row in queryset.filter(pk__gt=pk)[:chunksize]:
        pk = row.pk
        yield row
    gc.collect()

This is starting to get satisfactory. Now Memory = O(C), and Speed ~= O(N)

Issues with the "better" solution

The better solution only works when the PK is available in the QuerySet. Unfortunately, that's not always the case, in particular when the QuerySet contains combinations of distinct (group_by) and/or values (ValuesQuerySet).

For that situation the "better solution" cannot be used.
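As an illustration (model and field names are made up), a QuerySet like this yields plain dicts, so there is no row.pk to filter on:

# Illustrative only: values() + distinct() returns dicts without a pk,
# so the PK-chunking trick above cannot be applied.
qs = Model.objects.values('category', 'status').distinct()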

Can we do better?

Now I'm wondering if we can go faster and avoid the issue regarding QuerySets without PK. Maybe using something that I found in other answers, but only in pure SQL: using cursors.

Since I'm quite bad with raw SQL, in particular in Django, here comes the real question:

How can we build a better Django QuerySet iterator for large tables?

My take from what I've read is that we should use server-side cursors; apparently (see references) using a standard Django cursor would not achieve the same result, because by default both the python-MySQL and psycopg connectors cache the results.

Would this really be a faster (and/or more efficient) solution?

Can this be done using raw SQL in django? Or should we write specific python code depending on the database connector?

Server Side cursors in PostgreSQL and in MySQL
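From those references, something along these lines might be the shape of a server-side-cursor iterator. This is an untested, PostgreSQL/psycopg2-only sketch that bypasses the ORM entirely; the function name and batch size are illustrative (psycopg2's "named" cursors are what make the cursor server-side):

from django.db import connection

def raw_server_side_iterator(sql, params=None, batch_size=2000):
    # Untested sketch: a psycopg2 named cursor keeps the result set on the
    # server and streams rows in batches instead of fetching everything.
    connection.ensure_connection()
    with connection.connection.cursor(name='large_table_cursor') as cursor:
        cursor.itersize = batch_size       # rows per round trip to the server
        cursor.execute(sql, params or [])
        for row in cursor:
            yield row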

That's as far as I could get for the moment...

A Django chunked_iterator()

Now, of course, the best would be to have this method work as queryset.iterator(), rather than iterate(queryset), and to be part of Django core or at least a pluggable app.

Update Thanks to "T" in the comments for finding a django ticket that carries some additional information. Differences in connector behaviors make it so that probably the best solution would be to create a specific chunked method rather than transparently extending iterator (sounds like a good approach to me). An implementation stub exists, but there hasn't been any work on it in a year, and it does not look like the author is ready to jump on that yet.
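For reference, here is a rough sketch of what such a standalone chunked method could look like, built on the PK-filtering idea above (the name and default chunk size are just illustrative, and it has the same limitation: the queryset must expose a pk):

def chunked_iterator(queryset, chunk_size=1000):
    # Iterate in primary-key order, pulling chunk_size rows per query,
    # so memory stays constant and no OFFSET is involved.
    queryset = queryset.order_by('pk')
    last_pk = None
    while True:
        chunk = queryset if last_pk is None else queryset.filter(pk__gt=last_pk)
        rows = list(chunk[:chunk_size])
        if not rows:
            return
        for row in rows:
            last_pk = row.pk
            yield row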

Additional Refs:

  1. Why does MYSQL higher LIMIT offset slow the query down?
  2. How can I speed up a MySQL query with a large offset in the LIMIT clause?
  3. http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
  4. postgresql: offset + limit gets to be very slow
  5. Improving OFFSET performance in PostgreSQL
  6. http://www.depesz.com/2011/05/20/pagination-with-fixed-order/
  7. How to get a row-by-row MySQL ResultSet in python
  8. Server Side Cursors in MySQL

Edits:

Django 1.6 is adding persistent database connections

Django Database Persistent Connections

This should facilitate, under some conditions, using cursors. Still, it's beyond my current skills (and time to learn) to implement such a solution...
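For completeness, persistent connections are configured per database via CONN_MAX_AGE (values below are illustrative):

# settings.py -- illustrative values only
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'CONN_MAX_AGE': 600,  # keep the connection open for up to 10 minutes
    }
}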

Also, the "better solution" definitely does not work in all situations and cannot be used as a generic approach, only a stub to be adapted case by case...

Solution

If all you want to do is iterate over everything in the table once, the following is very resource-efficient and far faster than the basic iterator. Note that paging by the primary key is necessary for an efficient implementation, since the offset operation takes linear time.

def table_iterator(model, page_size=10000):
    # Page through the table by primary-key ranges instead of OFFSET/LIMIT.
    # Assumes an integer primary key; gaps in the pk sequence are fine, they
    # just produce smaller (possibly empty) pages.
    try:
        max_pk = model.objects.all().order_by("-pk")[0].pk
    except IndexError:
        return  # empty table
    pages = max_pk // page_size + 1
    for page_num in range(pages):
        lower = page_num * page_size
        page = model.objects.filter(pk__gte=lower, pk__lt=lower + page_size)
        for obj in page:
            yield obj

Usage looks like:

for obj in table_iterator(Model):
    # do stuff
