Memory efficient (constant) and speed optimized iteration over a large table in Django


Question


I have a very large table. It's currently in a MySQL database. I use django.

I need to iterate over each element of the table to pre-compute some particular data (maybe if I was better I could do otherwise but that's not the point).

I'd like to keep the iteration as fast as possible with a constant usage of memory.

As is already clearly explained in Limiting Memory Use in a *Large* Django QuerySet and Why is iterating through a large Django QuerySet consuming massive amounts of memory?, a simple iteration over all objects in django will kill the machine as it will retrieve ALL objects from the database.

Towards a solution

First of all, to reduce memory consumption you should make sure DEBUG is False (or monkey patch the cursor: turn off SQL logging while keeping settings.DEBUG?) so that django isn't storing query history on the connection for debugging.
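
For instance, when DEBUG has to stay on, here is a hedged sketch of keeping that per-connection query log from growing during a long loop (the wrapper is mine; django.db.reset_queries() is the relevant call):

from django import db

def iterate_with_bounded_query_log(queryset):
    # With DEBUG=True, Django appends every executed query to
    # connection.queries; clearing it periodically keeps memory bounded.
    for i, obj in enumerate(queryset.iterator()):
        if i % 1000 == 0:
            db.reset_queries()
        yield obj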

But even with that,

for model in Model.objects.all():

is a no go.

Not even with the slightly improved form:

for model in Model.objects.all().iterator():

Using iterator() will save some memory by not storing the results in the internal queryset cache (though not necessarily on PostgreSQL!); but it will apparently still retrieve the whole set of objects from the database.

A naive solution

The solution from the first question is to slice the results into chunks of chunk_size based on a counter. There are several ways to write it, but basically they all come down to an OFFSET + LIMIT query in SQL.

something like:

chunk_size = 1000  # number of rows per slice
qs = Model.objects.all()
counter = 0
count = qs.count()
while counter < count:
    for model in qs[counter:counter + chunk_size].iterator():
        yield model
    counter += chunk_size

While this is memory efficient (memory usage is bounded by chunk_size, independent of the table size), it's really poor in terms of speed: as the OFFSET grows, both MySQL and PostgreSQL (and likely most DBs) will start choking and slowing down.
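
For illustration, one hedged way to see the OFFSET/LIMIT that a slice turns into (the exact SQL depends on the backend):

qs = Model.objects.all()
# Printing the query attribute shows the SQL Django will send for a slice;
# the large OFFSET below is what makes deep pages slow.
print(qs[100000:101000].query)
# e.g. SELECT ... FROM "app_model" LIMIT 1000 OFFSET 100000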

A better solution

A better solution is available in this post by Thierry Schellenbach. It filters on the PK, which is way faster than offsetting (how much faster probably depends on the DB).

import gc

chunksize = 1000  # rows fetched per query
pk = 0
last_pk = qs.order_by('-pk')[0].pk
queryset = qs.order_by('pk')
while pk < last_pk:
    for row in queryset.filter(pk__gt=pk)[:chunksize]:
        pk = row.pk
        yield row
    gc.collect()

This is starting to get satisfactory. Now Memory = O(C), and Speed ~= O(N)

Issues with the "better" solution

The better solution only works when the PK is available in the QuerySet. Unfortunately, that's not always the case, in particular when the QuerySet contains combinations of distinct (group_by) and/or values (ValuesQuerySet).

For that situation the "better solution" cannot be used.
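
A hedged illustration of that situation (the category field is hypothetical):

# values() + distinct() yields dicts that only contain the selected
# columns, so there is no pk to order by or filter on, and adding one
# would change the distinct/group-by semantics.
qs = Model.objects.values('category').distinct()
first = qs[0]   # a dict like {'category': ...}, not a model instance
# first.pk     -> AttributeError: nothing to chunk on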

Can we do better?

Now I'm wondering if we can go faster and avoid the issue regarding QuerySets without PK. Maybe using something that I found in other answers, but only in pure SQL: using cursors.

Since I'm quite bad with raw SQL, in particular in Django, here comes the real question:

How can we build a better Django QuerySet iterator for large tables?

My take from what I've read is that we should use server-side cursors (apparently (see references) using a standard Django Cursor would not achieve the same result, because by default both python-MySQL and psycopg connectors cache the results).

Would this really be a faster (and/or more efficient) solution?

Can this be done using raw SQL in django? Or should we write specific python code depending on the database connector?

Server Side cursors in PostgreSQL and in MySQL
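
As a hedged sketch of that direction (not a drop-in Django QuerySet solution), this is roughly what a server-side cursor looks like with psycopg2; the DSN and SQL are placeholders:

import psycopg2

def stream_rows(dsn, sql, batch_size=2000):
    conn = psycopg2.connect(dsn)
    try:
        # Giving the cursor a name makes psycopg2 declare a server-side
        # cursor, so rows are fetched in batches instead of all at once.
        cur = conn.cursor(name="large_table_cursor")
        cur.itersize = batch_size
        cur.execute(sql)
        for row in cur:
            yield row
    finally:
        conn.close()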

That's as far as I could get for the moment...

a Django chunked_iterator()

Now, of course, the best would be to have this method work like queryset.iterator(), rather than iterate(queryset), and be part of django core or at least a pluggable app.

Update Thanks to "T" in the comments for finding a django ticket that carries some additional information. Differences in connector behaviors make it so that probably the best solution would be to create a specific chunked method rather than transparently extending iterator (sounds like a good approach to me). An implementation stub exists, but there hasn't been any work in a year, and it does not look like the author is ready to jump on that yet.
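
As a rough, hedged sketch of what such a chunked method might look like (the name and signature are mine, not the ticket's), assuming the queryset has an orderable integer pk:

def chunked(queryset, chunk_size=1000):
    """Yield rows of `queryset` in pk order with bounded memory."""
    last_pk = None
    qs = queryset.order_by('pk')
    while True:
        page = qs if last_pk is None else qs.filter(pk__gt=last_pk)
        rows = list(page[:chunk_size])   # one bounded query per loop
        if not rows:
            return
        for row in rows:
            yield row
        last_pk = rows[-1].pk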

Additional Refs:

  1. Why does MYSQL higher LIMIT offset slow the query down?
  2. How can I speed up a MySQL query with a large offset in the LIMIT clause?
  3. http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
  4. postgresql: offset + limit gets to be very slow
  5. Improving OFFSET performance in PostgreSQL
  6. http://www.depesz.com/2011/05/20/pagination-with-fixed-order/
  7. How to get a row-by-row MySQL ResultSet in python
  8. Server Side Cursor in MySQL

Edits:

Django 1.6 is adding persistent database connections

Django Database Persistent Connections

This should facilitate, under some conditions, using cursors. Still, it's beyond my current skills (and the time I have to learn) how to implement such a solution.
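
For reference, a hedged example of the persistent-connection setting Django 1.6 introduces (CONN_MAX_AGE; the database name and value here are illustrative):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',        # hypothetical database name
        'CONN_MAX_AGE': 600,   # keep the connection open for 10 minutes
    }
}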

Also, the "better solution" definitely does not work in all situations and cannot be used as a generic approach, only a stub to be adapted case by case...

Solution

If all you want to do is iterate over everything in the table once, the following is very resource-efficient and far faster than the basic iterator. Note that paging by primary key is necessary for an efficient implementation, because of the linear cost of the OFFSET operation.

def table_iterator(model, page_size=10000):
    # Assumes an integer, auto-incrementing primary key; gaps in the pk
    # sequence just produce smaller (or empty) pages, not missed rows.
    try:
        max_pk = model.objects.all().order_by("-pk")[0].pk
    except IndexError:
        return  # empty table
    pages = max_pk // page_size + 1  # integer division
    for page_num in range(pages):
        lower = page_num * page_size
        # Page by pk range instead of OFFSET, which degrades linearly.
        page = model.objects.filter(pk__gte=lower, pk__lt=lower + page_size)
        for obj in page:
            yield obj

Usage looks like:

for obj in table_iterator(Model):
    # do stuff
