Optimizing performance of Postgresql database writes in Django?


Problem description


I've got a Django 1.1 app that needs to import data from some big JSON files on a daily basis. To give an idea, one of these files is over 100 MB and has 90K entries that are imported into a PostgreSQL database.

The problem I'm experiencing is that it takes a really long time for the data to be imported, i.e. on the order of hours. I would have expected it to take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar Stack Exchange questions, and the solutions proposed suggest using the transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.

As I say, I'm wondering if I'm doing anything wrong (e.g. are the batches I commit too big? too many foreign keys? ...), or whether I should just move away from Django models for this function and use the DB API directly. Any ideas or suggestions?

Here are the basic models I'm dealing with when importing data (I've removed some of the fields from the original code for the sake of simplicity):

class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release) 

class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))

And here's the bit of code that seems to take ages to complete:

@transaction.commit_manually
def add_translations(translation_data):

    releases = Release.objects.all()

    # There are 5 releases
    for release in releases:

        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:
            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue

            translation = Translation(
                template=Template.objects.get(
                            sourcepackage=lp_translation['sourcepackage'],
                            template_name=lp_translation['template_name'],
                            translation_domain=\
                                lp_translation['translation_domain'],
                            release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
                )

            translation.save()

        # I realize I should commit every n entries
        transaction.commit()

        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files

        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)

        for language in languages:
            templates = Template.objects.filter(release=release)

            for template in templates:
                try:
                    translation = Translation.objects.get(
                                    template=template,
                                    language=language,
                                    release=release)
                except Translation.DoesNotExist:
                    translation = Translation(template=template,
                                              language=language,
                                              release=release,
                                              translated=0,
                                              untranslated=0)
                    translation.save()

            transaction.commit()
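
As the comment in the code above notes, one straightforward change on the Django side is to commit every n rows instead of once per release, so a single transaction never grows to 90K inserts. A minimal sketch of that pattern, reusing the models and fields from the question; the function name, the release argument and the chunk size of 1000 are arbitrary assumptions:

from django.db import transaction

@transaction.commit_manually
def add_translations_chunked(translation_data, release, chunk_size=1000):
    # Same per-entry work as above, but flush a commit every chunk_size rows.
    for i, lp_translation in enumerate(translation_data):
        try:
            language = Language.objects.get(code=lp_translation['language'])
        except Language.DoesNotExist:
            continue

        Translation(
            template=Template.objects.get(
                sourcepackage=lp_translation['sourcepackage'],
                template_name=lp_translation['template_name'],
                translation_domain=lp_translation['translation_domain'],
                release=release),
            translated=lp_translation['translated'],
            language=language,
            release=release,
        ).save()

        if (i + 1) % chunk_size == 0:
            transaction.commit()

    transaction.commit()  # commit the final partial chunk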

Solution

Going through your app and processing every single row is a lot slower than loading the data directly into the server, even with optimized code. And inserting or updating one row at a time is, again, a lot slower than processing everything at once.

If the import files are available locally on the server, you can use COPY. Otherwise you can use the meta-command \copy from psql, the standard command-line interface. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.
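
For instance, flattening the JSON entries into a CSV file that COPY can read might look roughly like this (a sketch only; the field names are the ones used in the question's code, and both file paths are placeholders):

import csv
import json

# Read the JSON export; assumption: the file is a JSON array of objects.
with open('/path/to/translations.json') as f:
    entries = json.load(f)

# Write one CSV row per entry, in a fixed column order that the COPY
# target table is expected to match.
with open('/tmp/translations.csv', 'wb') as out:
    writer = csv.writer(out)
    for e in entries:
        writer.writerow([e['sourcepackage'], e['template_name'],
                         e['translation_domain'], e['language'],
                         e['translated']])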

If you just want to add new rows to a table:

COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);
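
If you would rather drive that COPY from your Python/Django code than from psql, psycopg2's copy_expert streams the file over the client connection, which is the programmatic equivalent of \copy and avoids the server-side file (and the superuser rights it usually needs). A sketch, with placeholder connection parameters, file path and table name:

import psycopg2

# Stream the CSV through a client connection; the file only needs to be
# readable by the application, not by the database server.
conn = psycopg2.connect("dbname=mydb user=myuser")
cur = conn.cursor()
with open('/tmp/translations.csv') as f:
    cur.copy_expert("COPY tbl FROM STDIN WITH CSV", f)
conn.commit()
conn.close()

If you prefer to reuse the app's connection instead of opening a new one, the raw psycopg2 connection is typically available as django.db.connection.connection once a cursor has been opened.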

Or if you want to INSERT / UPDATE some rows:

First off: use enough RAM for temp_buffers (at least temporarily, if you can) so the temp table does not have to be written to disk. Be aware that this has to be done before any temporary table is accessed in the session, and that SET LOCAL only lasts for the current transaction.

SET LOCAL temp_buffers='128MB';

The in-memory representation takes somewhat more space than the on-disk representation of the data. So for a 100 MB JSON file (minus the JSON overhead, plus some Postgres overhead) 128 MB may or may not be enough. But you don't have to guess; just do a test run and measure it:

SELECT pg_size_pretty(pg_total_relation_size('tmp_x'));

Create the temporary table:

CREATE TEMP TABLE tmp_x (id int, col_a int, col_b text);

Or, to just duplicate the structure of an existing table:

CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;

Copy values (should take seconds, not hours):

COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);

From there, INSERT / UPDATE with plain old SQL. If you are planning complex queries, you may even want to add an index or two on the temp table and run ANALYZE:

ANALYZE tmp_x;

For instance, to update existing rows, matched by id:

UPDATE tbl
SET    col_a = tmp_x.col_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;

Finally, drop the temporary table:

DROP TABLE tmp_x;

Or have it dropped automatically at the end of the session.
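
Tying the steps together from the application side, the whole staged load can run as one transaction. The sketch below uses psycopg2 directly with the placeholder names from above (tbl, tmp_x, id, col_a); the connection parameters and CSV path are assumptions, and the final INSERT is just one possible way to handle the "insert" half of INSERT / UPDATE:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")
cur = conn.cursor()

# psycopg2 starts a transaction on the first statement, so SET LOCAL applies
# to this transaction; it must run before the temp table is touched.
cur.execute("SET LOCAL temp_buffers = '128MB';")

# ON COMMIT DROP removes the staging table at commit time; a plain temp
# table would instead be dropped at the end of the session.
cur.execute("CREATE TEMP TABLE tmp_x (id int, col_a int) ON COMMIT DROP;")

with open('/tmp/data.csv') as f:
    cur.copy_expert("COPY tmp_x FROM STDIN WITH CSV", f)

cur.execute("ANALYZE tmp_x;")

# Update rows that already exist ...
cur.execute("""
    UPDATE tbl
    SET    col_a = tmp_x.col_a
    FROM   tmp_x
    WHERE  tbl.id = tmp_x.id;
""")

# ... then insert the ones that are still missing.
cur.execute("""
    INSERT INTO tbl (id, col_a)
    SELECT t.id, t.col_a
    FROM   tmp_x t
    LEFT   JOIN tbl ON tbl.id = t.id
    WHERE  tbl.id IS NULL;
""")

conn.commit()   # tmp_x is dropped here
conn.close()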
