Optimizing performance of Postgresql database writes in Django?

Problem description

I've got a Django 1.1 app that needs to import data from some big json files on a daily basis. To give an idea, one of these files is over 100 Mb and has 90K entries that are imported to a Postgresql database.

The problem I'm experiencing is that it takes really a long time for the data to be imported, i.e. in the order of hours. I would have expected it would take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar stackexchange questions, and the solutions proposed suggest using transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.
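
For reference, the batching pattern mentioned here, i.e. Django 1.1's transaction.commit_manually with a commit every N saves rather than one commit per row or one giant commit, might look roughly like this (a minimal sketch; the function name, the rows argument, and the batch size are purely illustrative):

from django.db import transaction

# Minimal sketch of batched commits under Django 1.1's manual transaction
# management: commit every `batch_size` saves instead of after every .save()
# or only once at the very end. Names and batch size are illustrative.
@transaction.commit_manually
def save_in_batches(rows, batch_size=1000):
    for i, obj in enumerate(rows):
        obj.save()
        if (i + 1) % batch_size == 0:
            transaction.commit()
    transaction.commit()  # commit the final partial batch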

As I say, I'm wondering if I'm doing anything wrong (e.g. batches to commit are too big?, too many foreign keys?...), or whether I should just go away from Django models for this function and use the DB API directly. Any ideas or suggestions?

Here are the basic models I'm dealing with when importing data (I've removed some of the fields in the original code for the sake of simplicity):

class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release) 

class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))

And here's the bit of code that seems to take ages to complete:

@transaction.commit_manually
def add_translations(translation_data, lp_translation):

    releases = Release.objects.all()

    # There are 5 releases
    for release in releases:

        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:
            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue

            translation = Translation(
                template=Template.objects.get(
                            sourcepackage=lp_translation['sourcepackage'],
                            template_name=lp_translation['template_name'],
                            translation_domain=
                                lp_translation['translation_domain'],
                            release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
                )

            translation.save()

        # I realize I should commit every n entries
        transaction.commit()

        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files

        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)

        for language in languages:
            templates = Template.objects.filter(release=release)

            for template in templates:
                try:
                    translation = Translation.objects.get(
                                    template=template,
                                    language=language,
                                    release=release)
                except Translation.DoesNotExist:
                    translation = Translation(template=template,
                                              language=language,
                                              release=release,
                                              translated=0,
                                              untranslated=0)
                    translation.save()

            transaction.commit()

Recommended answer

Going through your app and processing every single row is a lot slower than loading the data directly to the server, even with optimized code. Also, inserting / updating one row at a time is a lot slower than processing everything at once.

If the import files are available locally to the server, you can use COPY. Otherwise you could use the \copy meta-command of the standard psql client. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.
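
As a concrete example of that flattening step, something along these lines would turn the JSON entries into a CSV file that COPY (or \copy) can read. This is only a sketch: the file names are placeholders, the field names are the lp_translation keys shown in the question, and it assumes the file holds one top-level JSON array of entries.

import csv
import json

# Sketch: flatten the JSON entries into a CSV file suitable for COPY.
# File names are placeholders; the column order must match the target table.
with open('translations.json') as src, open('translations.csv', 'w') as dst:
    writer = csv.writer(dst)
    for entry in json.load(src):  # assumes one top-level JSON array
        writer.writerow([
            entry['sourcepackage'],
            entry['template_name'],
            entry['translation_domain'],
            entry['language'],
            entry['translated'],
        ])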

If you just want to add new rows to a table:

COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);

Or if you want to INSERT / UPDATE some rows:

First off: Use enough RAM for temp_buffers (at least temporarily, if you can) so the temp table does not have to be written to disk. Be aware that this has to be done before accessing any temporary tables in this session.

SET LOCAL temp_buffers='128MB';

The in-memory representation takes somewhat more space than the on-disk representation of the data. So for a 100 MB JSON file, minus the JSON overhead, plus some Postgres overhead, 128 MB may or may not be enough. But you don't have to guess, just do a test run and measure it:

SELECT pg_size_pretty(pg_total_relation_size('tmp_x'));

Create the temp table:

CREATE TEMP TABLE tmp_x (id int, val_a int, val_b text);

Or, to just duplicate the structure of an existing table:

CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;

Copy values (should take seconds, not hours):

COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);

From there INSERT / UPDATE with plain old SQL. As you are planning a complex query, you may even want to add an index or two on the temp table and run ANALYZE:

ANALYZE tmp_x;

For instance, to update existing rows, matched by id:

UPDATE tbl
SET    val_a = tmp_x.val_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;

Finally, drop the temp table:

DROP TABLE tmp_x;

Or have it dropped automatically at the end of the session.
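
Since the import runs inside a Django app, the whole temp-table workflow could also be driven from Python rather than from psql. The sketch below uses psycopg2 directly (the driver behind Django's PostgreSQL backend) with the generic tbl/tmp_x layout from the answer; the connection string, file path, and column names are placeholder assumptions, not something prescribed above, and the CSV file is expected to match tbl's columns in order.

import psycopg2

# Sketch: run the COPY + temp-table workflow from Python via psycopg2.
conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
try:
    cur = conn.cursor()

    # psycopg2 opens a transaction implicitly, so SET LOCAL lasts until commit.
    cur.execute("SET LOCAL temp_buffers = '128MB';")

    # Temp table with the same structure as the target table.
    cur.execute("CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;")

    # Client-side equivalent of psql's \copy: stream a local CSV file to the
    # server with COPY ... FROM STDIN.
    with open('/local/path/to/file.csv') as f:
        cur.copy_expert("COPY tmp_x FROM STDIN WITH (FORMAT csv)", f)

    # Optional: an index and fresh statistics help the big UPDATE/INSERT below.
    cur.execute("CREATE INDEX tmp_x_id_idx ON tmp_x (id);")
    cur.execute("ANALYZE tmp_x;")

    # Update rows that already exist in the target table ...
    cur.execute("""
        UPDATE tbl
        SET    val_a = tmp_x.val_a,
               val_b = tmp_x.val_b
        FROM   tmp_x
        WHERE  tbl.id = tmp_x.id;
    """)

    # ... and insert the ones that are still missing.
    cur.execute("""
        INSERT INTO tbl (id, val_a, val_b)
        SELECT id, val_a, val_b
        FROM   tmp_x
        WHERE  NOT EXISTS (SELECT 1 FROM tbl WHERE tbl.id = tmp_x.id);
    """)

    cur.execute("DROP TABLE tmp_x;")  # or let it disappear at session end
    conn.commit()
finally:
    conn.close()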
