删除 Django ORM 中的重复项——多行 [英] Remove duplicates in Django ORM -- multiple rows

查看:57
本文介绍了删除 Django ORM 中的重复项——多行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含四个字段的模型.如何从我的数据库中删除重复的对象?

I have a model that has four fields. How do I remove duplicate objects from my database?

Daniel Roseman 对这个问题的回答似乎很合适,但我不确定如何将其扩展到每个对象有四个字段要比较的情况.

Daniel Roseman's answer to this question seems appropriate, but I'm not sure how to extend this to situation where there are four fields to compare per object.

谢谢,

W.

推荐答案

def remove_duplicated_records(model, fields):
    """
    Removes records from `model` duplicated on `fields`
    while leaving the most recent one (biggest `id`).
    """
    duplicates = model.objects.values(*fields)

    # override any model specific ordering (for `.annotate()`)
    duplicates = duplicates.order_by()

    # group by same values of `fields`; count how many rows are the same
    duplicates = duplicates.annotate(
        max_id=models.Max("id"), count_id=models.Count("id")
    )

    # leave out only the ones which are actually duplicated
    duplicates = duplicates.filter(count_id__gt=1)

    for duplicate in duplicates:
        to_delete = model.objects.filter(**{x: duplicate[x] for x in fields})

        # leave out the latest duplicated record
        # you can use `Min` if you wish to leave out the first record
        to_delete = to_delete.exclude(id=duplicate["max_id"])

        to_delete.delete()

你不应该经常这样做.改为对数据库使用 unique_together 约束.

You shouldn't do it often. Use unique_together constraints on database instead.

这会在数据库中留下最大 id 的记录.如果要保留原始记录(第一个),请使用models.Min 稍微修改代码.你也可以使用完全不同的字段,比如创建日期什么的.

This leaves the record with the biggest id in the DB. If you want to keep the original record (first one), modify the code a bit with models.Min. You can also use completely different field, like creation date or something.

底层 SQL

注释 django ORM 时,对查询中使用的所有模型字段使用 GROUP BY 语句.因此使用 .values() 方法.GROUP BY 将对具有相同值的所有记录进行分组.重复的(unique_fields 的多个id)稍后会在 .filter() 生成的 HAVING 语句中被过滤掉code> 在带注释的 QuerySet 上.

When annotating django ORM uses GROUP BY statement on all model fields used in the query. Thus the use of .values() method. GROUP BY will group all records having those values identical. The duplicated ones (more than one id for unique_fields) are later filtered out in HAVING statement generated by .filter() on annotated QuerySet.

SELECT
    field_1,
    …
    field_n,
    MAX(id) as max_id,
    COUNT(id) as count_id
FROM
    app_mymodel
GROUP BY
    field_1,
    …
    field_n
HAVING
    count_id > 1

重复的记录稍后会在 for 循环中删除,但每个组中最频繁的记录除外.

The duplicated records are later deleted in the for loop with an exception to the most frequent one for each group.

空的 .order_by()

可以肯定的是,在聚合 QuerySet 之前添加一个空的 .order_by() 调用总是明智的.

Just to be sure, it's always wise to add an empty .order_by() call before aggregating a QuerySet.

用于对QuerySet 进行排序的字段也包含在GROUP BY 语句中.空 .order_by() 覆盖模型的 Meta 中声明的列,结果它们不包含在 SQL 查询中(例如,按日期默认排序可能会破坏结果).

The fields used for ordering the QuerySet are also included in GROUP BY statement. Empty .order_by() overrides columns declared in model's Meta and in result they're not included in the SQL query (e.g. default sorting by date can ruin the results).

您目前可能不需要覆盖它,但有人可能会在以后添加默认排序,从而在不知道的情况下破坏您宝贵的删除重复代码.是的,我确定你有 100% 的测试覆盖率......

You might not need to override it at the current moment, but someone might add default ordering later and therefore ruin your precious delete-duplicates code not even knowing that. Yes, I'm sure you have 100% test coverage…

只需添加空的 .order_by() 以确保安全.;-)

Just add empty .order_by() to be safe. ;-)

https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#interaction-with-default-ordering-or-order-by

交易

当然,您应该考虑在单个事务中完成所有操作.

Of course you should consider doing it all in a single transaction.

https://docs.djangoproject.com/en/3.2/topics/db/transactions/#django.db.transaction.atomic

这篇关于删除 Django ORM 中的重复项——多行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆