删除 Django ORM 中的重复项——多行 [英] Remove duplicates in Django ORM -- multiple rows
问题描述
我有一个包含四个字段的模型.如何从我的数据库中删除重复的对象?
I have a model that has four fields. How do I remove duplicate objects from my database?
Daniel Roseman 对这个问题的回答似乎很合适,但我不确定如何将其扩展到每个对象有四个字段要比较的情况.
Daniel Roseman's answer to this question seems appropriate, but I'm not sure how to extend this to situation where there are four fields to compare per object.
谢谢,
W.
推荐答案
def remove_duplicated_records(model, fields):
"""
Removes records from `model` duplicated on `fields`
while leaving the most recent one (biggest `id`).
"""
duplicates = model.objects.values(*fields)
# override any model specific ordering (for `.annotate()`)
duplicates = duplicates.order_by()
# group by same values of `fields`; count how many rows are the same
duplicates = duplicates.annotate(
max_id=models.Max("id"), count_id=models.Count("id")
)
# leave out only the ones which are actually duplicated
duplicates = duplicates.filter(count_id__gt=1)
for duplicate in duplicates:
to_delete = model.objects.filter(**{x: duplicate[x] for x in fields})
# leave out the latest duplicated record
# you can use `Min` if you wish to leave out the first record
to_delete = to_delete.exclude(id=duplicate["max_id"])
to_delete.delete()
你不应该经常这样做.改为对数据库使用 unique_together
约束.
You shouldn't do it often. Use unique_together
constraints on database instead.
这会在数据库中留下最大 id
的记录.如果要保留原始记录(第一个),请使用models.Min
稍微修改代码.你也可以使用完全不同的字段,比如创建日期什么的.
This leaves the record with the biggest id
in the DB. If you want to keep the original record (first one), modify the code a bit with models.Min
. You can also use completely different field, like creation date or something.
底层 SQL
注释 django ORM 时,对查询中使用的所有模型字段使用 GROUP BY
语句.因此使用 .values()
方法.GROUP BY
将对具有相同值的所有记录进行分组.重复的(unique_fields
的多个id
)稍后会在 .filter()
生成的 HAVING
语句中被过滤掉code> 在带注释的 QuerySet
上.
When annotating django ORM uses GROUP BY
statement on all model fields used in the query. Thus the use of .values()
method. GROUP BY
will group all records having those values identical. The duplicated ones (more than one id
for unique_fields
) are later filtered out in HAVING
statement generated by .filter()
on annotated QuerySet
.
SELECT
field_1,
…
field_n,
MAX(id) as max_id,
COUNT(id) as count_id
FROM
app_mymodel
GROUP BY
field_1,
…
field_n
HAVING
count_id > 1
重复的记录稍后会在 for
循环中删除,但每个组中最频繁的记录除外.
The duplicated records are later deleted in the for
loop with an exception to the most frequent one for each group.
空的 .order_by()
可以肯定的是,在聚合 QuerySet
之前添加一个空的 .order_by()
调用总是明智的.
Just to be sure, it's always wise to add an empty .order_by()
call before aggregating a QuerySet
.
用于对QuerySet
进行排序的字段也包含在GROUP BY
语句中.空 .order_by()
覆盖模型的 Meta
中声明的列,结果它们不包含在 SQL 查询中(例如,按日期默认排序可能会破坏结果).
The fields used for ordering the QuerySet
are also included in GROUP BY
statement. Empty .order_by()
overrides columns declared in model's Meta
and in result they're not included in the SQL query (e.g. default sorting by date can ruin the results).
您目前可能不需要覆盖它,但有人可能会在以后添加默认排序,从而在不知道的情况下破坏您宝贵的删除重复代码.是的,我确定你有 100% 的测试覆盖率......
You might not need to override it at the current moment, but someone might add default ordering later and therefore ruin your precious delete-duplicates code not even knowing that. Yes, I'm sure you have 100% test coverage…
只需添加空的 .order_by()
以确保安全.;-)
Just add empty .order_by()
to be safe. ;-)
交易
当然,您应该考虑在单个事务中完成所有操作.
Of course you should consider doing it all in a single transaction.
https://docs.djangoproject.com/en/3.2/topics/db/transactions/#django.db.transaction.atomic
这篇关于删除 Django ORM 中的重复项——多行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!