Find duplicate URLs in MongoDB
Question
I have a database of news articles and I am trying to do some cleanup. I want to find all duplicate documents, and I think the best way to do this is by comparing the url field. My documents are structured as follows:
{
_id:
author:
title:
description:
url:
urlToImage:
publishedAt:
content:
summarization:
source_id:
}
Any help is greatly appreciated.
Recommended answer
Assume a collection whose documents have a name field (used here in place of url) that contains duplicate values. The two aggregations below return output that can be used for further processing. I hope you find this useful.
{ _id: 1, name: "jack" },
{ _id: 2, name: "john" },
{ _id: 3, name: "jim" },
{ _id: 4, name: "john" },
{ _id: 5, name: "john" },
{ _id: 6, name: "jim" }
Note that "john" has 3 occurrences and "jim" has 2.
(1) This aggregation returns the names that have duplicates (more than one occurrence):
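db.collection.aggregate( [
  // Count how many documents share each name.
  {
    $group: {
      _id: "$name",
      count: { $sum: 1 }
    }
  },
  // Collect the names whose count exceeds 1. "$DUMMY" refers to a field that
  // does not exist, so nothing is pushed for names that occur only once.
  {
    $group: {
      _id: "duplicate_names",
      names: { $push: { $cond: [ { $gt: [ "$count", 1 ] }, "$_id", "$DUMMY" ] } }
    }
  }
] )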
Output:
{ "_id" : "duplicate_names", "names" : [ "john", "jim" ] }
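Applied to the question's url field, the same idea can be sketched as a small pipeline builder. This is a minimal sketch rather than the answer's own code: the helper name duplicateValuesPipeline and the collection name articles are assumptions, and it swaps the $cond trick for a $match stage, so it returns one result document per duplicated value instead of a single array.

```javascript
// Hypothetical helper (not part of the original answer): builds a pipeline
// that lists every value of `field` occurring in more than one document.
function duplicateValuesPipeline(field) {
  return [
    // Count the documents that share each value of the field.
    { $group: { _id: "$" + field, count: { $sum: 1 } } },
    // Keep only the values that occur more than once.
    { $match: { count: { $gt: 1 } } },
    // Most frequently duplicated values first.
    { $sort: { count: -1 } }
  ];
}

const pipeline = duplicateValuesPipeline("url");

// In mongosh, assuming the collection is called `articles`:
// db.articles.aggregate(pipeline)
```

Each result document has the duplicated value in _id and the number of occurrences in count, which makes it easy to inspect the worst offenders before removing anything.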
(2) The following aggregation returns only the _id values of the duplicate documents, treating the first document in each group as the one to keep. For example, the name "jim" has _id values 3 and 6, so the output contains only the id of the duplicate, i.e., 6.
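db.collection.aggregate( [
  // Group by name and collect the _id of every document in the group.
  {
    $group: {
      _id: "$name",
      count: { $sum: 1 },
      ids: { $push: "$_id" }
    }
  },
  // Skip the first _id of each group (the document to keep); the remaining
  // ids, up to 9999 per group, are the duplicates.
  {
    $group: {
      _id: "duplicate_ids",
      ids: { $push: { $slice: [ "$ids", 1, 9999 ] } }
    }
  },
  // Flatten the array of arrays into a single array of ids.
  {
    $project: {
      ids: {
        $reduce: {
          input: "$ids",
          initialValue: [ ],
          in: { $concatArrays: [ "$$this", "$$value" ] }
        }
      }
    }
  }
] )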
Output:
{ "_id" : "duplicate_ids", "ids" : [ 6, 4, 5 ] }
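To make the "keep the first, flag the rest" rule concrete, here is a minimal in-memory sketch of the same logic in plain JavaScript, run against the sample documents above. The function name duplicateIds is hypothetical, and the sketch walks the documents in array order, so it yields the same set of ids as the aggregation but not necessarily in the same order.

```javascript
// Sample documents from the answer.
const docs = [
  { _id: 1, name: "jack" },
  { _id: 2, name: "john" },
  { _id: 3, name: "jim" },
  { _id: 4, name: "john" },
  { _id: 5, name: "john" },
  { _id: 6, name: "jim" }
];

// Hypothetical helper: returns the _id of every document whose `field`
// value was already seen in an earlier document.
function duplicateIds(documents, field) {
  const seen = new Set();
  const duplicates = [];
  for (const doc of documents) {
    if (seen.has(doc[field])) {
      duplicates.push(doc._id); // a later occurrence: flag it
    } else {
      seen.add(doc[field]); // first occurrence: keep it
    }
  }
  return duplicates;
}

const ids = duplicateIds(docs, "name"); // [4, 5, 6]

// Filter that matches exactly the flagged documents.
const filter = { _id: { $in: ids } };

// In mongosh, one might review and then remove them:
// db.collection.find(filter)
// db.collection.deleteMany(filter)
```

Running the find before the deleteMany is a cheap way to confirm that only the intended documents match.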