Remove Duplicates from MongoDB


Problem Description

I have about 5 million documents in MongoDB (with replication), each with 43 fields. How can I remove duplicate documents? I tried:

db.testkdd.ensureIndex({
        duration  : 1 , protocol_type  : 1 , service  : 1 ,
        flag  : 1 , src_bytes  : 1 , dst_bytes  : 1 ,
        land  : 1 , wrong_fragment  : 1 , urgent  : 1 ,
        hot  : 1 , num_failed_logins  : 1 , logged_in  : 1 ,
        num_compromised  : 1 , root_shell  : 1 , su_attempted  : 1 ,
        num_root  : 1 , num_file_creations  : 1 , num_shells  : 1 ,
        num_access_files  : 1 , num_outbound_cmds  : 1 , is_host_login  : 1 ,
        is_guest_login  : 1 , count  : 1 ,  srv_count  : 1 ,
        serror_rate  : 1 , srv_serror_rate  : 1 , rerror_rate  : 1 ,
        srv_rerror_rate  : 1 , same_srv_rate  : 1 , diff_srv_rate  : 1 ,
        srv_diff_host_rate  : 1 , dst_host_count  : 1 , dst_host_srv_count  : 1 ,
        dst_host_same_srv_rate  : 1 , dst_host_diff_srv_rate  : 1 ,
        dst_host_same_src_port_rate  : 1 ,  dst_host_srv_diff_host_rate  : 1 ,
        dst_host_serror_rate  : 1 , dst_host_srv_serror_rate  : 1 ,
        dst_host_rerror_rate  : 1 , dst_host_srv_rerror_rate  : 1 , lable  : 1 
    },
    {unique: true, dropDups: true}
)

Running this code, I get the error "errmsg" : "namespace name generated from index ...":

{
    "ok" : 0,
    "errmsg" : "namespace name generated from index name "project.testkdd.$duration_1_protocol_type_1_service_1_flag_1_src_bytes_1_dst_bytes_1_land_1_wrong_fragment_1_urgent_1_hot_1_num_failed_logins_1_logged_in_1_num_compromised_1_root_shell_1_su_attempted_1_num_root_1_num_file_creations_1_num_shells_1_num_access_files_1_num_outbound_cmds_1_is_host_login_1_is_guest_login_1_count_1_srv_count_1_serror_rate_1_srv_serror_rate_1_rerror_rate_1_srv_rerror_rate_1_same_srv_rate_1_diff_srv_rate_1_srv_diff_host_rate_1_dst_host_count_1_dst_host_srv_count_1_dst_host_same_srv_rate_1_dst_host_diff_srv_rate_1_dst_host_same_src_port_rate_1_dst_host_srv_diff_host_rate_1_dst_host_serror_rate_1_dst_host_srv_serror_rate_1_dst_host_rerror_rate_1_dst_host_srv_rerror_rate_1_lable_1" is too long (127 byte max)",
    "code" : 67
}

How can I solve this problem?

Recommended Answer

The "dropDups" option for index creation was deprecated as of MongoDB 2.6 and removed in MongoDB 3.0. Using it is not a good idea in most cases, because the "removal" is arbitrary: any one of the "duplicates" could be removed, which means what gets "removed" may not be what you actually want removed.

In any case, you are running into an "index length" error here because the namespace name generated from the index name is longer than allowed. Generally speaking, you are not "meant" to index 43 fields in any normal application.
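The limit is easy to see: as the error message shows, the default index name joins every field with its sort direction, and the full namespace "db.collection.$indexName" had to fit in 127 bytes in these older versions. A quick sketch of the calculation, using the field list from the question:

```javascript
// The fields from the question's ensureIndex() call
// ("lable" is spelled as in the original collection).
const fields = [
  "duration", "protocol_type", "service", "flag", "src_bytes",
  "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
  "num_failed_logins", "logged_in", "num_compromised", "root_shell",
  "su_attempted", "num_root", "num_file_creations", "num_shells",
  "num_access_files", "num_outbound_cmds", "is_host_login",
  "is_guest_login", "count", "srv_count", "serror_rate",
  "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
  "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
  "dst_host_srv_count", "dst_host_same_srv_rate",
  "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
  "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
  "dst_host_srv_serror_rate", "dst_host_rerror_rate",
  "dst_host_srv_rerror_rate", "lable"
];

// MongoDB derives the default index name by joining "<field>_<direction>".
const indexName = fields.map(f => f + "_1").join("_");

// The full namespace, as seen in the error message, is
// "<db>.<collection>.$<indexName>" -- far beyond the 127-byte cap.
const namespace = "project.testkdd.$" + indexName;
console.log(namespace.length);
```

Naming the index explicitly (e.g. `{ name: "dedup" }`) would sidestep the length error, but as explained below, this whole approach is the wrong tool for deduplication anyway.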

If you want to remove the "duplicates" from the collection, then your best bet is to run an aggregation query to determine which documents contain "duplicate" data, and then cycle through that list, removing "all but one" of the already "unique" _id values from the target collection. This can be done with "Bulk" operations for maximum efficiency.

NOTE: I find it hard to believe that your documents actually contain 43 "unique" fields. It is likely that "all you need" is to identify just those fields that make the document "unique", and then follow the process outlined below:

var bulk = db.testkdd.initializeOrderedBulkOp(),
    count = 0;

// List "all" fields that make a document "unique" in the `_id`
// I am only listing some for example purposes to follow
db.testkdd.aggregate([
    { "$group": {
        "_id": {
           "duration" : "$duration",
          "protocol_type": "$protocol_type", 
          "service": "$service",
          "flag": "$flag"
        },
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
],{ "allowDiskUse": true}).forEach(function(doc) {
    doc.ids.shift();     // remove first match
    bulk.find({ "_id": { "$in": doc.ids } }).remove();  // removes all $in list
    count++;

    // Execute 1 in 1000 and re-init
    if ( count % 1000 == 0 ) {
       bulk.execute();
       bulk = db.testkdd.initializeOrderedBulkOp();
    }
});

if ( count % 1000 != 0 ) 
    bulk.execute();

If you have a MongoDB version "lower" than 2.6 and don't have Bulk operations, then you can use a standard .remove() inside the loop instead. Note also that, in those versions, .aggregate() does not return a cursor here, so the looping must change to:

db.testkdd.aggregate([
   // pipeline as above
]).result.forEach(function(doc) {
    doc.ids.shift();  
    db.testkdd.remove({ "_id": { "$in": doc.ids } });
});

But do make sure to look closely at your documents, and include "just" the "unique" fields you expect to be part of the grouping _id. Otherwise you will end up removing nothing at all, since there are no duplicates there.
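The keep-first logic at the heart of both loops above can be sketched in plain JavaScript, with no MongoDB required (the toy documents and the two key fields below are made up for illustration):

```javascript
// Toy documents; "duration" and "service" stand in for whichever
// fields define uniqueness. The _id values are just strings here.
const docs = [
  { _id: "a", duration: 0, service: "http" },
  { _id: "b", duration: 0, service: "http" },  // duplicate of "a"
  { _id: "c", duration: 5, service: "smtp" },
  { _id: "d", duration: 0, service: "http" },  // duplicate of "a"
];

// Group _id values by the uniqueness key, like the $group stage.
const groups = new Map();
for (const doc of docs) {
  const key = JSON.stringify([doc.duration, doc.service]);
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(doc._id);
}

// For each group with more than one member, keep the first _id and
// collect the rest for removal -- the doc.ids.shift() step above.
const toRemove = [];
for (const ids of groups.values()) {
  if (ids.length > 1) toRemove.push(...ids.slice(1));
}

console.log(toRemove); // → [ 'b', 'd' ]
```

These are the _id values that the bulk `.remove()` call would delete, leaving exactly one document per distinct key.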

