Deduplicating ArangoDB document collection

Question

I'm sure there is an easy and fast way to do this but it's escaping me. I have a large dataset that has some duplicate records, and I want to get rid of the duplicates. (the duplicates are uniquely identified by one property, but the rest of the document should be identical as well).

I've attempted to create a new collection that only has unique values a few different ways, but they are all quite slow. For example:

FOR doc IN Documents
    COLLECT docId = doc.myId, doc2 = doc
    INSERT doc2 IN Documents2

FOR doc IN Documents
    LET existing = (FOR doc2 IN Documents2
        FILTER doc.myId == doc2.myId
        RETURN doc2)
    UPDATE existing WITH doc IN Documents2

or (this gives me a "violated unique constraint" error)

FOR doc IN Documents
    UPSERT {myId: doc.myId}
    INSERT doc
    UPDATE doc IN Documents2

Answer

TL;DR

It does not take that long to de-duplicate the records and write them to another collection (less than 60 seconds), at least on my desktop machine (Windows 10, Intel 6700K 4x4.0GHz, 32GB RAM, Evo 850 SSD).

Certain queries require proper indexing however, or they will last forever. Indexes require some memory, but compared to the needed memory during query execution for grouping the records, it is negligible. If you're short of memory, performance will suffer because the operating system needs to swap data between memory and mass storage. This is especially a problem with spinning disks, not so much with fast flash storage devices.
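
Grouping is where most of the memory goes, so it can help to first check how many duplicates you actually have. This is only a sketch using the collection and attribute names from your question (Documents, myId); adjust them to your data:

FOR doc IN Documents
    COLLECT myId = doc.myId WITH COUNT INTO count
    FILTER count > 1
    COLLECT WITH COUNT INTO duplicatedIds
    RETURN duplicatedIds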

I generated 2.2 million records with 5-20 random attributes and 160 chars of gibberish per attribute. In addition, every record has an attribute myid. 1.87m records have a unique id, 60k myids exist twice, and 70k three times. The collection size was reported as 4.83GB:

// 1..2000000: 300s
// 1..130000: 20s
// 1..70000: 10s
FOR i IN 1..2000000
    LET randomAttributes = MERGE(
        FOR j IN 1..FLOOR(RAND() * 15) + 5
            RETURN { [CONCAT("attr", j)]: RANDOM_TOKEN(160) }
    )
    INSERT MERGE(randomAttributes, {myid: i}) INTO test1

Memory consumption before starting ArangoDB was at 3.4GB, after starting 4.0GB, and around 8.8GB after loading the test1 source collection.

Reading from test1 and inserting all documents (2.2m) into test2 took 20s on my system, with a memory peak of ~17.6GB:

FOR doc IN test1
    INSERT doc INTO test2

Grouping by myid without writing took approx. 9s for me, with 9GB RAM peak during query:

LET result = (
    FOR doc IN test1
        COLLECT myid = doc.myid
        RETURN 1
)
RETURN LENGTH(result)

Failed grouping

I tried your COLLECT docId = doc.myId, doc2 = doc approach on a dataset with just 3 records and one duplicate myid. It showed that the query does not actually group/remove duplicates. I therefore tried to find alternative queries.
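
To see why that approach cannot deduplicate: COLLECT groups by the combination of all of its group criteria, and full documents always differ at least in their system attributes (_key, _id, _rev), so grouping on doc2 = doc puts every document into its own group. A minimal sketch with inline documents (the values are made up for illustration):

FOR doc IN [
    {_key: "a", myId: 1, value: "x"},
    {_key: "b", myId: 1, value: "x"},
    {_key: "c", myId: 2, value: "y"}
]
    COLLECT docId = doc.myId, doc2 = doc
    RETURN doc2
    // returns all 3 documents instead of 2, because each doc2 value is distinct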

To group duplicate myids together but retain the possibility to access the full documents, COLLECT ... INTO can be used. I simply picked the first document of every group to remove redundant myids. The query took about 40s for writing the 2m records with unique myid attribute to test2. I didn't measure memory consumption accurately, but I saw different memory peaks spanning 14GB to 21GB. Maybe truncating the test collections and re-running the queries increases the required memory because of some stale entries that get in the way somehow (compaction / key generation)?

FOR doc IN test1
    COLLECT myid = doc.myid INTO groups
    INSERT groups[0].doc INTO test2

Grouping with subquery

The following query showed a more stable memory consumption, peaking at 13.4GB:

FOR doc IN test1
    COLLECT myid = doc.myid
    LET doc2 = (
        FOR doc3 IN test1
            FILTER doc3.myid == myid
            LIMIT 1
            RETURN doc3
    )
    INSERT doc2[0] INTO test2

Note however that it required a hash index on myid in test1 to achieve a query execution time of ~38s. Otherwise the subquery will cause millions of collection scans and take ages.

Instead of storing the whole documents that fell into a group, we can assign just the _id to a variable and KEEP it so that we can look up the document bodies using DOCUMENT():

FOR doc IN test1
    LET d = doc._id
    COLLECT myid = doc.myid INTO groups KEEP d
    INSERT DOCUMENT(groups[0].d) INTO test2

Memory usage: 8.1GB after loading the source collection, 13.5GB peak during the query. It only took 30 seconds for the 2m records!

Instead of KEEP I also tried a projection out of curiosity:

FOR doc IN test1
    COLLECT myid = doc.myid INTO groups = doc._id
    INSERT DOCUMENT(groups[0]) INTO test2

RAM was at 8.3GB after loading test1, and the peak at 17.8GB (there were actually two heavy spikes during the query execution, both going over 17GB). It took 35s to complete for the 2m records.

I tried something with UPSERT, but saw some strange results. It turned out to be an oversight in ArangoDB's upsert implementation. v3.0.2 contains a fix and I get correct results now:

FOR doc IN test1
    UPSERT {myid: doc.myid}
    INSERT doc
    UPDATE {} IN test2

It took 40s to process with a (unique) hash index on myid in test2, with a RAM peak around 13.2GB.

I first copied all documents from test1 to test2 (2.2m records), then I tried to REMOVE just the duplicates in test2:

FOR doc IN test2
    COLLECT myid = doc.myid INTO keys = doc._key
    LET allButFirst = SLICE(keys, 1) // or SHIFT(keys)
    FOR k IN allButFirst
        REMOVE k IN test2

Memory was at 8.2GB (with only test2 loaded) and went up to 13.5GB during the query. It took roughly 16 seconds to delete the duplicates (200k).

The following query groups the documents by myid and counts how often every id occurs. Run against the target collection test2, the result should be {"1": 2000000}, otherwise there are still duplicates. I double-checked the results of the queries above and everything checked out.

FOR doc IN test2
    COLLECT myid = doc.myid WITH COUNT INTO count
    COLLECT c = count WITH COUNT INTO cc
    RETURN {[c]: cc}

Conclusion

The performance appears to be reasonable with ArangoDB v3.0, although it may degrade if not enough RAM is available. The different queries completed roughly within the same time, but showed different RAM usage characteristics. For certain queries, indexes are necessary to avoid high computational complexity (here: full collection scans; 2,200,000,000,000 reads in the worst case?).

Can you try my presented solutions on your data and check what the performance is on your machine?
