What are some viable strategies for detecting duplicates in a large JSON file when you need to store the duplicates?


Problem description

I have an extremely large set of data stored in JSON that is too large to load into memory. The JSON fields contain data about users and some metadata; however, there are certainly some duplicates. I would like to go through this file and curate it, merging the duplicates in a specific way.

However, I am not sure what the best practice for doing this is. I thought of using a Bloom filter, but a Bloom filter won't tell me what a duplicate is a duplicate of, so I cannot merge exactly. Is there anything I could read on best practices for something like this? What are some industry standards? All of this needs to be done in Python.
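For context, here is a minimal hand-rolled sketch that illustrates the limitation mentioned above: a Bloom filter can only answer "probably seen before" or "definitely not seen", and it keeps no reference back to the record that set its bits, so there is nothing to merge with. The class name and parameters are illustrative, not from any particular library.

```python
import hashlib

class TinyBloomFilter:
    """Minimal Bloom filter sketch: answers membership queries only.

    It cannot return the original record, so it can flag a probable
    duplicate but cannot say *what* it duplicates.
    """

    def __init__(self, size_bits=1_000_000, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from independent-ish hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_contains(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = TinyBloomFilter()
bf.add("user-123")
print(bf.probably_contains("user-123"))  # True (false positives are possible)
print(bf.probably_contains("user-999"))  # almost certainly False
```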

Recommended answer

You can partition the records by hash value into smaller sets that fit into memory, remove duplicates within each set, and then reassemble the sets into one file.
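A minimal sketch of that approach in Python, assuming the input is in JSON Lines form (one object per line) and that each record carries a "user_id" field that identifies duplicates; the file names, partition count, and merge policy below are placeholders to adapt:

```python
import hashlib
import json
import os

NUM_PARTITIONS = 64  # tune so that each partition fits comfortably in memory

def partition_key(record):
    # Assumption: duplicates share the same "user_id" value.
    key = str(record["user_id"]).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_PARTITIONS

def partition_file(input_path, tmp_dir):
    """Scan the big file once, streaming each record into its partition file."""
    os.makedirs(tmp_dir, exist_ok=True)
    outputs = [open(os.path.join(tmp_dir, f"part-{i}.jsonl"), "w")
               for i in range(NUM_PARTITIONS)]
    try:
        with open(input_path) as f:
            for line in f:  # assumption: one JSON object per line
                record = json.loads(line)
                outputs[partition_key(record)].write(json.dumps(record) + "\n")
    finally:
        for out in outputs:
            out.close()

def merge_records(a, b):
    # Placeholder merge policy: keep fields from `a`, fill gaps from `b`.
    merged = dict(b)
    merged.update(a)
    return merged

def dedupe_partitions(tmp_dir, output_path):
    """Each partition fits in memory, so duplicates can be merged with a dict."""
    with open(output_path, "w") as out:
        for i in range(NUM_PARTITIONS):
            seen = {}
            with open(os.path.join(tmp_dir, f"part-{i}.jsonl")) as f:
                for line in f:
                    record = json.loads(line)
                    uid = record["user_id"]
                    seen[uid] = merge_records(seen[uid], record) if uid in seen else record
            for record in seen.values():
                out.write(json.dumps(record) + "\n")

# Hypothetical file names for illustration.
partition_file("users.jsonl", "partitions")
dedupe_partitions("partitions", "users_deduped.jsonl")
```

The partitioning pass only holds one record at a time, and the dedup pass only holds one partition's records, so peak memory is bounded by the partition size rather than the size of the whole file. Records with the same key always hash to the same partition, which is what makes the per-partition merge safe.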

