What are some viable strategies for detecting duplicates in a large JSON file when you need to store the duplicates?


Problem description

I have an extremely large set of data stored in JSON that is too large to load into memory. The JSON fields contain data about users and some metadata; however, there are certainly some duplicates. I would like to go through this file and curate it, merging the duplicates in a specific way.

However, I am not sure what the best practice for doing this is. I thought of using a Bloom filter, but a Bloom filter won't tell me what a duplicate is a duplicate of, so I cannot merge exactly. Is there anything I could read on best practices for something like this? What are some industry standards? All of this needs to be done in Python.
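For context, here is a minimal hand-rolled sketch that illustrates the limitation mentioned above: a Bloom filter can only answer "probably seen before" or "definitely not seen", and it keeps no reference back to the record that set its bits, so there is nothing to merge with. The class name and parameters are illustrative, not from any particular library.

```python
import hashlib

class TinyBloomFilter:
    """Minimal Bloom filter sketch: answers membership queries only.

    It cannot return the original record, so it can flag a probable
    duplicate but cannot say *what* it duplicates.
    """

    def __init__(self, size_bits=1_000_000, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from independent-ish hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_contains(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = TinyBloomFilter()
bf.add("user-123")
print(bf.probably_contains("user-123"))  # True (false positives are possible)
print(bf.probably_contains("user-999"))  # almost certainly False
```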

Recommended answer

You can partition the records by hash value into smaller sets that fit into memory, remove duplicates within each set, and then reassemble the sets into one file.
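A minimal sketch of that approach in Python, assuming the input is in JSON Lines form (one object per line) and that each record carries a "user_id" field that identifies duplicates; the file names, partition count, and merge policy below are placeholders to adapt:

```python
import hashlib
import json
import os

NUM_PARTITIONS = 64  # tune so that each partition fits comfortably in memory

def partition_key(record):
    # Assumption: duplicates share the same "user_id" value.
    key = str(record["user_id"]).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_PARTITIONS

def partition_file(input_path, tmp_dir):
    """Scan the big file once, streaming each record into its partition file."""
    os.makedirs(tmp_dir, exist_ok=True)
    outputs = [open(os.path.join(tmp_dir, f"part-{i}.jsonl"), "w")
               for i in range(NUM_PARTITIONS)]
    try:
        with open(input_path) as f:
            for line in f:  # assumption: one JSON object per line
                record = json.loads(line)
                outputs[partition_key(record)].write(json.dumps(record) + "\n")
    finally:
        for out in outputs:
            out.close()

def merge_records(a, b):
    # Placeholder merge policy: keep fields from `a`, fill gaps from `b`.
    merged = dict(b)
    merged.update(a)
    return merged

def dedupe_partitions(tmp_dir, output_path):
    """Each partition fits in memory, so duplicates can be merged with a dict."""
    with open(output_path, "w") as out:
        for i in range(NUM_PARTITIONS):
            seen = {}
            with open(os.path.join(tmp_dir, f"part-{i}.jsonl")) as f:
                for line in f:
                    record = json.loads(line)
                    uid = record["user_id"]
                    seen[uid] = merge_records(seen[uid], record) if uid in seen else record
            for record in seen.values():
                out.write(json.dumps(record) + "\n")

# Hypothetical file names for illustration.
partition_file("users.jsonl", "partitions")
dedupe_partitions("partitions", "users_deduped.jsonl")
```

The partitioning pass only holds one record at a time, and the dedup pass only holds one partition's records, so peak memory is bounded by the partition size rather than the size of the whole file. Records with the same key always hash to the same partition, which is what makes the per-partition merge safe.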

