Comparing two large text files of URLs in Java using only external memory?
Problem description
I have the following situation:
- Text file A of URLs
- Text file B of URLs
Each file is around 4 GB.
I need to compute:
- All URLs in A that are not in B
- All URLs in B that are not in A
All of the Java diff examples I'm finding online load the entire list into memory (either with a Map or using an MMap solution). My system has no swap and lacks the memory to do this without external-memory techniques.
Does anyone know of a solution?
This project can sort huge files without eating up tons of memory: https://github.com/lemire/externalsortinginjava
I am looking for something similar, but for generating diffs. I'm going to start by trying to implement this using that project as a baseline.
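One way the sort-then-diff idea could look, as a minimal sketch rather than a tested solution: it assumes both files have already been sorted externally (for example with the project above), one URL per line, in the same order that `String.compareTo` produces. The class name `SortedUrlDiff` and the output file parameters are hypothetical. The two sorted files are then streamed side by side, so only one line from each is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: merge-style diff over two pre-sorted files.
final class SortedUrlDiff {

    /**
     * Writes lines found only in sortedA to onlyInA and lines found only in
     * sortedB to onlyInB. Both inputs must be sorted in String.compareTo order.
     */
    static void diff(Path sortedA, Path sortedB, Path onlyInA, Path onlyInB) throws IOException {
        try (BufferedReader a = Files.newBufferedReader(sortedA, StandardCharsets.UTF_8);
             BufferedReader b = Files.newBufferedReader(sortedB, StandardCharsets.UTF_8);
             BufferedWriter outA = Files.newBufferedWriter(onlyInA, StandardCharsets.UTF_8);
             BufferedWriter outB = Files.newBufferedWriter(onlyInB, StandardCharsets.UTF_8)) {
            String la = a.readLine();
            String lb = b.readLine();
            while (la != null && lb != null) {
                int cmp = la.compareTo(lb);
                if (cmp < 0) {            // la cannot appear later in B
                    writeLine(outA, la);
                    la = nextDistinct(a, la);
                } else if (cmp > 0) {     // lb cannot appear later in A
                    writeLine(outB, lb);
                    lb = nextDistinct(b, lb);
                } else {                  // present in both files: skip it
                    la = nextDistinct(a, la);
                    lb = nextDistinct(b, lb);
                }
            }
            // Whatever remains on one side has no counterpart on the other.
            while (la != null) { writeLine(outA, la); la = nextDistinct(a, la); }
            while (lb != null) { writeLine(outB, lb); lb = nextDistinct(b, lb); }
        }
    }

    private static void writeLine(BufferedWriter out, String line) throws IOException {
        out.write(line);
        out.newLine();
    }

    // Reads forward past repeated copies of the current line.
    private static String nextDistinct(BufferedReader r, String current) throws IOException {
        String next;
        do {
            next = r.readLine();
        } while (next != null && next.equals(current));
        return next;
    }
}
```

Duplicate URLs within a file are collapsed, so each output lists every differing URL once. If the sort step uses a different ordering than `String.compareTo`, the comparison here would need to match it.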
Recommended answer
If the system has enough storage, you can do this via a database. For example:
- Create an H2 or SQLite database (data stored on disk; allocate as much cache as the system can afford).
- Load the text files into tables A and B (create an index on the 'url' column).
select url from A where URL not in (select distinct url from B)
select url from B where URL not in (select distinct url from A)