How to compare two large CSV files and get the difference file

Question

I need to compare two huge CSV files row by row and write the rows that differ to a separate file. A row from one file can appear anywhere in the other file, and I need to compare entire rows. Any pointers?

Recommended answer

One common approach is to calculate a hash code for each row of one file (preferably the smaller one) and put that entire file into a hash table keyed by hash code. This hash table serves as the index of the smaller file.
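The indexing step could be sketched in Python roughly as follows. This is only an illustration: the function name `build_index` is hypothetical, rows are assumed to be plain strings, and Python's built-in `hash()` stands in for whatever hash function you prefer.

```python
from collections import defaultdict
from typing import Iterable


def build_index(rows: Iterable[str]) -> dict[int, list[str]]:
    """Map each row's hash code to the list of rows that produced it."""
    index: dict[int, list[str]] = defaultdict(list)
    for row in rows:
        # Rows sharing a hash code (true duplicates or mere collisions)
        # land in the same bucket. Note that hash() of a str is only
        # stable within one Python process, which is all we need here.
        index[hash(row)].append(row)
    return index


# Tiny in-memory example; a real run would stream the smaller file.
index = build_index(["a,1", "b,2", "a,1"])
```

Keeping a list per hash code, rather than a single row, is what lets the later comparison tell real duplicates apart from collisions.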

After that, walk through the larger file. For each row, calculate its hash and look it up in the index. If the index has no such hash code, the row is a difference. Otherwise (more than one row may share the same hash code), compare the source row in full against every colliding row in the hash table to see whether there is a duplicate.

If there is no duplicate, the row in the source file is unique after all; push it to the output.

Otherwise, if there is a duplicate, you may wish to remove that duplicate from the hash table and skip the input row. The two rows from the two files have then been detected as equal and cancel each other.
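As a rough Python sketch of this lookup-and-cancel step: the helper name `cancel_match` is made up for illustration, and the index is assumed to be a dictionary from hash code to a list of rows.

```python
def cancel_match(index: dict[int, list[str]], row: str) -> bool:
    """Return True and remove one equal row from the index if `row` has a match."""
    bucket = index.get(hash(row))
    if not bucket:
        # Unknown hash code (or an emptied bucket): the row is a difference.
        return False
    # An equal hash code is not proof of equality, so compare full rows.
    for i, candidate in enumerate(bucket):
        if candidate == row:
            del bucket[i]  # cancel exactly one copy, then stop
            return True
    return False


index = {hash("a,1"): ["a,1"]}
first = cancel_match(index, "a,1")   # a copy exists and is cancelled
second = cancel_match(index, "a,1")  # no copies are left in the index
```

Removing only one copy per match means that a row occurring twice in one file and once in the other still shows up once in the difference.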

When you finish walking through the larger file, you need to decide what to do with the rows remaining in the hash table. You probably want to push all of them to the output as well, because those are the rows that did not exist in the other file.

Now I'll try to outline the pseudocode:

dict = new dictionary<code, list<row>>

-- Indexing phase
foreach row in file1
    code = hash(row)
    if dict.contains(code) then
        dict[code].add(row)
    else
        dict[code] = new list(row)

-- Comparison phase
foreach row in file2
    code = hash(row)
    bool unique = true
    if dict.contains(code) then
        foreach indexedRow in dict[code]
            if indexedRow is the same as row then
                begin
                    unique = false
                    remove indexedRow from dict[code]
                    break   -- cancel only one matching row
                end
    if unique then
        push row to output

-- Finalization phase
foreach list in dict
    foreach row in list
        push row to output
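For comparison, here is a minimal runnable Python sketch of the three phases. It swaps the explicit hash-code dictionary for `collections.Counter`, because Python dictionaries already hash their keys and resolve collisions by full comparison. The function name `diff_files` and the file names are hypothetical, and each physical line is assumed to be one row (quoted CSV fields containing line breaks would need the `csv` module instead).

```python
import os
import tempfile
from collections import Counter


def diff_files(small_path: str, large_path: str, out_path: str) -> None:
    """Write rows that appear in only one of the two files to out_path."""
    # Indexing phase: count every row of the smaller file.
    with open(small_path, newline="", encoding="utf-8") as f:
        index = Counter(line.rstrip("\r\n") for line in f)

    with open(large_path, newline="", encoding="utf-8") as f, \
         open(out_path, "w", newline="", encoding="utf-8") as out:
        # Comparison phase: stream the larger file one row at a time.
        for line in f:
            row = line.rstrip("\r\n")
            if index[row] > 0:
                index[row] -= 1          # matched: the two rows cancel
            else:
                out.write(row + "\n")    # present only in the larger file
        # Finalization phase: leftovers exist only in the smaller file.
        for row, count in index.items():
            for _ in range(count):
                out.write(row + "\n")


# Small demonstration with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    a, b, out = (os.path.join(tmp, n) for n in ("a.csv", "b.csv", "diff.csv"))
    with open(a, "w", encoding="utf-8") as f:
        f.write("1,x\n2,y\n2,y\n")
    with open(b, "w", encoding="utf-8") as f:
        f.write("2,y\n3,z\n1,x\n")
    diff_files(a, b, out)
    with open(out, encoding="utf-8") as f:
        result = f.read().splitlines()
```

In the demonstration, `3,z` is reported because it exists only in the second file, and one `2,y` is reported because the first file holds one more copy of it than the second.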

The greatest quality of this solution is that its expected run time complexity is O(M + N), where M and N are the numbers of rows in the two files, assuming the hash function spreads rows well enough that collisions are rare. Its drawback is that the index takes O(min(M, N)) memory.
