How to compare two large CSV files and get the difference file


Question

I need to compare two huge CSV files row by row and write the rows that differ to a separate file. A row in one file can appear anywhere in the other file, and I need to compare entire rows. Any pointers?

Recommended answer

One common approach is to calculate a hash code for each row in one of the files (preferably the smaller one) and put the entire file into a hash table keyed by that code. This will be the index of the smaller file.

After that, walk through the larger file. For each row, calculate its hash and look it up in the index. If there is no such hash code there, then this row is a difference. Otherwise, if the hash code is there (possibly with more than one row sharing it), compare the source row in full against all colliding rows in the hash table to see whether there is a duplicate.
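To see why the full comparison is needed after a hash hit, here is a minimal Python sketch. The function names (`weak_hash`, `build_index`, `is_in_index`) and the deliberately poor hash are hypothetical, chosen only to force collisions; a real implementation would use a much better hash function.

```python
from collections import defaultdict

def weak_hash(row: str) -> int:
    # Deliberately poor hash (illustration only): sums the bytes,
    # so rows containing the same characters in another order collide.
    return sum(row.encode("utf-8")) % 8

def build_index(rows):
    # Map each hash code to the list of rows that produced it.
    index = defaultdict(list)
    for row in rows:
        index[weak_hash(row)].append(row)
    return index

def is_in_index(index, row):
    # A matching hash code is only a hint; confirm with a full row comparison.
    return row in index.get(weak_hash(row), [])

index = build_index(["1,ab", "2,cd"])
same_code = weak_hash("1,ab") == weak_hash("1,ba")  # the two rows collide
found_exact = is_in_index(index, "1,ab")            # genuinely present
found_collision = is_in_index(index, "1,ba")        # same code, different row
```

Here `"1,ba"` lands in the same bucket as `"1,ab"`, yet it is still reported as absent because the rows themselves differ.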

If there is no duplicate, the row in the source file is again unique, so push it to the output.

Otherwise, if there is a duplicate, you may wish to remove that duplicate from the hash table and skip the input row. That means the two rows from the two files have been detected as equal and cancel each other out.
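The one-for-one cancellation described above amounts to a multiset difference. As a rough illustration, the sketch below uses Python's `collections.Counter` in place of explicit hash buckets (a swapped-in technique, not the structure the answer describes); `multiset_diff` is a hypothetical name and the rows are held in memory only to keep the example short.

```python
from collections import Counter

def multiset_diff(rows_a, rows_b):
    # Count how many times each row occurs in the first input.
    remaining = Counter(rows_a)
    only_in_b = []
    for row in rows_b:
        if remaining[row] > 0:
            # An equal row exists: the pair cancels, one occurrence is consumed.
            remaining[row] -= 1
        else:
            only_in_b.append(row)
    # Whatever was never cancelled exists only in the first input.
    only_in_a = list(remaining.elements())
    return only_in_a, only_in_b

only_in_a, only_in_b = multiset_diff(
    ["x,1", "x,1", "y,2"],
    ["x,1", "z,3"],
)
```

With these inputs, one `"x,1"` cancels against its twin, while the second `"x,1"` and `"y,2"` remain as rows found only in the first input.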

When you finish walking through the larger file, you need to decide what to do with the rows remaining in the hash table. You probably want to push all of them to the output as well, because those are the rows that didn't exist in the other file.

Now I'll try to outline the pseudocode:

dict = new dictionary<code, list<row>>

-- Indexing phase
foreach row in file1
    code = hash(row)
    if dict.contains(code) then
        dict[code].add(row)
    else
        dict[code] = new list(row)

-- Comparison phase
foreach row in file2
    code = hash(row)
    bool unique = true
    if dict.contains(code) then
        foreach indexedRow in dict[code]
            if indexedRow is the same as row then
                begin
                    unique = false
                    remove indexedRow from dict[code]
                    break   -- cancel exactly one matching row
                end
    if unique then
        push row to output

-- Finalization phase
foreach list in dict
    foreach row in list
        push row to output
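The pseudocode above might translate to Python roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: `diff_files` and its parameter names are hypothetical, each physical line is treated as one row (so quoted CSV fields containing line breaks are not handled), files are assumed to be UTF-8, and Python's built-in `hash` stands in for the hash function.

```python
from collections import defaultdict

def diff_files(indexed_path, scanned_path, out_path):
    """Write the rows that occur in only one of the two files to out_path."""
    # Indexing phase: hash code -> list of rows from the (ideally smaller) file.
    index = defaultdict(list)
    with open(indexed_path, encoding="utf-8", newline="") as f:
        for line in f:
            row = line.rstrip("\r\n")
            index[hash(row)].append(row)

    with open(out_path, "w", encoding="utf-8", newline="") as out:
        # Comparison phase: stream the other file one row at a time.
        with open(scanned_path, encoding="utf-8", newline="") as f:
            for line in f:
                row = line.rstrip("\r\n")
                bucket = index.get(hash(row))
                if bucket and row in bucket:
                    # Full comparison succeeded: cancel one indexed row.
                    bucket.remove(row)
                else:
                    out.write(row + "\n")
        # Finalization phase: rows still indexed were never matched.
        for bucket in index.values():
            for row in bucket:
                out.write(row + "\n")
```

Only the indexed file is held in memory; the scanned file and the output are streamed, so passing the smaller file as `indexed_path` keeps memory use down.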

The greatest strength of this solution is that its expected run time complexity is O(M + N), where M and N are the number of rows in each of the files, assuming the hash function spreads rows evenly enough that the collision lists stay short. Its drawback is that the index takes O(min(M, N)) memory, provided the smaller file is the one that gets indexed.
