Compare two columns in two different txt files (Windows)


Problem description

I have two very large .txt files (~500k rows each). I need to take two columns out of both files (by column name) and compare them against each other, similar to the way LEFT JOIN works in SQL. The output, written to a third txt/csv file, should contain every combination of values of those two columns that appears in the first file but not in the second.

I will need to automate this process, so I should be able to call it from the command line. If anyone can point me in the right direction, I would really appreciate it.

UPDATE: The format of the two files is exactly the same, and the needed columns are never empty.

Example

First file:

DataSource; Customer; City; Mapping; SugGroup
ARTS; John; London; Johny; LondonCustomers
ARTS; Chris; Munich; Jons; Germany
FEDS; Mary; London; James; Germany

Second file:

DataSource; Customer; City; Mapping; SugGroup
ARTS; Chris; Munich; Jons; Germany
FEDS; Mary; London; James; Germany

What I need to do is take two columns, Customer and Mapping, and find the rows that are in the first file but not in the second one. So in the given example, the output file would look like this:

Output file:

Customer; Mapping
John; Johny

Recommended answer

I'd advise against Import-CSV, as it doesn't work too well with files in the 100+ MB range. Well, it works, but it is dog slow.

Create a hash table. Read the second file row by row. Concatenate the two columns and store the result in the hash table. Then read the first file row by row and concatenate its two columns to get a similar key. Check whether the hash table contains that key. If it doesn't, save the data to the third file.
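The steps above can be sketched roughly as follows. This is only an illustration under some assumptions, not the answerer's code: the function names (Get-ColumnIndex, Get-RowKey, Export-UnmatchedPairs) and their parameters are made up for the example, a .NET HashSet[string] stands in for the hash table, the delimiter is assumed to be ';' with optional spaces around the values, and the first line of each file is assumed to be the header.

```powershell
# Sketch only: all function and parameter names below are hypothetical.

# Find the positions of the wanted columns in a header line.
function Get-ColumnIndex {
    param([string]$HeaderLine, [string[]]$Columns, [char]$Delimiter)
    $names = @($HeaderLine.Split($Delimiter) | ForEach-Object { $_.Trim() })
    foreach ($c in $Columns) {
        $i = [array]::IndexOf($names, $c)
        if ($i -lt 0) { throw "Column '$c' not found in header" }
        $i
    }
}

# Build the lookup key for one data row, e.g. "John;Johny".
function Get-RowKey {
    param([string]$Line, [int[]]$Index, [char]$Delimiter)
    $fields = $Line.Split($Delimiter)
    # Trim because the sample rows have a space after each delimiter.
    $parts = foreach ($i in $Index) { $fields[$i].Trim() }
    $parts -join $Delimiter
}

# Write the column pairs found in the first file but not in the second.
# Pass full paths: .NET file APIs ignore the current PowerShell location.
function Export-UnmatchedPairs {
    param(
        [string]$FirstPath,
        [string]$SecondPath,
        [string]$OutPath,
        [string[]]$Columns = @('Customer', 'Mapping'),
        [char]$Delimiter = ';'
    )

    # Pass 1: collect every key of the second file.
    $seen = New-Object 'System.Collections.Generic.HashSet[string]'
    $reader = [IO.File]::OpenText($SecondPath)
    try {
        $headerLine = $reader.ReadLine()
        [int[]]$idx = Get-ColumnIndex -HeaderLine $headerLine -Columns $Columns -Delimiter $Delimiter
        while ($null -ne ($line = $reader.ReadLine())) {
            if (-not $line.Trim()) { continue }   # skip blank lines
            $key = Get-RowKey -Line $line -Index $idx -Delimiter $Delimiter
            [void]$seen.Add($key)
        }
    } finally { $reader.Close() }

    # Pass 2: stream the first file and keep the keys that were never seen.
    $reader = [IO.File]::OpenText($FirstPath)
    $writer = $null
    try {
        $writer = New-Object IO.StreamWriter($OutPath)
        $headerLine = $reader.ReadLine()
        [int[]]$idx = Get-ColumnIndex -HeaderLine $headerLine -Columns $Columns -Delimiter $Delimiter
        $writer.WriteLine($Columns -join $Delimiter)
        while ($null -ne ($line = $reader.ReadLine())) {
            if (-not $line.Trim()) { continue }
            $key = Get-RowKey -Line $line -Index $idx -Delimiter $Delimiter
            if (-not $seen.Contains($key)) { $writer.WriteLine($key) }
        }
    } finally {
        if ($writer) { $writer.Close() }
        $reader.Close()
    }
}
```

Saved in a script, this could be called from the command line with something like Export-UnmatchedPairs -FirstPath C:\temp\first.txt -SecondPath C:\temp\second.txt -OutPath C:\temp\out.txt (the paths are placeholders). Note that the sketch writes the pairs without a space after the delimiter, and that a pair occurring several times in the first file is written several times.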

For a code example, please provide sample input and desired output.

You don't specify whether rows can have the same Customer and Mapping but differ in other data. Assuming that's not the case, just calculate a hash for the whole row, like so:

# ArrayList with an initial capacity of 500,000 elements
$secondFile = new-object Collections.ArrayList(500000)
# Init MD5 tools
$md5 = new-object Security.Cryptography.MD5CryptoServiceProvider
$utf8 = new-object Text.UTF8Encoding
# Read the 2nd large file
$reader = [IO.File]::OpenText("c:\temp\secondFileBig.txt")
$i=0
while( ($line = $reader.ReadLine()) -ne $null) {
    # Get MD5 for each row and store it in the arraylist
    $hash = [System.BitConverter]::ToString($md5.ComputeHash($utf8.GetBytes($line)))
    $secondFile.Add($hash) | out-null
    if(++$i % 25000 -eq 0) {write-host -nonewline "."}
}
$reader.Close()
# Sort the arraylist so that it can be binarysearched
$secondFile.Sort()

With dummy data of about 500,000 rows, creating the hashes takes some 50 seconds on my computer. Now, let's read the other file and check line by line whether it has the same content.

# Open and read the file row by row
$reader = [IO.File]::OpenText("c:\temp\firstFileBig.txt")

while( ($line = $reader.ReadLine()) -ne $null) {
    # Get MD5 for current row
    $hash = [System.BitConverter]::ToString($md5.ComputeHash($utf8.GetBytes($line)))
    # If the row already exists in the other file, BinarySearch returns its
    # zero-based index in O(log n) time; a negative result means it was not found
    if($secondFile.BinarySearch($hash) -le -1) {
        "Not found: $line"
    }
}
$reader.Close()

Running the second part with dummy test data is way faster, as one can find out with Measure-Command. It is left as an exercise to the reader to figure out how to extract the relevant elements.
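For anyone who wants to check the timing themselves, here is a minimal sketch of how Measure-Command could be wrapped around the lookup phase. The in-memory dummy strings are an assumption standing in for the MD5 strings of a real file, and the row count of 100,000 is arbitrary.

```powershell
# Dummy data standing in for the hashes of the second file (assumption).
$hashes = New-Object Collections.ArrayList(100000)
foreach ($n in 1..100000) { [void]$hashes.Add("row-$n") }
# BinarySearch requires a sorted list.
$hashes.Sort()

# Measure-Command returns a TimeSpan; output produced inside the block is discarded.
$elapsed = Measure-Command {
    foreach ($n in 1..100000) { [void]$hashes.BinarySearch("row-$n") }
}
"Lookup phase: {0:N2} seconds" -f $elapsed.TotalSeconds
```

The same pattern works for the hashing phase: put the first loop of the answer inside the script block instead.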
