Compare lines in 2 text files
Problem description
I have two large text files (200,000+ lines) in CSV format. I need to compare them line by line, but the fields may be switched within each line.
Example file A.csv:
AAA,BBB,,DDD
EEE,,GGG,HHH
III,JJJ,KKK,LLL
Example file B.csv:
AAA,,BBB,DDD
EEE,,GGG,HHH
LLL,KKK,JJJ,III
So for my purposes, A.csv and B.csv should be "identical" even though the fields are switched in the first and last lines. Since the fields in each file might be in a different order, the usual options like grep or diff won't work.
Basically, I think I need to write something that reads a line from A.csv and the corresponding line from B.csv, and checks whether all fields are present in both lines, independent of their order. Alternatively, something that sorts the fields after reading the lines.
Recommended answer
You can normalize the fields for the comparison only (here by lowercasing and sorting them), without affecting the data itself.
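import csv

with open('big1.csv', newline='') as i, open('big2.csv', newline='') as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        # Read the matching line from the second file
        # (raises StopIteration if big2.csv has fewer lines than big1.csv).
        lineb = next(b)
        # Compare the fields lowercased and sorted, so order and case are ignored.
        if sorted(map(str.lower, linea)) != sorted(map(str.lower, lineb)):
            print('{} does not match {}'.format(linea, lineb))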
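If the two files might not have the same number of lines, the same idea can be wrapped in a small generator that also reports line numbers. This is only a sketch under a few assumptions: the names normalize and mismatched_lines are made up for illustration, the lowercasing is carried over from the snippet above (drop it if case matters), and the sample data from the question is read from memory via io.StringIO rather than from real files.

```python
import csv
import io
from itertools import zip_longest


def normalize(row):
    """Return the row's fields lowercased and sorted, so order and case are ignored."""
    return sorted(field.lower() for field in row)


def mismatched_lines(file_a, file_b):
    """Yield (line_number, row_a, row_b) for every pair of rows that differ.

    A row missing from the shorter file is reported as None.
    """
    rows = zip_longest(csv.reader(file_a), csv.reader(file_b))
    for number, (row_a, row_b) in enumerate(rows, start=1):
        if row_a is None or row_b is None or normalize(row_a) != normalize(row_b):
            yield number, row_a, row_b


# The sample data from the question, held in memory instead of on disk.
a_csv = io.StringIO("AAA,BBB,,DDD\nEEE,,GGG,HHH\nIII,JJJ,KKK,LLL\n")
b_csv = io.StringIO("AAA,,BBB,DDD\nEEE,,GGG,HHH\nLLL,KKK,JJJ,III\n")
differences = list(mismatched_lines(a_csv, b_csv))
print(differences)  # prints [] because every line matches once the fields are sorted
```

Because the rows are read and compared one pair at a time, neither file has to fit in memory, which should matter for files with 200,000+ lines. To use it on real files, pass two file objects opened with newline='' in place of the StringIO objects.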