Compare lines in 2 text files with different number of columns
Problem description
This is an addition to my previous question (Compare lines in 2 text files).
Consider these 2 example files:
A.csv:
AAA,BBB,CCC
DDD,,EEE
GGG,HHH,III
B.csv:
AAA,,BBB,CCC
EEE,,DDD,,
,,GGG,III,HHH
I want these to be identical, even though they have different field orders and number of columns.
This is what I have so far:
#!/usr/bin/python
import sys
import csv
f1 = sys.argv[1]
f2 = sys.argv[2]
with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        if sorted(map(str.lower, linea)) != sorted(map(str.lower, lineb)):
            print('{} does not match {}'.format(linea, lineb))
Update:
Here is what I ended up with (thanks @keksnicoh):
#!/usr/bin/python
import sys
import csv
f1 = sys.argv[1]
f2 = sys.argv[2]
with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        seta = set([x for x in linea if len(x) > 0])
        setb = set([x for x in lineb if len(x) > 0])
        if (seta != setb):
            print('Line {} does not match: {}'.format(a.line_num, seta ^ setb))
The issue I face now is: how to deal with duplicates, for example:
Example file A.csv:
1,2,,
1,2,2,3,4
Example file B.csv:
1,2,2,2
1,2,3,4
The script above considers the files to be identical, but they are not. From searching Stack Overflow, it seems that I cannot use a set but have to use a list. But then I lose the advantage of using sets, which is not having to worry about the order of fields.
How can I modify my code to consider duplicate entries as well?
You could map the lines to a set and filter the empty strings. Now calculate the symmetric difference of those sets and check the length of that new set.
#!/usr/bin/python
import sys
import csv
f1 = sys.argv[1]
f2 = sys.argv[2]
with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        seta = set([x for x in linea if len(x) > 0])
        setb = set([x for x in lineb if len(x) > 0])
        print(len(seta ^ setb) == 0)
You can also write this more compactly:
    for seta in (set([x for x in l if len(x) > 0]) for l in a):
        setb = set([x for x in next(b) if len(x) > 0])
        print(len(seta ^ setb) == 0)
UPDATE
To keep things simple, you can of course just check
    seta == setb
Sorry for the confusion.
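The set-based comparison cannot see duplicates, because a set keeps each value only once. One possible way to take duplicates into account while still ignoring field order is to compare multisets with collections.Counter from the standard library. This is a minimal sketch and not part of the original answer: the helper names field_counts and compare_rows are made up for illustration, and the sample data from the question is read from in-memory strings instead of sys.argv so that the snippet runs on its own.

```python
import csv
import io
from collections import Counter


def field_counts(row):
    """Count how often each non-empty field occurs in one CSV row."""
    return Counter(field for field in row if field)


def compare_rows(rows_a, rows_b):
    """Return (line_number, only_in_a, only_in_b) for every row pair that differs."""
    mismatches = []
    # zip stops at the shorter input, so extra trailing rows are not reported.
    for line_num, (row_a, row_b) in enumerate(zip(rows_a, rows_b), start=1):
        count_a, count_b = field_counts(row_a), field_counts(row_b)
        if count_a != count_b:
            # Counter subtraction keeps only positive counts, so each side
            # holds the fields (with multiplicity) that the other side lacks.
            mismatches.append((line_num, count_a - count_b, count_b - count_a))
    return mismatches


# The duplicate example from the question, held in memory instead of on disk.
file_a = io.StringIO("1,2,,\n1,2,2,3,4\n")
file_b = io.StringIO("1,2,2,2\n1,2,3,4\n")

for line_num, only_a, only_b in compare_rows(csv.reader(file_a), csv.reader(file_b)):
    print('Line {} does not match: only in A {}, only in B {}'.format(
        line_num, dict(only_a), dict(only_b)))
```

With the sample data, both lines are reported as mismatches, while rows such as AAA,BBB,CCC and AAA,,BBB,CCC still compare as equal. Comparing sorted lists of the non-empty fields would give the same yes/no result; the Counter version additionally shows which values are surplus on each side.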