Compare lines in 2 text files with different number of columns

This is an addition to my previous question (Compare lines in 2 text files).

Consider these 2 example files:

A.csv:

AAA,BBB,CCC
DDD,,EEE
GGG,HHH,III

B.csv:

AAA,,BBB,CCC
EEE,,DDD,,
,,GGG,III,HHH

I want these to be identical, even though they have different field orders and number of columns.

This is what I have so far:

#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        if sorted(map(str.lower, linea)) != sorted(map(str.lower, lineb)):
            print('{} does not match {}'.format(linea, lineb))

Update:

Here is what I ended up with (thanks @keksnicoh):

#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        seta = set([x for x in linea if len(x) > 0])
        setb = set([x for x in lineb if len(x) > 0])
        if (seta != setb):
            print('Line {} does not match: {}'.format(a.line_num, seta ^ setb))

The issue I face now is: how to deal with duplicates, for example:

Example file A.csv:

1,2,,
1,2,2,3,4

Example file B.csv:

1,2,2,2
1,2,3,4

The script above considers the files to be identical, but they are not. From searching Stack Overflow, it seems that I cannot use a set and have to use a list instead. But then I lose the advantage of using sets, which is not having to worry about the order of the fields.
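A minimal illustration of that dilemma, using the first line of each example file with the empty fields already removed. The names `row_a` and `row_b` are made up for this sketch.

```python
# First line of A.csv and B.csv from the example, empty fields filtered out.
row_a = ['1', '2']
row_b = ['1', '2', '2', '2']

# A set keeps each value once, so the repeated '2' disappears
# and the two rows look identical.
print(set(row_a) == set(row_b))   # True

# Plain lists keep the repeats, so these rows differ ...
print(row_a == row_b)             # False

# ... but lists also care about order, which is not wanted here.
print(['1', '2'] == ['2', '1'])   # False
```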

How can I modify my code to consider duplicate entries as well?

Solution

You could map each line to a set and filter out the empty strings. Then calculate the symmetric difference of the two sets and check whether the resulting set is empty.

#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        seta = set([x for x in linea if len(x) > 0])
        setb = set([x for x in lineb if len(x) > 0])
        print(len(seta^setb)==0)

You can also write this more compactly:

for seta in (set([x for x in l if len(x) > 0]) for l in a):
    setb = set([x for x in next(b) if len(x) > 0])
    print(len(seta^setb)==0)

UPDATE

To keep things simple, one can of course just check

seta == setb

Sorry for the confusion...
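The set comparison above still treats repeated fields as one. One possible way to take duplicates into account, which is not part of the original answer, is to compare `collections.Counter` objects: a counter ignores field order like a set does but remembers how often each value occurs. The sketch below is only that, a sketch. The helper names `field_counts` and `mismatched_lines` are hypothetical, the example data is read from in-memory strings instead of the files named on the command line, and `itertools.zip_longest` is an added assumption so that a file with extra rows is reported rather than raising `StopIteration`.

```python
import csv
import io
from collections import Counter
from itertools import zip_longest


def field_counts(row):
    """Count the non-empty fields of a row; a missing row counts as empty."""
    return Counter(x for x in (row or []) if x)


def mismatched_lines(file_a, file_b):
    """Yield (row_number, counts_a, counts_b) for each pair of rows that differs.

    row_number counts CSV rows starting at 1, which can differ from
    csv.reader.line_num when a quoted field spans several physical lines.
    """
    rows = zip_longest(csv.reader(file_a), csv.reader(file_b))
    for number, (row_a, row_b) in enumerate(rows, start=1):
        counts_a, counts_b = field_counts(row_a), field_counts(row_b)
        if counts_a != counts_b:
            yield number, counts_a, counts_b


# The duplicate example from the question, held in memory instead of on disk.
a = io.StringIO("1,2,,\n1,2,2,3,4\n")
b = io.StringIO("1,2,2,2\n1,2,3,4\n")
result = list(mismatched_lines(a, b))
for number, counts_a, counts_b in result:
    print('Line {} does not match: {} vs {}'.format(
        number, dict(counts_a), dict(counts_b)))
```

With real files, the same function could be called as `mismatched_lines(i, j)` inside the `with open(f1) as i, open(f2) as j:` block. Comparing `sorted(...)` lists of the non-empty fields would work as well; a counter just makes the per-value counts available for the error message.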
