Compare lines in 2 text files with different number of columns

Views: 154 | Posted: 2017/2/25 19:56:15 | Tags: python, python-2.7, csv


Problem description

 
This is an addition to my previous question (Compare lines in 2 text files).

Consider these 2 example files:

A.csv:

AAA,BBB,CCC
DDD,,EEE
GGG,HHH,III

B.csv:

AAA,,BBB,CCC
EEE,,DDD,,
,,GGG,III,HHH

I want these to be treated as identical, even though they have different field orders and different numbers of columns.

This is what I have so far:
#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        if sorted(map(str.lower, linea)) != sorted(map(str.lower, lineb)):
            print('{} does not match {}'.format(linea, lineb))


Update:

Here is what I ended up with (thanks @keksnicoh):
#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        seta = set([x for x in linea if len(x) > 0])
        setb = set([x for x in lineb if len(x) > 0])
        if (seta != setb):
            print('Line {} does not match: {}'.format(a.line_num, seta ^ setb))
The issue I face now is: how to deal with duplicates, for example:

Example file A.csv:

1,2,,

1,2,2,3,4

Example file B.csv:

1,2,2,2

1,2,3,4

The script above considers the files to be identical, but they are not. From searching Stack Overflow, it seems that I cannot use a set but have to use a list. But then I lose the advantage of using sets, which is not having to worry about the order of fields.
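
To make the problem concrete, here is a minimal sketch that applies the same set-based filtering to the first row of each example file. The rows are hard-coded as lists (an assumption for illustration; the real script reads them with `csv.reader`), and the names `row_a`, `row_b`, and `sets_equal` are made up for this sketch.

```python
# First line of each example file, already split into fields.
row_a = ["1", "2", "", ""]      # A.csv: 1,2,,
row_b = ["1", "2", "2", "2"]    # B.csv: 1,2,2,2

# Same filtering as the script above: drop empty fields, then build a set.
set_a = set(x for x in row_a if len(x) > 0)
set_b = set(x for x in row_b if len(x) > 0)

# Both sets contain only '1' and '2', because a set keeps each value once.
sets_equal = (set_a == set_b)
print(sets_equal)
```

The repeated `2` values in the second row disappear when the set is built, which is why the two rows compare as equal.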

How can I modify my code to consider duplicate entries as well?
Solution
You could map each line to a set, filtering out the empty strings. Then calculate the symmetric difference of the two sets and check whether the resulting set is empty.
#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        seta = set([x for x in linea if len(x) > 0])
        setb = set([x for x in lineb if len(x) > 0])
        print(len(seta^setb)==0)
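
To show what the symmetric difference contains, here is a small sketch on hard-coded rows. The helper name `nonempty_set` is hypothetical, and the value `XXX` is invented for the mismatch case; it does not appear in the example files.

```python
def nonempty_set(fields):
    """Return the set of non-empty fields in one CSV row."""
    return set(x for x in fields if len(x) > 0)

# Third line of the example files: same fields, different order and padding.
same = nonempty_set(["GGG", "HHH", "III"]) ^ nonempty_set(["", "", "GGG", "III", "HHH"])

# A made-up mismatch: "XXX" is a hypothetical value used only for illustration.
diff = nonempty_set(["GGG", "HHH", "III"]) ^ nonempty_set(["GGG", "HHH", "XXX"])

print(len(same) == 0)   # an empty symmetric difference means the rows match
print(sorted(diff))     # the fields that appear in exactly one of the two rows
```

The `^` operator returns the elements that are in one set but not in both, so an empty result means the two rows hold the same distinct values.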
You can also write this more compactly:
for seta in (set([x for x in l if len(x) > 0]) for l in a):
    setb = set([x for x in next(b) if len(x) > 0])
    print(len(seta^setb)==0)
UPDATE

To keep things simple, one can of course just check
seta == setb
Sorry for the confusion...
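
Note that the set comparison above still treats rows with duplicated fields as equal. One possible way to account for duplicates while still ignoring field order is to compare multisets with `collections.Counter` from the standard library. This is a different technique from the answer above, shown only as a minimal sketch: the function names `field_counts` and `mismatched_lines` are hypothetical, and the file contents are inlined as lists of lines instead of being read from disk.

```python
import csv
from collections import Counter

def field_counts(fields):
    """Count each non-empty field; order is ignored, duplicates are kept."""
    return Counter(x for x in fields if x)

def mismatched_lines(lines_a, lines_b):
    """Return 1-based row numbers whose non-empty fields differ as multisets."""
    mismatches = []
    # csv.reader accepts any iterable of lines, e.g. a list or an open file.
    rows = zip(csv.reader(lines_a), csv.reader(lines_b))
    for number, (row_a, row_b) in enumerate(rows, start=1):
        if field_counts(row_a) != field_counts(row_b):
            mismatches.append(number)
    return mismatches

# Duplicate example from the question, inlined for illustration.
a_lines = ["1,2,,", "1,2,2,3,4"]
b_lines = ["1,2,2,2", "1,2,3,4"]
result = mismatched_lines(a_lines, b_lines)
print(result)
```

With real files, the two open file objects can be passed in place of the lists. Because `zip` stops at the shorter input, extra trailing rows in the longer file are not reported by this sketch.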
