尝试比较两个csv文件并将差异写为输出 [英] Trying to compare two csv files and write differences as output

查看:338
本文介绍了尝试比较两个csv文件并将差异写为输出的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在开发一个脚本,该脚本将两个csv文件之间的差异并以一个新的csv文件作为输出,但只有两个输入文件之间相同的两行(指行号)包含不同的数据时才有差异,例如在文件1中,第3行具有"mike",篮球运动员",在文件2中,第3行具有"mike",棒球运动员".输出的csv会抓取这些打印出来的内容,并将它们写入到csv中.它可以工作,但是存在一些问题(我知道这个问题之前也曾被问过几次,但是其他人对我的回答却有所不同,并且由于我是编程的新手,所以我不太了解他们的代码).

I'm developing a script which takes the difference between 2 csv files and makes a new csv file as output with the differences BUT only if the same 2 rows (refers to row number) between the two input files contain different data e.g. row 3 has "mike", "basketball player" in file 1 and row 3 in file 2 has "mike", "baseball player". The output csv would grab these print them and write them to a csv. It works but there are some issues (I know that this question has also been asked several times before but others have done it differently to me and since I'm fairly new to programming I don't quite understand their codes).

新的csv文件中的输出在每个单元格中都有输出的每个字母(请参见下图),我相信这与定界符/quotechar/引号第37行有关.我希望它们在自己的单元格中没有任何分隔符句号,多个空格,逗号或"|".

The output in the new csv file has each letter of the output in each cell (see image below) and I believe its something to do with the delimiter/quotechar/quoting line 37. I want them in their own cells without any fullstops, multiple spaces, commas or "|".

另一个问题是运行需要很长时间.我正在处理多达50,000行的数据集,这可能需要一个多小时才能运行.为什么会这样?有什么建议可以加快速度?将某些内容放在for循环之外?我早些时候确实尝试了difflib方法,但是我只能打印整个"input_file1",而不能将该文件与另一个文件进行比较.

Another issue is that it takes a long time to run. I'm working with datasets of up to 50,000 rows and it can take over an hour to run. Why is this and what advice would be useful to speed it up? Put something outside of the for loop maybe? I did try the difflib method earlier on but I was only able to print the entire "input_file1" but not compare that file with another.

# aim of script is to compare csv files and output difference as a new csv

# import necessary libraries
import csv

# File1 = open(raw_input("path:"),"r") #filename, mode
# File2 = open(raw_input("path:"),"r") #filename, mode

# selects the 2 input files to be compared
input_file1 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book1.csv"
input_file2 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book2.csv"
# creates the blank output csv file
output_path = "G:/savestuffhereqwerty/electorate_meshblocks/outputs/output2.csv"
a = open(input_file1, "r")
output_file = open(output_path,"w")
output_file.close()
count = 0

with open(input_file1) as fp1:


    for row_number1, row_value1 in enumerate(fp1):
        if row_number1 == count:
            print "got to 1st point"
            value1 = row_value1

            with open(input_file2) as fp2:
                for row_number2, row_value2 in enumerate(fp2):
                    if row_number2 == count:
                        print "got to 2nd point"
                        value2 = row_value2

                        if value1 == value2:
                            print value1, value2
                        else:
                            print value1, value2
                            with open(output_path, 'wb') as f:
                                writer = csv.writer(f, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                                # testing to see if the code writes text to the csv
                                writer.writerow(["test1"])
                                writer.writerow(["test2", "test3", "test4"])
                                writer.writerows([value1, value2])
                                print "code reached writing stage"
        count += 1
        print count
print "done"
# replace(",",".")

推荐答案

由于您要逐行比较两个文件,因此您应该遍历第二个文件以获取第一个文件中的每行.您可以简单地 zip 两个csv阅读器并过滤行:

Since you want to compare the two files line-by-line, you should not loop through the second file for every line in the first file. You can simply zip two csv readers and filter the rows:

input_file1 = "foo"
input_file2 = "bar"
output_path = "baz"

with open(input_file1) as fin1:
  with open(input_file2) as fin2:
    read1 = csv.reader(fin1)
    read2 = csv.reader(fin2)
    diff_rows = (row1 for row1, row2 in zip(read1, read2) if row1 != row2)
    with open(output_path, 'w') as fout:
      writer = csv.writer(fout)
      writer.writerows(diff_rows)

此解决方案假定两个文件的行数相同.

This solution assumes that the two files have the same number of lines.

这篇关于尝试比较两个csv文件并将差异写为输出的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆