Optimize python file comparison script


Problem description


I have written a script which works, but I'm guessing isn't the most efficient. What I need to do is the following:


  • Compare two csv files containing user information. Essentially, it is a membership list where one file is an updated version of the other.

  • The files contain information such as ID, name, status, etc.

  • Write only the records from the new file that either do not exist in the older file or contain updated information to a third csv file. Each record has a unique ID that lets me determine whether a record is new or previously existing.


Here is the code I have written so far:

import csv

# Text mode with newline='' is what the csv module expects on Python 3.
fileAin = open('old.csv', 'r', newline='')
fOld = csv.reader(fileAin)

fileBin = open('new.csv', 'r', newline='')
fNew = csv.reader(fileBin)

fileCout = open('NewAndUpdated.csv', 'w', newline='')
fNewUpdate = csv.writer(fileCout)

old = []
new = []

for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)

num = 0

while num < len(new):
    # Linear scan of the old list on every iteration.
    if new[num] not in old:
        fNewUpdate.writerow(new[num])

    num += 1

fileAin.close()
fileBin.close()
fileCout.close()


In terms of functionality, this script works. However, I'm trying to run it on files that contain hundreds of thousands of records, and it takes hours to complete. I am guessing the problem lies with reading both files into lists and treating the entire row of data as a single string for comparison.


My question is: for what I am trying to do, is there a faster, more efficient way to process the two files to create a third file containing only new and updated records? I don't really have a target time; I mostly want to understand whether there are better ways in Python to process these files.

Thanks in advance for any help.


UPDATE to include sample row of data:


123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A

Recommended answer


How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old every time: because old is a list, each check has to scan the entire list. Using a dictionary keyed by the record ID is much faster, since lookups take constant time on average.

import csv

# Text mode with newline='' is what the csv module expects on Python 3.
fileAin = open('old.csv', 'r', newline='')
fOld = csv.reader(fileAin)

fileBin = open('new.csv', 'r', newline='')
fNew = csv.reader(fileBin)

fileCout = open('NewAndUpdated.csv', 'w', newline='')
fNewUpdate = csv.writer(fileCout)

# Key each record by its unique ID (the first column).
old = {row[0]: row[1:] for row in fOld}
new = {row[0]: row[1:] for row in fNew}
fileAin.close()
fileBin.close()

output = {}

for row_id in new:
    if row_id not in old or old[row_id] != new[row_id]:
        output[row_id] = new[row_id]

for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])


fileCout.close()
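As a quick sanity check, the dictionary-based diff above can be exercised on a couple of small in-memory "files" without touching disk. The helper name `diff_records` and the shortened rows below are ours for illustration; they just mirror the ID-first layout of the sample row:

```python
import csv
import io

def diff_records(old_rows, new_rows):
    """Return rows from new_rows that are absent from old_rows or changed.

    Each row's first column is assumed to be the unique record ID.
    """
    old = {row[0]: row[1:] for row in old_rows}
    out = []
    for row in new_rows:
        row_id, rest = row[0], row[1:]
        # Constant-time dict lookup instead of a linear scan of a list.
        if row_id not in old or old[row_id] != rest:
            out.append(row)
    return out

# Two small CSV "files" in memory, shortened versions of the sample layout.
old_csv = "1,34,DOE,JOHN,A\n2,99,ROE,JANE,A\n"
new_csv = "1,34,DOE,JOHN,B\n2,99,ROE,JANE,A\n3,12,POE,JIM,A\n"

old_rows = list(csv.reader(io.StringIO(old_csv)))
new_rows = list(csv.reader(io.StringIO(new_csv)))

changed = diff_records(old_rows, new_rows)
print(changed)  # record 1 changed its last field, record 3 is brand new
```

Record 2 is identical in both inputs, so only the updated and new records come back, which is exactly the behavior the third output file needs.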
