比较两个文件以了解python中的差异 [英] Compare two files for differences in python
问题描述
我想比较两个文件(从第一个文件中取出一行,然后在整个第二个文件中查找),以查看它们之间的差异,并将缺少的行从fileA.txt写入fileB.txt的末尾。我是python的新手,所以我第一次想到这样的简单程序:
I want to compare two files (take line from first file and look up in whole second file) to see differences between them and write missing line from fileA.txt to end of fileB.txt. I am new to python so at first time I thought abou simple program like this:
import difflib
file1 = "fileA.txt"
file2 = "fileB.txt"
diff = difflib.ndiff(open(file1).readlines(),open(file2).readlines())
print ''.join(diff),
但结果我得到了两个文件的组合,每行带有合适的标签。我知道我可以查找以标签-开头的行,然后将其写入文件fileB.txt的末尾,但是对于大文件(〜100 MB),此方法效率不高。有人可以帮我改进程序吗?
but in result I have got a combination of two files with suitable tags for each line. I know that I can look for line start with tag "-" and then write it to end of file fileB.txt, but with huge file (~100 MB) this method will be inefficient. Can somebody help me to improve program?
文件结构如下:
输入:
fileA.txt
fileA.txt
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)
fileB.txt
fileB.txt
Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
输出:
fileB_after.txt
fileB_after.txt
Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)
推荐答案
在 bash
中尝试以下操作:
cat fileA.txt fileB.txt | sort -M | uniq > new_file.txt
排序-M :
根据初始字符串排序,该字符串由任意数量的空格组成,后跟
,并带有月份名称缩写,折叠成大写字母,并以'JAN'< FEB< ...< ‘DEC’。无效的名称将
低至有效的名称。 LC_TIME语言环境确定月份
的拼写。
sort -M: sorts based on initial string, consisting of any amount of whitespace, followed by a month name abbreviation, is folded to UPPER case and compared in the order 'JAN' < 'FEB' < ... < 'DEC'. Invalid names compare low to valid names. The `LC_TIME' locale determines the month spellings.
uniq:过滤出文件中重复的行。
uniq: filters out repeated lines in a file.
|:将一个命令的输出传递给另一个命令以进行进一步处理。
|: passes the output of one command to another for further processing.
这将是什么要做的就是获取两个文件,按照上述方法对它们进行排序,保留唯一项并将其存储在 new_file.txt
What this will do is take the two files, sort them in the way described above, keep the unique items and store them in new_file.txt
注意:这不是python解决方案,但您已使用 linux
标记了该问题,因此我认为您可能会感兴趣。您也可以在此处找到有关所使用命令的更多详细信息。
Note: This is not a python solution but you have tagged the question with linux
so I thought it might interest you. Also you can find more detailed info about the commands used, here.
这篇关于比较两个文件以了解python中的差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!