Compare files line by line to see if they are the same, if so output them

Question

How would I go about this? I have files whose contents I have already sorted, and I want to compare a certain field (index) in one file with a field in another. One problem is that the files are enormous: millions of lines. I want to compare the files line by line and, when two lines match, write out both values along with other values selected by index.

======================

=======================

Let me clarify. Take a field, say line[x]; x stays the same for every line because the file is formatted uniformly. I want to compare line[x] against line[y] in another file, do this for the whole file, and output every matching pair to a third file. In that output file I also want to be able to include other fields from the first file, simply by adding more indexes such as line[a], line[b], line[c], line[d], and finally line[y] as the match for that information.

Attempt 3:

I have a file with information in this format:

    # x is a line
    x= data,data,data,data,data,data

There are millions of these lines.

I have another file in the same format:

    x is a line
    x= data,data,data,data

I want to take x[#] from the first file and x[#] from the second file and check whether the two values match. If they do, I want to output them along with several other x[#] values from the second file that are on the same line.

Did that help at all? The files are formatted as I said (but there are millions of lines, and I want to find the pairs across the two files because they should all match up):

  line 1  data,data,data,data
  line 2  data,data,data,data

Data in file 1:

 (N'068D556A1A665123A6DD2073A36C1CAF', N'A76EEAF6D310D4FD2F0BD610FAC02C04DFE6EB67',    
N'D7C970DFE09687F1732C568AE1CFF9235B2CBB3673EA98DAA8E4507CC8B9A881');

Data in file 2:

00000040f2213a27ff74019b8bf3cfd1|index.docbook|Redhat 7.3 (32bit)|Linux
00000040f69413a27ff7401b8bf3cfd1|index.docbook|Redhat 8.0 (32bit)|Linux
00000965b3f00c92a18b2b31e75d702c|Localizable.strings|Mac OS X 10.4|OSX
0000162d57845b6512e87db4473c58ea|SYSTEM|Windows 7 Home Premium (32bit)|Windows
000011b20f3cefd491dbc4eff949cf45|totem.devhelp|Linux Ubuntu Desktop 9.10 (32bit)|Linux

The files are sorted in alphanumeric order, and I want to use a slider method. By that I mean: if file1[x] < file2[x], move the slider down or up depending on which value is greater, until a match is found. When a match is found, print it along with the other values that identify that hash.
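The "slider" described here amounts to a two-pointer walk over two sorted sequences. A minimal sketch of that idea, assuming the keys are unique within each input, sorted ascending, and directly comparable (for example, fixed-width lowercase hex strings); `matching_keys` is a hypothetical helper name, not something from the original post:

```python
def matching_keys(keys1, keys2):
    """Yield every key that appears in both sorted iterables.

    Assumes each iterable is sorted ascending and has no duplicate keys.
    """
    it1, it2 = iter(keys1), iter(keys2)
    try:
        k1, k2 = next(it1), next(it2)
        while True:
            if k1 < k2:
                k1 = next(it1)      # slide forward in the first input
            elif k2 < k1:
                k2 = next(it2)      # slide forward in the second input
            else:
                yield k1            # match found
                k1, k2 = next(it1), next(it2)
    except StopIteration:
        # one input is exhausted, so no further matches are possible
        return
```

Because each input is read once from start to end, the walk works on file iterators as well as lists and never needs to hold a whole file in memory.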

The result I want is:

file1[x] and its corresponding match file2[x] written to a file, as well as other file1[x] values, where x can be any index from the line.

Recommended answer

What I got from the clarification:

  • file1 and file2 are in the same format, where each line looks like

{32 char hex key}|{text1}|{text2}|{text3}

  • the files are sorted in ascending order by key

    for each key that appears in both file1 and file2, you want merged output, so each line looks like

    {32 char hex key}|{text11}|{text12}|{text13}|{text21}|{text22}|{text23}
    

  • You basically want the collisions from the merge step of a merge sort:

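    import csv

    def getnext(reader, key=lambda row: int(row[0], 16)):
        # Return (numeric key, row); raises StopIteration at end of file.
        row = next(reader)
        return key(row), row

    # newline='' is what the csv module expects for text-mode files in Python 3
    with open('file1.dat', newline='') as inf1, \
         open('file2.dat', newline='') as inf2, \
         open('merged.dat', 'w', newline='') as outf:
        a = csv.reader(inf1, delimiter='|')
        b = csv.reader(inf2, delimiter='|')
        res = csv.writer(outf, delimiter='|')

        try:
            a_key, a_row = getnext(a)
            b_key, b_row = getnext(b)
            while True:
                if a_key < b_key:
                    a_key, a_row = getnext(a)
                elif b_key < a_key:
                    b_key, b_row = getnext(b)
                else:
                    # keys match: write the merged row, then advance file1
                    # so the same pair is not written again
                    res.writerow(a_row + b_row[1:])
                    a_key, a_row = getnext(a)
        except StopIteration:
            # reached the end of an input file
            pass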
    

    I still have no idea what you are trying to communicate by 'as well as other file1[x] where x can be any index from the line'.
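    One possible reading of that phrase is that only selected columns from each line should end up in the output. A minimal sketch under that assumption; `merge_rows`, `a_cols`, and `b_cols` are hypothetical names introduced here for illustration, and the rows are already split into lists of fields:

```python
def merge_rows(a_row, b_row, a_cols=(1, 2, 3), b_cols=(1, 2, 3)):
    """Build one output row: the shared key, then the chosen columns.

    a_cols / b_cols are the (hypothetical) column indexes to keep from
    the file1 row and the file2 row; index 0 is the key itself.
    """
    return ([a_row[0]]
            + [a_row[i] for i in a_cols]
            + [b_row[i] for i in b_cols])

# Example using the placeholder field names from the format above
a_row = ["key", "text11", "text12", "text13"]
b_row = ["key", "text21", "text22", "text23"]
selected = merge_rows(a_row, b_row, a_cols=(1, 3), b_cols=(2,))
```

    A row built this way could be passed to the writer in place of the full concatenation of both rows, if that is indeed what the question intends.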
