在PYTHON中的两个CSV文件中查找公共区域 [英] Find Common Region in two CSV File in PYTHON

查看:93
本文介绍了在PYTHON中的两个CSV文件中查找公共区域的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个CSV文件,每个文件有10列,其中第一列称为主键".

I have two CSV files with 10 columns each where the first column is called the "Primary Key".

我需要使用Python查找两个CSV文件之间的公共区域.例如,我应该能够检测到CSV1中的第27-45行等于CSV2中的第125-145行,依此类推.

I need to use Python to find the common region between the two CSV files. For example, I should be able to detect that rows 27-45 in CSV1 is equal to rows 125-145 in CSV2 and so on.

我只比较主键(第一列).其余数据不考虑进行比较.我需要将这些公共区域提取到两个单独的CSV文件中(一个用于CSV1,一个用于CSV2).

I am only comparing the Primary Key (Column One). The rest of the data is not considered for comparison. I need to extract these common regions in two separate CSV files (one for CSV1 and one for CSV2).

我已经解析了两个CSV文件的行并将其存储在两个列表列表"(lstCAN_LOG_TABLElstSHADOW_LOG_TABLE)中,因此该问题可以减少到比较这两个列表列表中.

I have already parsed and stored the rows of the two CSV files in two 'list of lists', lstCAN_LOG_TABLE and lstSHADOW_LOG_TABLE, so the problem reduces down to comparing these two list of lists.

我目前假设的是,如果以后有10个匹配项(MAX_COMMON_THRESHOLD),则说明我已经到达一个公共区域的开头.我不能记录单行(与true相比),因为会有相等的区域(按主键)和需要标识的区域.

I am currently assuming is that if there are 10 subsequent matches (MAX_COMMON_THRESHOLD), I have reached the beginning of a common region. I must not log single rows (comparing to true) because there would be regions equal (As per primary key) and those regions I need to identify.

for index in range(len(lstCAN_LOG_TABLE)):
    for l_index in range(len(lstSHADOW_LOG_TABLE)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]):  #Consider for comparison only CAN IDs
            index_can_log = index                                           #Position where CAN Log is to be compared
            index_shadow_log = l_index                                      #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
            can_index = index
            bPreScreened = 1
            for num in range(start,end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1                           
                    else:
                        break   
                else:
                    bPreScreened = 0
                    print("No Match")
                    break
            #we might have found start of common region         
            if(bPreScreened == 1):      
                print("Start={0} End={1} can_index={2}".format(start,end,can_index))
                for number in range(start,end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):                           
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1                           
                        else:
                            dump_file.close()   
                            print("\nCommon Region in Two CSVs identifed and recorded\n")                           
                            return
dump_file.close()   
print("\nCommon Region in Two CSVs identifed and recorded\n")

我得到奇怪的输出.即使第一个CSV文件也只有1880行,但是在第一个CSV的公共区域CSV中,我得到了更多的条目.我没有得到想要的输出.

I am getting strange output. Even the first CSV file has only 1880 Rows but in the common region CSV for the first CSV I am getting many more entries. I am not getting desired output.

在此处编辑

CSV1:

216 0.000238225 F4  41  C0  FB  28  0   0   0   MS CAN
109 0.0002256   15  8B  31  0   8   43  58  0   HS CAN
216 0.000238025 FB  47  C6  1   28  0   0   0   MS CAN
340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

CSV2:

216 0.0002361   0F  5C  DB  14  28  0   0   0   MS CAN
216 0.000236225 16  63  E2  1B  28  0   0   0   MS CAN
109 0.0001412   16  A3  31  0   8   63  58  0   HS CAN
216 0.000234075 1C  6A  E9  22  28  0   0   0   MS CAN
40A 0.000259925 C1  1   46  54  30  44  47  36  HS CAN
4A  0.000565975 2   0   0   0   0   0   0   C0  MS CAN
340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

预期输出CSV1:

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

预期输出CSV2:

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

观察到的输出CSV1

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

以及数千个冗余行数据

已编辑-已按建议解决(更改为白色):

学习: 在Python FOR中无法在运行时更改循环索引

dump_file=open("MATCH_PATTERN.txt",'w+')
print("Number of Entries CAN LOG={0}".format(len(lstCAN_LOG_TABLE)))
print("Number of Entries SHADOW LOG={0}".format(len(lstSHADOW_LOG_TABLE)))  
index = 0   
while(index < (input_file_one_row_count - 1)):
    l_index = 0
    while(l_index < (input_file_two_row_count - 1)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]):  #Consider for comparison only CAN IDs
            index_can_log = index                                           #Position where CAN Log is to be compared
            index_shadow_log = l_index                                      #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            can_index = index
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))              
            bPreScreened = 1
            for num in range(start,end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):                      
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1                           
                    else:
                        break   
                else:
                    bPreScreened = 0
                    break
            #we might have found start of common region         
            if(bPreScreened == 1):      
                print("Shadow Start={0} Shadow End={1} CAN INDEX={2}".format(start,end,index))
                for number in range(start,end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):                           
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1
                        if((l_index + 1) < (input_file_two_row_count-1)):
                            l_index = l_index + 1                               
                        else:
                            dump_file.close()   
                            print("\nCommon Region in Two CSVs identifed and recorded\n")                           
                            return
            else:
                l_index = l_index + 1
        else:
            l_index = l_index + 1
    index = index + 1   
dump_file.close()   
print("\nCommon Region in Two CSVs identifed and recorded\n")

推荐答案

indexfor循环中的迭代器.如果您在循环中更改了它,它将在每次迭代后重新分配.

index is the iterator in your for loop. Should you changed it inside the loop, it will be reassigned after each iteration.

说,在for循环中的index = 5index += 1被执行3次时.现在index = 8.但是,在此迭代结束之后,当您的代码返回到for时,索引将被分配给index x = 6.

Say, when index = 5 in your for loop, and index += 1 was execute 3 times. Now index = 8. But after this iteration ends, when your code go back to for, index will be assigned to index x = 6.

尝试以下示例:

for index in range(0,5):
    print 'iterator:', index
    index = index + 2
    print 'index:', index

输出将是:

iterator: 0
index: 2
iterator: 1
index: 3
iterator: 2
index: 4
iterator: 3
index: 5
iterator: 4
index: 6

要解决此问题,您可能需要将for循环更改为while循环

To fix this issue, you might want to change your for loop to while loop

如果我不理解错误,则说明您试图在两个文件中查找相同"列并将其存储. 如果是这种情况,实际上可以通过使用以下代码轻松完成您的工作:

If I didn't understand wrong, you were trying to find 'same' columns in two files and store them. If this is the case, actually your work can be done easily by using following code:

import csv # import csv module to read csv files

file1 = 'csv1.csv' # input file 1
file2 = 'csv2.csv' # input file 2
outfile = 'csv3.csv' # only have one output file since two output files will be the same

read1 = csv.reader(open(file1, 'r')) # read input file 1
write = csv.writer(open(outfile, 'w')) # write to output file

# for each row in input file 1, compare it with each row in input file 2
# if they are the same, write that row into output file
for row1 in read1:
    read2 = csv.reader(open(file2, 'r'))
    for row2 in read2:
        if row1 == row2:
            write.writerow(row1)

read1.close()
write.close()

这篇关于在PYTHON中的两个CSV文件中查找公共区域的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆