比较多个csv文件并查找匹配项 [英] Comparing multiple csv files and finding matches

查看:85
本文介绍了比较多个csv文件并查找匹配项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个包含csv文件的文件夹.一组主"文件和一组不匹配"文件.在主文件(〜25个文件,总共约50,000行)中,有唯一的ID.不匹配文件的每一行(约250个文件,总共约700,000行)在该行中应具有与一个主文件中的单个ID匹配的ID.在每个不匹配的文件中,所有ID都应与一个主文件匹配.此外,所有不匹配的ID都应归入一个主服务器中.

不幸的是,列并不总是一致的,并且id字段可能出现在row [2]或row [155]中. (我为此使用python)最初是使用set.intersection并查找长度> 5(缺少标记为."或我想避免的空格的值)的匹配实例.运行时间太长.一般来说,我需要将"unmatched"文件与其"master"文件进行匹配,并且我想使用"id"来获取"unmatched"文件中的列索引.因此,如果不匹配的文件unmatched_a的ID主要落在master_d之下,而unmatched_a中的匹配列是第35列,则它将返回一行:

unmatched_a,master_d,35

抱歉,如果不清楚,我很乐意尝试澄清是否需要.关于stackoverflow的第一篇文章.我可以发布到目前为止的代码,但是我认为这没有用,因为问题出在我比较多个(相对)大型csv文件的方法上.我看到很多帖子比较两个csv文件或已知index_id的文件,但是没有多个文件和多个具有潜在匹配项的文件.

解决方案

您必须首先将所有主文件读入内存-这是不可避免的,因为匹配的ID可能在主文件中的任何位置. /p>

然后,对于每个不匹配的文件,您可以读取第一条记录并找到其ID(为您提供ID列),然后找到包含该ID的主文件(为您提供匹配的主文件).根据您的描述,一旦您匹配了第一条记录,所有其他ID都将在同一个文件中,因此您就完成了.

将id读入一个集合中-检查成员资格是否为O(1).将每个集合放入以master_file的名称为键的dict中. O(n)遍历主命令.因此,主文件数和不匹配文件数为O(nm).

import csv

def read_master_file(master_file):
    with open(master_file, "r") as file:
        reader = csv.reader(file)
        ids = set(line[0] for line in file) # I assumed the id is the first value in each row in the master files. Depending on the file format you will want to change this.
    return ids

def load_master_files(file_list):
    return {file: read_master_file(file) for file in file_list}

def check_unmatched_file(unmatched_file, master_dict):
    with open(unmatched_file, "r") as file:
        reader = csv.reader(file)
        record = next(reader)
    for id_column in [2, 155]: # if you can identify an id by semantics, rather than by attempting to match it against the masters, you can reduce running time by 25% by finding the id before this step
        id = record[id_column]
        for master in master_dict:
            if id in master_dict[master]:
                return unmatched_file, master, id
    return None # id not in any master. Feel free to return or raise whatever works best for you

I have two folders with csv files. A group of 'master' files and a group of 'unmatched' files. Within the master files (~25 files, about 50,000 lines in total), there are unique ids. Each row of the unmatched files (~250 files, about 700,000 lines in total) should have an id in the row that matches a single id in one of the master files. Within each of the unmatched files, all id's should match with a single master file. Further, all ids in the unmatched should fall within a single master.

Unfortunately, the columns aren't always consistent, and the id field may appear in row[2] or row[155]. (I'm using python for this) I was originally using set.intersection and finding matched instances where the length > 5 (there are missing values marked with a '.' or just a blank that I wanted to avoid.) but quickly learned the runtime is far too long. Generally speaking, I need to match the 'unmatched' files with its 'master' file, and I'd like to have the column index from the 'unmatched' file with the id used. So if the unmatched file unmatched_a has ids that mostly fall under master_d, and the matching column in unmatched_a is column 35, it would return a line:

unmatched_a, master_d, 35

Apologies if this is unclear -- I'd be happy to try and clarify if need be. First post on stackoverflow. I can post the code I've so far, but I don't think it would be useful because the issue is with my method of comparing multiple (relatively) large csv files. I saw a lot of posts comparing two csv files or files where the index_id is known, but nothing with multiple files and multiple files with potential matches.

解决方案

You'll have to start by reading all the master files into memory -- this is unavoidable, since the matching ids could be anywhere in the master files.

Then, for each unmatched file, you can read the first record and find its id (giving you the id column), then find the master file containing that id (giving you the matched master file). According to your description, once you've matched the first record, all the rest of the ids will be in the same file, so you're done.

Read the ids into a set -- checking membership is O(1). Put each set into a dict keyed to the name of the master_file. Iterating over the dict of masters is O(n). So this is O(nm) for the number of master files and the number of unmatched files.

import csv

def read_master_file(master_file):
    with open(master_file, "r") as file:
        reader = csv.reader(file)
        ids = set(line[0] for line in file) # I assumed the id is the first value in each row in the master files. Depending on the file format you will want to change this.
    return ids

def load_master_files(file_list):
    return {file: read_master_file(file) for file in file_list}

def check_unmatched_file(unmatched_file, master_dict):
    with open(unmatched_file, "r") as file:
        reader = csv.reader(file)
        record = next(reader)
    for id_column in [2, 155]: # if you can identify an id by semantics, rather than by attempting to match it against the masters, you can reduce running time by 25% by finding the id before this step
        id = record[id_column]
        for master in master_dict:
            if id in master_dict[master]:
                return unmatched_file, master, id
    return None # id not in any master. Feel free to return or raise whatever works best for you

这篇关于比较多个csv文件并查找匹配项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆