如何在不等行的情况下通过公共列合并两个csv文件? [英] How can I merge two csv files by a common column, in the case of unequal rows?

查看:116
本文介绍了如何在不等行的情况下通过公共列合并两个csv文件?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组100个文件。 50个文件,包含每个美国州的人口普查信息。另外五十个是需要与每个州的正确文件合并的地理数据。

I have a set of 100 files. 50 files containing census information for each US state. The other fifty are geographic data that need to be merged with the correct file for each state.

对于每个州,人口普查文件及其对应的地理文件通过常见变量LOGRECNO,即人口普查文件中的第10列和geo文件中的第7列。

For each state, the census file and its corresponding geo file are related by a common variable, LOGRECNO, that is the 10th column in the census file and the 7th column in the geo file.

问题是,geo文件的行数大于人口普查文件;我的人口普查数据不包括地理位置的某些子集,因此比geo数据文件少的行。

The problem is that the geo file has more rows than the census file; my census data does not cover certain subsets of geographic locations and hence has fewer rows than the geo data file.

如何将人口普查数据与地理日期合并只有普查数据存在的行/地理位置,不在乎其余)?

How can I merge the census data with the geographic date (keeping only the rows/geo locations where census data exists, don't care about the rest)?

我是Python的新手,我有点知道如何做基本的csv文件i / o在python中。

I am a newbie to Python and I somewhat know how to do basic csv file i/o in python. Manipulating 2 csvs at the same time is proving confusing.

示例:

sample_state_census.csv

sample_state_census.csv

Varname 1 Varname 2 ... Varname 10 (LOGRECNO) ... Varname 16000
xxx       xxx    ...       1             ...               xxx
xxx       xxx    ...       2             ...               xxx
...
...
xxx       xxx   ...        514           ...                xxx
xxx       xxx   ...        1312          ...                xxx
...
...
xxx       xxx   ...        1500          ...                xxx

sample_state_geo.csv

sample_state_geo.csv

GeoVarname 1 GeoVarname 2 ... GeoVarname 7 (LOGRECNO) ... GeoVarname 65
yyy       yyy    ...       1             ...               yyy
yyy       yyy    ...       2             ...               yyy
...
...
yyy      yyy  ...        514           ...                yyy
yyy      yyy   ...        515          ...                yyy
...
...
yyy     yyy  ...        1500          ...                yyy

预期输出(不要为LOGRECNO的值合并行,

Expected output (don't merge rows for values of LOGRECNO that don't exist in sample_state_census.csv)

Varname 1 Varname 2 ... Varname 10 (LOGRECNO) GeoVarname 1 GeoVarname 2 ... GeoVarname 65 Varname 11... Varname 16000 
xxx       xxx    ...       1  yyy yyy ... yyy xxx            ...               xxx
xxx       xxx    ...       2 yyy yyy ... yyy xxx            ...               xxx
...
...
xxx       xxx   ...        514    yyy yyy ... yyy xxx       ...                xxx
xxx       xxx   ...        1312      yyy yyy ... yyy xxx    ...                xxx
...
...
xxx       xxx   ...        1500    yyy yyy ... yyy xxx      ...                xxx


推荐答案

将数据从较短的文件读入内存, c> LOGRECNO row:

Read data from the shorter file into memory, into a dictionary keyed on the LOGRECNO row:

import csv

with open('sample_state_census.csv', 'rb') as census_file:
    reader = csv.reader(census_file, delimiter='\t')
    census_header = next(reader, None)  # store header
    census = {row[9]: row for row in reader}

对地理数据,写出匹配:

then use this dictionary to match against the geo data, write out matches:

with open('sample_state_geo.csv', 'rb') as geo_file:
    with open('outputfile.csv', 'wd') as outfile:
        reader = csv.reader(geo_file, delimiter='\t')
        geo_header = next(reader, None)  # grab header
        geo_header.pop(6) # no need to list LOGRECNO header twice

        writer = csv.writer(outfile, delimiter='\t')
        writer.writerow(census_header + geo_header)

        for row in reader:
            if row[6] not in census:
                # no census data for this LOGRECNO entry
                continue
            # new row is all of the census data plus all of geo minus column 7
            newrow = census[row[6]] + row[:6] + row[7:]
            writer.writerow(newrow)

这一切都假设人口普查文件不大,多记忆。如果是这种情况,你将不得不使用数据库(读取所有数据到SQLite数据库,匹配同样的地理数据)。

This all assumes the census file is not so big as to take up too much memory. If that's the case you'll have to use a database instead (read all data into a SQLite database, match in the same vein agains the geo data).

这篇关于如何在不等行的情况下通过公共列合并两个csv文件?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆