Merge one column from variable number of csv files into one csv file

Question

Novice Python programmer here. I know there are a lot of SO posts relating to this, but none of the solutions I've reviewed seem to fit my problem.

I have a variable number of csv files, all with the same number of columns. The header for the fourth column will change with each csv file (it's a Julian date). Incidentally, this fourth column stores surface temperatures from a satellite sensor. As an example:

UID,Latitude,Longitude,001
1,-151.01,45.20,13121
2,-151.13,45.16,15009
3,-151.02,45.09,10067
4,-151.33,45.03,14010

I would like to keep the first four columns (preferably from the first csv file in my list of files), and then join/merge the fourth column from all the remaining csv files to this first table. The final table will look something like this:

UID,Latitude,Longitude,001,007,015,023,...
1,-151.01,45.20,13121,13129,13340,12995
2,-151.13,45.16,15009,15001,14997,15103
3,-151.02,45.09,10067,11036,10074,10921
4,-151.33,45.03,14010,14005,14102,14339

I know the Pandas package would probably be an easier way to do this, but I'd rather not require third party packages (requiring the user to use easy_install, pip, etc.) in this tool. I also realize this would be much simpler in an RDBMS, but again, I don't want that to be a requirement. So I'm only using the csv module.

I think I understand how to do this, and I'm assuming I should write the merged rows to a new csv file. I've gotten as far as pulling out the headers from the first csv file, then looping through each of the subsequent csv files to add the new column name to the header row. Where I'm coming up short is how to write values from the fourth column only in addition to the rows from the first csv file. All csv files have UID columns, which should match.

import csv

def build_table(acq_date_list, mosaic_io_array, input_dir, dir_list):
    acq_year = mosaic_io_array[0][0]
    out_dir = '%s\\%s\\' % (input_dir, dir_list[1])
    out_file = '%s%s_%s.%s' % (out_dir, 'LST_final', acq_year, 'csv')
    # get first csv file in the list of files
    first_file = acq_date_list[0][1]
    # open and read the first csv file
    with open(first_file, 'rb') as first_csv:
        r1 = csv.reader(first_csv, delimiter=',')
        header1 = next(r1)
        allrows1 = []
        row1 = next(r1)
        allrows1.append(row1)
    # open and write to the new csv
    with open(out_file, 'wb') as out_csv:
        w = csv.writer(out_csv, delimiter=',')
        # loop through the list of remaining csv files
        for acq_date in acq_date_list[1:]:  # skip the first csv file
            # open and read other csv files
            with open(acq_date[1], 'rb') as other_csv:
                rX = csv.reader(other_csv, delimiter=',')
                headerX = next(rX)
                header_row = '%s,%s' % (header1, headerX)

                # write header and subsequent merged rows to new csv file?

Maybe after:

headerX = next(rX)

I can split the header row into a list, and pull out the fourth item? Would this also work for the remaining rows in the "other" csv files? Or is this just generally the wrong approach?

UPDATE 2/26/2016 I actually only got the solution by Gijs to partially work. The header columns are iteratively added, but not the rest of the values from the row. I'm still unsure how to fill in the empty cells with values from the remaining csv files.

Latitude,001,UID,Longitude,009,017,025,033,041
795670.198,13506,0,-1717516.429,,,,,
795670.198,13173,1,-1716125.286,,,,,
795670.198,13502,2,-1714734.143,,,,,

Recommended answer

Loop through the files, read each one with csv.DictReader, keep track of which keys exist, and write all records with csv.DictWriter.

import csv

records = list()
all_keys = set()
for fn in ["table_1.csv", "table_2.csv"]:
    with open(fn) as f:
        reader = csv.DictReader(f)
        # collect every column name seen in any file
        all_keys.update(set(reader.fieldnames))
        for r in reader:
            records.append(r)

# "wb" is the Python 2 mode for csv output;
# on Python 3 use open("table_merged.csv", "w", newline="") instead
with open("table_merged.csv", "wb") as f:
    writer = csv.DictWriter(f, fieldnames = all_keys)
    writer.writeheader()
    for r in records:
        writer.writerow(r)

This will write an empty 'cell' for records that didn't have the column.
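
The empty cells come from the `restval` parameter of `csv.DictWriter`, which defaults to an empty string and is written for any field name missing from a row dictionary. A minimal Python 3 sketch of that behavior, using an in-memory buffer and made-up rows (assumptions for illustration, not the real files):

```python
import csv
import io

# Hypothetical rows: each one lacks one of the date columns.
rows = [
    {"UID": "1", "001": "13121"},
    {"UID": "2", "002": "15009"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["UID", "001", "002"], lineterminator="\n")
writer.writeheader()
for row in rows:
    writer.writerow(row)  # missing keys are filled with restval (default "")

merged_text = buf.getvalue()
print(merged_text)
```

Each printed row has a blank cell in the column it lacks, which is the same pattern as in the merged output of this answer.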

Using your file as both the first and the second .csv, with the last column of the second one renamed to 002 instead of 001, you would get this:

UID,Longitude,002,001,Latitude
1,45.20,,13121,-151.01
2,45.16,,15009,-151.13
3,45.09,,10067,-151.02
4,45.03,,14010,-151.33
1,45.20,13121,,-151.01
2,45.16,15009,,-151.13
3,45.09,10067,,-151.02
4,45.03,14010,,-151.33

If you want to keep the columns in a specific order, you will have to make all_keys a list, and then add only the columns in the new file that are not in all_keys.

all_keys = list()

... 
        # append only the column names not seen yet, keeping file order
        all_keys += [k for k in reader.fieldnames if k not in all_keys]
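
Note that the snippets above append the rows of every file one after another, which is why the merged table in the question's update still has empty cells. If the goal is one row per UID, one possible approach is to key the rows of the first file by UID and copy the fourth column of each later file into the matching row. The following is a sketch in Python 3 under several assumptions: the function name `merge_fourth_column` is hypothetical, the first column of every file is the UID, and the column of interest is always the fourth.

```python
import csv
from collections import OrderedDict


def merge_fourth_column(paths, out_path):
    """Join the 4th column of every file in `paths` onto the first file, by UID."""
    header = []
    rows_by_uid = OrderedDict()  # UID -> merged row, in first-file order

    for index, path in enumerate(paths):
        with open(path, newline="") as f:
            reader = csv.reader(f)  # each row is already a list of strings
            file_header = next(reader)
            if index == 0:
                # Keep all four columns of the first file.
                header = file_header[:4]
                for row in reader:
                    if row:  # skip blank lines
                        rows_by_uid[row[0]] = row[:4]
            else:
                # Only the fourth column (index 3) of the other files is used.
                header.append(file_header[3])
                values = {row[0]: row[3] for row in reader if row}
                for uid, merged in rows_by_uid.items():
                    # A UID missing from this file gets an empty cell.
                    merged.append(values.get(uid, ""))

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows_by_uid.values())
```

Called as `merge_fourth_column(["day_001.csv", "day_007.csv"], "LST_final.csv")` (hypothetical file names), this writes one row per UID from the first file, with one extra column per additional file. Because `csv.reader` already yields each row as a list, the fourth item is simply `row[3]` and no manual splitting of the header or data lines is needed.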
