Merge CSV files using join/awk/sed


Question

Could you please help me find a bash command that will join/merge the following CSV files, "template.csv + file1.csv + file2.csv + file3.csv + ... + fileX.csv", into "output.csv"?

For each line in template.csv, append the associated values (if they exist) listed in each fileX.csv, as shown below:

template.csv:

header
1
2
3
4
5
6
7
8
9

file1.csv:

header,value1
2,value12
3,value13
7,value17
8,value18
9,value19

file2.csv:

header,value2
1,value21
2,value22
3,value23
4,value24

file3.csv:

header,value3
2,value32
4,value34
6,value36
7,value37
8,value38

output.csv:

header,value1,value2,value3
1,,value21,
2,value12,value22,value32
3,value13,value23,
4,,value24,value34
5,,,
6,,,value36
7,value17,,value37
8,value18,,value38
9,value19,,

My template file contains 35137 lines.
I have already written a bash script that does this merge (based on "do while" loops, etc.), but the performance is poor: it takes far too long to produce output.csv. I'm sure the same result can be obtained with join, awk, ... but I don't see how.

IMPORTANT UPDATE

The first column of my real files contains a datetime, not a simple number, so the script must take into account the space between the date and the time. Sorry for the update!

The script should now be designed with the CSV files below as the example:

template.csv:

header
2000-01-01 00:00:00
2000-01-01 00:15:00
2000-01-01 00:30:00
2000-01-01 00:45:00
2000-01-01 01:00:00
2000-01-01 01:15:00
2000-01-01 01:30:00
2000-01-01 01:45:00
2000-01-01 02:00:00

file1.csv:

header,value1
2000-01-01 00:15:00,value12
2000-01-01 00:30:00,value13
2000-01-01 01:30:00,value17
2000-01-01 01:45:00,value18
2000-01-01 02:00:00,value19

file2.csv:

header,value2
2000-01-01 00:00:00,value21
2000-01-01 00:15:00,value22
2000-01-01 00:30:00,value23
2000-01-01 00:45:00,value24

file3.csv:

header,value3
2000-01-01 00:15:00,value32
2000-01-01 00:45:00,value34
2000-01-01 01:15:00,value36
2000-01-01 01:30:00,value37
2000-01-01 01:45:00,value38

output.csv:

header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

Recommended answer

You could use multiple calls to join:

join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv

Or, more readably:

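# A shell function is used rather than an alias because bash does not expand aliases in scripts by default
myjoin() { join -t , -a 1 -o auto "$@"; }
myjoin template.csv file1.csv | myjoin - file2.csv | myjoin - file3.csv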

Explanation:

  • -t , sets the field separator to a comma (,)
  • -a 1 also prints the unpairable lines from the first file (this assumes template.csv contains every key that can appear in the other files)
  • -o auto (a GNU join extension) takes the output format from the first line of each file and is necessary to print the empty fields
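The same pipeline can be generalized to any number of data files (file1.csv ... fileX.csv) with a loop. The following is only a minimal sketch and is not part of the original answer. The function name merge_csv is hypothetical. It assumes GNU join (needed for -o auto and for --header, which is added here so that the header lines are passed through without being paired or order-checked), and it assumes every input is already sorted on its first column.

```shell
#!/usr/bin/env bash
# merge_csv TEMPLATE FILE...   (hypothetical helper, not from the original answer)
# Left-joins each FILE onto TEMPLATE, one file at a time, and prints the result.
# Assumptions: GNU join; all inputs sorted on the first comma-separated column.
merge_csv() {
  local template=$1; shift
  local acc tmp f
  acc=$(mktemp) || return 1
  tmp=$(mktemp) || return 1
  cp -- "$template" "$acc"
  for f in "$@"; do
    # --header: treat line 1 of each input as a header and print it without pairing it
    # LC_ALL=C: compare keys byte-wise, so inputs sorted with LC_ALL=C sort are accepted
    LC_ALL=C join --header -t , -a 1 -o auto "$acc" "$f" > "$tmp" \
      || { rm -f -- "$acc" "$tmp"; return 1; }
    mv -- "$tmp" "$acc"   # the joined result becomes the left side of the next join
  done
  cat -- "$acc"
  rm -f -- "$acc" "$tmp"
}
```

Usage could look like `merge_csv template.csv file1.csv file2.csv file3.csv > output.csv`. The value columns appear in the order the files are given, so with a glob such as file*.csv note that file10.csv expands before file2.csv.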

Proof:

$ join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

Note:

For the join approach to work, the files MUST be sorted on the join field (the first column, labelled "header" in your case). If they are not, sort them first with the sort command.
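Sorting a CSV file naively would also move its header line, so the header has to be kept aside. The following is a minimal sketch of one way to do that. The function name sort_csv_keep_header is hypothetical, and LC_ALL=C is an assumption made here so that the ordering is byte-wise (join should then be run under the same locale).

```shell
# sort_csv_keep_header FILE   (hypothetical helper)
# Prints FILE with its first line untouched and the remaining lines
# sorted on the first comma-separated field.
sort_csv_keep_header() {
  head -n 1 -- "$1"                                # header line, unchanged
  tail -n +2 -- "$1" | LC_ALL=C sort -t , -k 1,1   # data lines, sorted on column 1
}
```

For example, `sort_csv_keep_header file1.csv > file1.sorted.csv` would produce a file suitable as input to join. Timestamps in the form "2000-01-01 00:15:00" sort chronologically under a plain byte-wise sort, so no date parsing is needed.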

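Since the question also mentions awk, a hash-based merge is another option that does not depend on the inputs being sorted. This is a different technique from the join pipeline in the answer, shown only as a sketch. The function name merge_csv_awk is hypothetical, and it assumes that every data file is non-empty, has exactly two columns (key, value), and contains no quoted commas. All values are held in memory, which should be acceptable for a template of about 35000 lines.

```shell
# merge_csv_awk TEMPLATE FILE...   (hypothetical helper, not from the original answer)
# Prints one line per TEMPLATE line, followed by the value found for that key
# in each FILE (empty if the key is absent). Inputs do not need to be sorted.
merge_csv_awk() {
  awk -F, -v OFS=, '
    FNR == 1   { nfile++ }                    # first line of a new input file
    nfile == 1 { keys[++n] = $1; next }       # template: remember keys in file order
               { val[nfile, $1] = $2 }        # data file: store value by (file, key)
    END {
      for (i = 1; i <= n; i++) {
        line = keys[i]
        for (f = 2; f <= nfile; f++)
          line = line OFS val[f, keys[i]]     # a missing entry yields an empty field
        print line
      }
    }
  ' "$@"
}
```

Usage could look like `merge_csv_awk template.csv file1.csv file2.csv file3.csv > output.csv`. The header line needs no special handling here because "header" is treated as just another key.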
