Merging very large CSV files with a common column


Problem description

For example, I have two CSV files. 0.csv:

100a,a,b,c,c
200a,b,c,c,c
300a,c,d,c,c


and 1.csv:

100a,Emma,Thomas
200a,Alex,Jason
400a,Sanjay,Gupta
500a,Nisha,Singh

and I want the output to be like this:

100a,a,b,c,c,Emma,Thomas
200a,b,c,c,c,Alex,Jason
300a,c,d,c,c,0,0
400a,0,0,0,0,Sanjay,Gupta
500a,0,0,0,0,Nisha,Singh

How do I do that in a Unix shell script or Perl? I know the Unix join command, and that would work well with small files. For example, to get my result I could just do

join -t , -a 1 -a 2 -1 1 -2 1 -o 0,1.2,1.3,1.4,1.5,2.2,2.3 -e "0" 0.csv 1.csv

but that is not feasible for my purposes, since my actual data files have more than a million columns (total data size in the gigabytes), and thus my Unix command would also be more than a million characters long. This might be the most important headache, as inefficient code gets bogged down quite fast.
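As an aside, the -o field list could at least be generated rather than typed by hand. The helper below is only a hypothetical sketch; the script name make_o_list.pl and the assumption that the first row of each file shows the full column count are mine, and with a million columns the generated argument would still be enormous, so it removes the typing problem but not the efficiency one.

#!/usr/bin/env perl
# make_o_list.pl (hypothetical): print a join(1) -o field list covering every
# non-key column of 0.csv and 1.csv, so it does not have to be typed by hand.
use strict;
use warnings;

my @spec    = ('0');                     # "0" stands for the join key itself
my $file_no = 0;
for my $file ('0.csv', '1.csv') {
    $file_no++;
    open my $fh, '<', $file or die "cannot open $file: $!";
    my $first = <$fh>;
    close $fh;
    chomp $first;
    my @cols = split /,/, $first, -1;    # columns of the first row
    push @spec, map { "$file_no.$_" } 2 .. scalar @cols;
}
print join(',', @spec), "\n";

It would then be used as join -t , -a 1 -a 2 -1 1 -2 1 -o "$(perl make_o_list.pl)" -e "0" 0.csv 1.csv.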

Also note that I need the placeholder character "0" whenever there is missing data. This prevents me from simply using this:

join -t , -a 1 -a 2 -1 1 -2 1 0.csv 1.csv

I'm also a beginner Perl programmer, so some details would be really welcome. I'd prefer the solution to be Perl or a shell script, but really anything that works would be fine.
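If a plain Perl route is still of interest, below is a minimal sketch, not a tested solution: it merges the two files on the first column and pads missing data with "0". It assumes 1.csv is small enough to hold in memory while 0.csv is streamed, and that fields never contain embedded commas (otherwise a proper CSV parser such as Text::CSV would be safer); if neither file fits in memory, a sort-based approach like the one in the recommended answer below is the better fit. The script name merge.pl is mine.

#!/usr/bin/env perl
# merge.pl (sketch): merge 0.csv and 1.csv on the first column,
# padding missing data with "0".
use strict;
use warnings;

my ($big, $small) = ('0.csv', '1.csv');

# Load the small file into a hash keyed by the first column.
my (%rows, $small_width);
open my $sfh, '<', $small or die "cannot open $small: $!";
while (my $line = <$sfh>) {
    chomp $line;
    my ($key, @rest) = split /,/, $line, -1;
    $rows{$key}  = \@rest;
    $small_width = scalar @rest;
}
close $sfh;

# Stream the big file, appending matching columns or "0" placeholders.
my $big_width = 0;
open my $bfh, '<', $big or die "cannot open $big: $!";
while (my $line = <$bfh>) {
    chomp $line;
    my ($key, @rest) = split /,/, $line, -1;
    $big_width = scalar @rest;    # assumed uniform column count in 0.csv
    my @extra = exists $rows{$key} ? @{ delete $rows{$key} }
                                   : ('0') x $small_width;
    print join(',', $key, @rest, @extra), "\n";
}
close $bfh;

# Keys that appeared only in the small file get "0" placeholders on the left.
for my $key (sort keys %rows) {
    print join(',', $key, ('0') x $big_width, @{ $rows{$key} }), "\n";
}

Run as perl merge.pl > merged.csv; rows from 0.csv keep their original order, and rows found only in 1.csv are appended at the end, which matches the sample output above.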

Recommended answer

If you can add a header to each file, then you could use tabulator to solve the problem. Example:

0.csv:

key,letter_1,letter_2,letter_3,letter_4
100a,a,b,c,c
200a,b,c,c,c
300a,c,d,c,c

1.csv:

key,name_1,name_2
100a,Emma,Thomas
200a,Alex,Jason
400a,Sanjay,Gupta
500a,Nisha,Singh

Then tbljoin -lr -n 0 0.csv 1.csv produces:

key,letter_1,letter_2,letter_3,letter_4,name_1,name_2
100a,a,b,c,c,Emma,Thomas
200a,b,c,c,c,Alex,Jason
300a,c,d,c,c,0,0
400a,0,0,0,0,Sanjay,Gupta
500a,0,0,0,0,Nisha,Singh

Note that (in contrast to the pure Unix join command) the input files don't need to be sorted; also, you don't need to worry about memory consumption, since the implementation is based on Unix sort and will resort to a file-based merge sort for large files.

