How do I slice a single CSV file into several smaller ones grouped by a field?


Problem description

I have a large data set from the World Bank Millennium Development Goals as a CSV. The data looks like this:

Country Code   Country Name   Indicator
ABW            Aruba          % Forest coverage
ADO            Andorra        % Forest coverage
AFG            Afghanistan    % Forest coverage
...
ABW            Aruba          % Literacy rate
ADO            Andorra        % Literacy rate
AFG            Afghanistan    % Literacy rate
...
ABW            Aruba          % Another indicator
ADO            Andorra        % Another indicator
AFG            Afghanistan    % Another indicator

The file is currently 8.2 MB. I'm going to build a web interface for this data, and I'd like to slice the data by country so that an Ajax request can load an individual CSV for each country.

I'm lost on how to do this programmatically or with any tool. I don't necessarily need Python, but it's what I know best. I don't need a complete script either; a general pointer on how to approach the problem would be appreciated.

The original data source I'm working with is located here:

http://duopixel.com/stack/data.csv

Recommended answer

One-liner:

awk -F "," 'NF>1 && NR>1 {print $0 >> ("data_" $1 ".csv"); close("data_" $1 ".csv")}' data.csv

This creates new files named data_ABW.csv, etc., containing the appropriate rows. The NF>1 condition skips blank lines and lines with only one field, and the NR>1 part skips the header line. Then, for each remaining line, it appends the entire line ($0) to the file named data_$1.csv, where $1 is replaced with the text in the first column of that line. Finally, the close statement makes sure there aren't too many open files. If you don't have that many countries, you can drop the close call, which speeds the command up significantly.
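Since the question mentions Python, the same approach can be sketched with the standard csv module. This is only a rough equivalent of the one-liner, not a replacement for it: the function name split_by_first_column and the out_dir parameter are made up for illustration, and the input is assumed to be UTF-8 and comma-separated.

```python
import csv
from pathlib import Path


def split_by_first_column(source, out_dir="."):
    """Append each data row of `source` to data_<first column>.csv in `out_dir`.

    Hypothetical helper mirroring the awk one-liner; assumes UTF-8 input.
    """
    out_dir = Path(out_dir)
    with open(source, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        next(reader, None)  # skip the header line, like NR>1
        for row in reader:
            if len(row) < 2:  # skip blank or one-field lines, like NF>1
                continue
            target = out_dir / f"data_{row[0]}.csv"
            # Open in append mode and close right away, mirroring >> and close()
            with open(target, "a", newline="", encoding="utf-8") as dst:
                csv.writer(dst).writerow(row)
```

One difference from the awk version: csv.reader understands quoted fields, so a value that contains a comma inside quotes stays in one column, whereas -F "," splits on every comma.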

In answer to @Lenwood's comment below, to include the header in each output file, you can do this:

awk -F "," 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$1]) {print header >> ("data_" $1 ".csv"); files[$1]=1}; print $0 >> ("data_" $1 ".csv"); close("data_" $1 ".csv")}' data.csv

(Depending on your shell, you may have to escape the exclamation point.) The first new part, NR==1 {header=$0};, just stores the first line of the input file in the variable header. The other new part, if(! files[$1]) ... files[$1]=1};, uses the associative array files to keep track of whether the header has already been written to a given file; if it has not, the header is written there first.
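The header-tracking idea carries over to Python in much the same way. In the sketch below, a set named seen plays the role of the awk files array. The function name split_with_header and the out_dir parameter are hypothetical, and UTF-8 input is assumed.

```python
import csv
from pathlib import Path


def split_with_header(source, out_dir="."):
    """Split `source` by its first column, writing the header once per file.

    Hypothetical helper; returns the set of first-column values seen.
    """
    out_dir = Path(out_dir)
    seen = set()  # keys whose file already received the header
    with open(source, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)  # like NR==1 {header=$0}
        if header is None:  # empty input: nothing to split
            return seen
        for row in reader:
            if len(row) < 2:  # like NF>1
                continue
            key = row[0]
            target = out_dir / f"data_{key}.csv"
            with open(target, "a", newline="", encoding="utf-8") as dst:
                writer = csv.writer(dst)
                if key not in seen:
                    # First row for this key in this run: emit the header first
                    writer.writerow(header)
                    seen.add(key)
                writer.writerow(row)
    return seen
```

As with the awk command, the files are opened in append mode, so the header is written once per run rather than once per file lifetime.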

Note that this appends to the files, so if they already exist, the new rows are simply added to the end. Therefore, if you get new data in your main file, you'll probably want to delete the per-country files before you run this command again.
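If the split is re-run regularly, the cleanup step can be scripted as well. The following is a small sketch, assuming the output files follow the data_<key>.csv naming used above and live in a single directory; the function name remove_old_outputs and its parameters are made up for illustration.

```python
from pathlib import Path


def remove_old_outputs(out_dir=".", pattern="data_*.csv"):
    """Delete previously generated per-country files.

    Hypothetical helper; returns the names of the files it removed.
    """
    removed = []
    for path in sorted(Path(out_dir).glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

The pattern requires an underscore after data, so the source file data.csv is left alone. Any unrelated file that happens to match data_*.csv in that directory would be deleted too, so a dedicated output directory is the safer choice.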

(In case it's not obvious, if you want the files to be named like data_Aruba.csv, you can change $1 to $2.)
