Splitting CSV file and excluding column in output using bash, sed or awk

Problem description

I have a CSV file which contains data like the following:

1,275,,,275,17.3,0,"2011-05-09 20:21:45"
2,279,,,279,17.3,0,"2011-05-10 20:21:52"
3,276,,,276,17.3,0,"2011-05-11 20:21:58"
4,272,,,272,17.3,0,"2011-05-12 20:22:04"
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
7,270,,,270,17.3,0,"2011-05-13 20:24:14"
8,269,,,269,17.3,0,"2011-05-14 20:24:20"
9,278,,,278,17.3,0,"2011-05-14 20:24:26"

This file contains 4432986 rows of data.

I wish to split the file into new files, naming each one after the date in the last column.

Therefore, based on the data above, I would want 6 new files, each containing the rows for one day.

I would like the files named in YYYY_MM_DD format.

I would also like to leave the first column out of the output data.

So file 2011_05_13 would contain the following rows, with the first column excluded:

272,,,272,17.3,0,"2011-05-13 20:22:10"
278,,,278,17.3,0,"2011-05-13 20:24:08"
270,,,270,17.3,0,"2011-05-13 20:24:14"

I am planning on doing this on a Linux box, so anything using standard Linux utilities (sed, awk, etc.) would be great.

Recommended answer

Here's a one-liner for you in awk:

awk -F, '{ split ($8,array," "); sub ("\"","",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt

Desired output achieved, although perhaps some of this code could be made more succinct. HTH.
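
To try the one-liner without touching the full 4432986-row file, the sketch below runs it on the nine sample rows from the question. The scratch directory created with mktemp is an assumption made for illustration; the awk program is the same one as above, only spaced out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so the generated files are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir"

# The nine sample rows from the question.
cat > file.txt <<'EOF'
1,275,,,275,17.3,0,"2011-05-09 20:21:45"
2,279,,,279,17.3,0,"2011-05-10 20:21:52"
3,276,,,276,17.3,0,"2011-05-11 20:21:58"
4,272,,,272,17.3,0,"2011-05-12 20:22:04"
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
7,270,,,270,17.3,0,"2011-05-13 20:24:14"
8,269,,,269,17.3,0,"2011-05-14 20:24:20"
9,278,,,278,17.3,0,"2011-05-14 20:24:26"
EOF

# Same program as the one-liner in the answer.
awk -F, '{ split($8, array, " "); sub("\"", "", array[1]); sub(NR, "", $0); sub(",", "", $0); print $0 > array[1] }' file.txt

# List the per-day files and show one of them.
ls 2011-*
cat 2011-05-13
```

With the sample rows this should leave six files, named with the date exactly as it appears in the data (2011-05-09 through 2011-05-14, with hyphens).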

Edit:

Read the code from left to right:


  • -F ","
    Yes, this sets the field delimiter to a comma.

  • split ($8,array," ")
    This splits the eighth column on the space and puts the pieces in an array named array.

  • sub ("\"","",array[1])
    We take the first array element (this is the slice that's going to become our output file name) and substitute out the leading " symbol. (We need to escape the " symbol, so we put the \ character in front of it.)

  • sub (NR,"",$0)
    This conveniently removes the row number from the beginning of each line (NR is the row number and $0 is of course the whole line of input before delimitation). It relies on the first column being the same as the row number, as it is in the sample data.

  • sub (",","",$0)
    This removes the comma after the row number.

Now that we have a clean file name and a clean row of data, we can write $0 to the file named by array[1]: print $0 > array[1].
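
The walkthrough above can be traced on a single row. This is a minimal sketch, assuming the first sample row from the question as input; the labels printed before each value (split, quote, NR, comma) are illustrative and not part of the original answer.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Feed one row to awk and print the intermediate value after each step.
trace=$(printf '%s\n' '1,275,,,275,17.3,0,"2011-05-09 20:21:45"' |
awk -F, '{
    split($8, array, " ");    print "split: " array[1]   # date part, still with the leading quote
    sub("\"", "", array[1]);  print "quote: " array[1]   # leading quote removed
    sub(NR, "", $0);          print "NR: " $0            # row number removed (NR is 1 here)
    sub(",", "", $0);         print "comma: " $0         # first comma removed
}')

printf '%s\n' "$trace"
```

The last line of the trace is the row as it would be written to the output file, and the "quote" line is the file name it would be written to.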

FIX:

So if you'd prefer an underscore instead of a hyphen, all we need to fix is array[1]. I've just added a global substitution: gsub ("-","_",array[1]).

The updated code is:

awk -F, '{ split ($8,array," "); sub ("\"","",array[1]); gsub ("-","_",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt

HTH.
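
The answer notes that the code could perhaps be made more succinct. One possible variant, offered as a sketch and not as the answer author's code, takes the date with substr and strips the first column with an anchored pattern, so it does not depend on the first column matching NR. The input name subset.txt and the scratch directory are hypothetical; the rows are rows 5 to 9 of the question's sample, chosen so that the first column and NR differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory (an assumption for this demo).
workdir=$(mktemp -d)
cd "$workdir"

# Rows 5-9 of the sample: the first column no longer equals awk's NR.
cat > subset.txt <<'EOF'
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
7,270,,,270,17.3,0,"2011-05-13 20:24:14"
8,269,,,269,17.3,0,"2011-05-14 20:24:20"
9,278,,,278,17.3,0,"2011-05-14 20:24:26"
EOF

awk -F, '{
    day = substr($NF, 2, 10)   # 10 characters after the opening quote: the date
    gsub("-", "_", day)        # 2011-05-13 becomes 2011_05_13
    sub(/^[^,]*,/, "")         # drop everything up to and including the first comma
    print > day                # write the trimmed row to the file for that day
}' subset.txt

cat 2011_05_13
```

With these five rows it should write three lines to 2011_05_13 and two to 2011_05_14. It assumes the timestamp is always the last field and always starts with a double quote followed by a YYYY-MM-DD date, as in the sample.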
