基于列的值用头和打印输出文件AWK拆分大的CSV文件 [英] AWK Split large CSV file with headers and print output files based on column value
问题描述
我身边有800 MB,我需要分割使用AWK的CSV文件。
该文件具有的ID在其中,我想用分割的文件中的列。
我熟悉/了解如何使用Perl做到这一点,但不与AWK,因为我只用了几次。
(在Perl中我会使用Text :: CSV模块,但我并没有在这种情况下,选购件)
我发现了这样的回答: http://stackoverflow.com/a/16795137 这基本上是我想要什么,但有小改变。它包含一个if语句,所以如果我想拆它的列是一个数字它只会打印。这是必要的,因为文件栏有时会转变,我想非数字线发送到一个单独的文件(junk.csv)。
我使用CMD版本进行测试,现在的窗户,但我最终会在Linux上运行它。 (下面原来的code)
awk的-F,NR == 1 {HDR = $ 0;接下来}!($ 3档){文件[$ 3] = 1;打印HDR \\\\ n \\ > $ 3 \\} {打印&GTCSV \\;。CSV \\$ 3 \\}test.csv
和我的意图是这样的:
的awk -F; {如果($ 3〜/ ^ [0-9] + $ /){NR == 1 {HDR = $ 0;!接下来}($ 3档){文件[$ 3] = 1;打印HDR> $ 3 \\ .CSV \\} {打印> $ 3 \\。CSV \\}test.csv
我无法弄清楚如何在AWK做到这一点(只是还没有)。双引号也扔我送行(因为Windows版本)。我在哪里去了?
这是我的错误输出:
AWK:{如果($ 3〜/ ^ [0-9] + $ /)NR == 1 {HDR = $ 0;接下来}(在文件$ 3){文件[! $ 3] = 1;打印HDR> $ 3的.csv} {打印> $ 3.csv};否则打印> junk.csv}
AWK:^语法错误
AWK:{如果($ 3〜/ ^ [0-9] + $ /)NR == 1 {HDR = $ 0;接下来}($ 3档){文件[$ 3] = 1;打印HDR> $ 3的.csv } {打印> $ 3.csv};否则打印> junk.csv}
AWK:^语法错误
AWK:{如果($ 3〜/ ^ [0-9] + $ /)NR == 1 {HDR = $ 0;接下来}($ 3档){文件[$ 3] = 1;打印HDR> $ 3的.csv } {打印> $ 3.csv};否则打印> junk.csv}
AWK:^语法错误
AWK:{如果($ 3〜/ ^ [0-9] + $ /)NR == 1 {HDR = $ 0;接下来}($ 3档){文件[$ 3] = 1;打印HDR> $ 3的.csv } {打印> $ 3.csv};否则打印> junk.csv}
AWK:^语法错误
errcount:4
这是我的(样品)数据:
10002394; 22.98; 48; HTTP://testdata.com/bla/29012827.jpg; 5.95; 93962094820
10003062; 19.99; 26; HTTP://testdata.com/bla/29002816.jpg; 5.95; 17012725049
10003122; 13.0; 53; HTTP://testdata.com/bla/29019899.jpg; 5.95; 24404000059
10004766; 12.99; 48; HTTP://testdata.com/bla/29007085.jpg; 5.95; 95074666117
10007645; 20.99; 65; HTTP://testdata.com/bla/28798580.jpg; 5.95; 10201848233
10009363; 119.0; 53; HTTP://testdata.com/bla/29004907.jpg; 5.95; 9823036360
10009631; 19.95; 48; HTTP://testdata.com/bla/29013097.jpg; 5.95; 20689058198
10010119; 9.99; 48; HTTP://testdata.com/bla/29016592.jpg; 5.95; 80076014280
10012615; 20.99; 53; HTTP://testdata.com/bla/28772382.jpg; 5.95; 3948187983
10015250; 14.99; 48; HTTP://testdata.com/bla/29015812.jpg; 5.95; 93962045440
10019190; 69.99; 53; HTTP://testdata.com/bla/29010968.jpg; 5.95; 948187983
10025155; 27.99; 65; HTTP://testdata.com/bla/29011075.jpg; 5.95; 14201021349
10025825; 12.99; 65; HTTP://testdata.com/bla/29017837.jpg; 5.95; 93962025367
10029650; 27.99; 48; HTTP://testdata.com/bla/29003007.jpg; 5.95; 3692164452
10034957; 34.99; 53; HTTP://testdata.com/bla/29000529.jpg; 5.95; 42872898825
10041967; 24.99; 65; HTTP://testdata.com/bla/28781700.jpg; 5.95; 91229911080
10045277; 59.99; 65; HTTP://testdata.com/bla/29010583.jpg; 5.95; 67365082290
10045795; 10.99; 48; HTTP://testdata.com/bla/29002819.jpg; 5.95; 19422308188
10048375; 26.99; 26; HTTP://testdata.com/bla/29002270.jpg; 5.95; 95082912275
10052550; 19.99; 48; HTTP://testdata.com/bla/29016347.jpg; 5.95; 7368425436
和我想做到这一点:
文件 - > 26.csv
10003062; 19.99; 26; HTTP://testdata.com/bla/29002816.jpg; 5.95; 17012725049
10048375; 26.99; 26; HTTP://testdata.com/bla/29002270.jpg; 5.95; 95082912275文件 - > 48.csv
10002394; 22.98; 48; HTTP://testdata.com/bla/29012827.jpg; 5.95; 93962094820
10004766; 12.99; 48; HTTP://testdata.com/bla/29007085.jpg; 5.95; 95074666117
10009631; 19.95; 48; HTTP://testdata.com/bla/29013097.jpg; 5.95; 20689058198
10010119; 9.99; 48; HTTP://testdata.com/bla/29016592.jpg; 5.95; 80076014280
10015250; 14.99; 48; HTTP://testdata.com/bla/29015812.jpg; 5.95; 93962045440
10029650; 27.99; 48; HTTP://testdata.com/bla/29003007.jpg; 5.95; 3692164452
10045795; 10.99; 48; HTTP://testdata.com/bla/29002819.jpg; 5.95; 19422308188
10052550; 19.99; 48; HTTP://testdata.com/bla/29016347.jpg; 5.95; 7368425436文件 - > 53.csv
10003122; 13.0; 53; HTTP://testdata.com/bla/29019899.jpg; 5.95; 24404000059
10009363; 119.0; 53; HTTP://testdata.com/bla/29004907.jpg; 5.95; 9823036360
10012615; 20.99; 53; HTTP://testdata.com/bla/28772382.jpg; 5.95; 3948187983
10019190; 69.99; 53; HTTP://testdata.com/bla/29010968.jpg; 5.95; 948187983
10034957; 34.99; 53; HTTP://testdata.com/bla/29000529.jpg; 5.95; 42872898825文件 - > 65.csv
10007645; 20.99; 65; HTTP://testdata.com/bla/28798580.jpg; 5.95; 10201848233
10025155; 27.99; 65; HTTP://testdata.com/bla/29011075.jpg; 5.95; 14201021349
10025825; 12.99; 65; HTTP://testdata.com/bla/29017837.jpg; 5.95; 93962025367
10041967; 24.99; 65; HTTP://testdata.com/bla/28781700.jpg; 5.95; 91229911080
10045277; 59.99; 65; HTTP://testdata.com/bla/29010583.jpg; 5.95; 67365082290
您可以简化 AWK
为
的awk -F \\; {打印> $ 3的.csv}'输入
将产生以下的内容 CSV
文件
26.csv
10003062; 19.99; 26; HTTP://testdata.com/bla/29002816.jpg; 5.95; 17012725049
10048375; 26.99; 26; HTTP://testdata.com/bla/29002270.jpg; 5.95; 95082912275
48.csv
10002394; 22.98; 48; HTTP://testdata.com/bla/29012827.jpg; 5.95; 93962094820
10004766; 12.99; 48; HTTP://testdata.com/bla/29007085.jpg; 5.95; 95074666117
10009631; 19.95; 48; HTTP://testdata.com/bla/29013097.jpg; 5.95; 20689058198
10010119; 9.99; 48; HTTP://testdata.com/bla/29016592.jpg; 5.95; 80076014280
10015250; 14.99; 48; HTTP://testdata.com/bla/29015812.jpg; 5.95; 93962045440
10029650; 27.99; 48; HTTP://testdata.com/bla/29003007.jpg; 5.95; 3692164452
10045795; 10.99; 48; HTTP://testdata.com/bla/29002819.jpg; 5.95; 19422308188
10052550; 19.99; 48; HTTP://testdata.com/bla/29016347.jpg; 5.95; 7368425436
53.csv
10003122; 13.0; 53; HTTP://testdata.com/bla/29019899.jpg; 5.95; 24404000059
10009363; 119.0; 53; HTTP://testdata.com/bla/29004907.jpg; 5.95; 9823036360
10012615; 20.99; 53; HTTP://testdata.com/bla/28772382.jpg; 5.95; 3948187983
10019190; 69.99; 53; HTTP://testdata.com/bla/29010968.jpg; 5.95; 948187983
10034957; 34.99; 53; HTTP://testdata.com/bla/29000529.jpg; 5.95; 42872898825
65.csv
10007645; 20.99; 65; HTTP://testdata.com/bla/28798580.jpg; 5.95; 10201848233
10025155; 27.99; 65; HTTP://testdata.com/bla/29011075.jpg; 5.95; 14201021349
10025825; 12.99; 65; HTTP://testdata.com/bla/29017837.jpg; 5.95; 93962025367
10041967; 24.99; 65; HTTP://testdata.com/bla/28781700.jpg; 5.95; 91229911080
10045277; 59.99; 65; HTTP://testdata.com/bla/29010583.jpg; 5.95; 67365082290
注意
如果你想发送的有3列非位数 junk.csv
在上面AWK小改线可以是有益的。
的awk -F \\; '$ 3〜/ ^ [0-9] + $ / {打印> $ 3.csv文件;接下来} {打印> junk.csv}'输入
-
$ 3〜/ ^ [0-9] + $ /
在执行第3列正则表达式匹配,如果匹配,发送给相应的CSV文件。否则该行被写入junk.csv
或
一个更简单的版本,像
的awk -F \\; '{文件= $ 3〜/ ^ [0-9] + / $ 3:垃圾;打印>文件名为.csv}
由于Jidder的意见。
I have a CSV file of around 800 mb which I need to split up using AWK. The file has a column with ID's in them which I want to use to split the file on. I'm familiar/know how to accomplish this with Perl but not with AWK since I've only used it a few times. (In perl I would use the Text::CSV module but I don't have the option in this case)
I found this answer: http://stackoverflow.com/a/16795137 which is basically what I want but with a small alteration. It has to contain an if statement so it will only print if the column I want to split it on is a digit. This is necessary because the file column can shift sometimes and I want to send the non-digit lines to a seperate file (junk.csv).
I'm using the windows cmd version for testing right now but I'll eventually run it on linux. (Below original code)
awk -F, "NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr\"\n\">$3\".csv\"}{print>$3\".csv\"}" test.csv
And my intention is this:
awk -F";" "{if ($3 ~ /^[0-9]+$/){"NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3\".csv\"}{print>$3\".csv\"}"" test.csv
I can't figure out how to do this in AWK (just yet). The double quotes are also throwing me off (because of the windows version). Where am I going wrong?
This is my error output:
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk: ^ syntax error
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk: ^ syntax error
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk: ^ syntax error
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk: ^ syntax error
errcount: 4
This is my (sample) data:
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10004766;12.99;48;http://testdata.com/bla/29007085.jpg;5.95;95074666117
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10009363;119.0;53;http://testdata.com/bla/29004907.jpg;5.95;9823036360
10009631;19.95;48;http://testdata.com/bla/29013097.jpg;5.95;20689058198
10010119;9.99;48;http://testdata.com/bla/29016592.jpg;5.95;80076014280
10012615;20.99;53;http://testdata.com/bla/28772382.jpg;5.95;3948187983
10015250;14.99;48;http://testdata.com/bla/29015812.jpg;5.95;93962045440
10019190;69.99;53;http://testdata.com/bla/29010968.jpg;5.95;948187983
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10034957;34.99;53;http://testdata.com/bla/29000529.jpg;5.95;42872898825
10041967;24.99;65;http://testdata.com/bla/28781700.jpg;5.95;91229911080
10045277;59.99;65;http://testdata.com/bla/29010583.jpg;5.95;67365082290
10045795;10.99;48;http://testdata.com/bla/29002819.jpg;5.95;19422308188
10048375;26.99;26;http://testdata.com/bla/29002270.jpg;5.95;95082912275
10052550;19.99;48;http://testdata.com/bla/29016347.jpg;5.95;7368425436
And I want to accomplish this:
File --> 26.csv
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10048375;26.99;26;http://testdata.com/bla/29002270.jpg;5.95;95082912275
File --> 48.csv
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10004766;12.99;48;http://testdata.com/bla/29007085.jpg;5.95;95074666117
10009631;19.95;48;http://testdata.com/bla/29013097.jpg;5.95;20689058198
10010119;9.99;48;http://testdata.com/bla/29016592.jpg;5.95;80076014280
10015250;14.99;48;http://testdata.com/bla/29015812.jpg;5.95;93962045440
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10045795;10.99;48;http://testdata.com/bla/29002819.jpg;5.95;19422308188
10052550;19.99;48;http://testdata.com/bla/29016347.jpg;5.95;7368425436
File --> 53.csv
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10009363;119.0;53;http://testdata.com/bla/29004907.jpg;5.95;9823036360
10012615;20.99;53;http://testdata.com/bla/28772382.jpg;5.95;3948187983
10019190;69.99;53;http://testdata.com/bla/29010968.jpg;5.95;948187983
10034957;34.99;53;http://testdata.com/bla/29000529.jpg;5.95;42872898825
File --> 65.csv
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
10041967;24.99;65;http://testdata.com/bla/28781700.jpg;5.95;91229911080
10045277;59.99;65;http://testdata.com/bla/29010583.jpg;5.95;67365082290
You can simplify the awk
as
awk -F\; '{print > $3".csv"}' input
Will produce the following csv
files with content
26.csv
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10048375;26.99;26;http://testdata.com/bla/29002270.jpg;5.95;95082912275
48.csv
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10004766;12.99;48;http://testdata.com/bla/29007085.jpg;5.95;95074666117
10009631;19.95;48;http://testdata.com/bla/29013097.jpg;5.95;20689058198
10010119;9.99;48;http://testdata.com/bla/29016592.jpg;5.95;80076014280
10015250;14.99;48;http://testdata.com/bla/29015812.jpg;5.95;93962045440
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10045795;10.99;48;http://testdata.com/bla/29002819.jpg;5.95;19422308188
10052550;19.99;48;http://testdata.com/bla/29016347.jpg;5.95;7368425436
53.csv
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10009363;119.0;53;http://testdata.com/bla/29004907.jpg;5.95;9823036360
10012615;20.99;53;http://testdata.com/bla/28772382.jpg;5.95;3948187983
10019190;69.99;53;http://testdata.com/bla/29010968.jpg;5.95;948187983
10034957;34.99;53;http://testdata.com/bla/29000529.jpg;5.95;42872898825
65.csv
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
10041967;24.99;65;http://testdata.com/bla/28781700.jpg;5.95;91229911080
10045277;59.99;65;http://testdata.com/bla/29010583.jpg;5.95;67365082290
NOTE
If you want to send the lines which have non digits in column 3 to junk.csv
a small change in the above awk can be helpfull
awk -F\; '$3 ~ /^[0-9]+$/{print > $3".csv"; next} {print > "junk.csv"}' input
$3 ~ /^[0-9]+$/
performs a regex match on column 3 and if it matches, sends to corresponding csv file. else the line is written tojunk.csv
OR
a much simpler version like
awk -F\; '{file=$3~/^[0-9]+$/?$3:"junk";print >file".csv"}'
Thanks to Jidder for the suggestion.
这篇关于基于列的值用头和打印输出文件AWK拆分大的CSV文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!