Split CSV to Multiple Files Containing a Set Number of Unique Field Values
As a beginner with awk, I am able to split the data by the unique values in the first column with

awk -F, '{print >> $1".csv"; close($1".csv")}' myfile.csv

But I would like to split a large CSV file based on an additional condition: the number of distinct values seen in a specific column.
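For concreteness, the per-key one-liner writes every distinct value in column 1 to its own file, named after the value itself. A minimal, self-contained sketch (`sample.csv` and its contents are made up for illustration):

```shell
# Build a tiny sample, then split it into one file per unique first field.
printf '111,a\n111,b\n222,c\n' > sample.csv

# Calling close() after every write keeps only one file handle open at a
# time, which matters when column 1 has many distinct values.
awk -F, '{ out = $1 ".csv"; print >> out; close(out) }' sample.csv
```

Afterwards `111.csv` holds the two `111` rows and `222.csv` holds the single `222` row.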
Specifically, with input
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
I would like the output files to be
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
and
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
each of which contains three (in this case) unique values in the first column: 111, 222, 333 and 444, 555, 666 respectively.
Any help would be appreciated.
This will do the trick and I find it pretty readable and easy to understand:
awk -F',' 'BEGIN { count=0; filename=1 }
x[$1]++==0 {count++}
count==4 { count=1; filename++}
{print >> filename".csv"; close(filename".csv");}' file
We start with our count at 0 and our filename at 1. We then count each unique value we get from the first column, and whenever it's the 4th one, we reset our count and move on to the next filename.
Here's some sample data I used, which is just yours with some additional lines.
~$ cat test.txt
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
777,1,1,1
777,1,0,1
777,1,1,0
777,1,1,1
888,1,0,1
888,1,1,1
999,1,1,1
999,0,0,0
999,0,0,1
101,0,0,0
102,0,0,0
And running the awk like so:
~$ awk -F',' 'BEGIN { count=0; filename=1 }
x[$1]++==0 {count++}
count==4 { count=1; filename++}
{print >> filename".csv"; close(filename".csv");}' test.txt
We see the following output files and content:
~$ cat 1.csv
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
~$ cat 2.csv
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
~$ cat 3.csv
777,1,1,1
777,1,0,1
777,1,1,0
777,1,1,1
888,1,0,1
888,1,1,1
999,1,1,1
999,0,0,0
999,0,0,1
~$ cat 4.csv
101,0,0,0
102,0,0,0
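The group size is hard-coded above (three unique keys per file, checked as `count==4`), but it can be lifted into a variable. Here is a sketch of the same logic with the group size passed via `-v`; the names `n` and `prefix` are my own additions, not part of the answer above:

```shell
# Recreate the question's input, then split it so each output file holds
# at most n distinct first-column values.
printf '111,1,0,1\n111,1,1,1\n222,1,1,1\n333,1,0,0\n333,1,1,1\n' >  test.txt
printf '444,1,1,1\n444,0,0,0\n555,1,1,1\n666,1,0,0\n'            >> test.txt

awk -F',' -v n=3 -v prefix=part '
    BEGIN       { count = 0; filename = 1 }
    !seen[$1]++ { count++ }                  # first sighting of this key
    count > n   { count = 1; filename++ }    # limit exceeded: next file
    { out = prefix filename ".csv"; print >> out; close(out) }
' test.txt
```

With `n=3` this reproduces the `1.csv`/`2.csv` split from the question's sample, just named `part1.csv` and `part2.csv`.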