Conversion of large .csv file to .prn (around 3.5 GB) in Ubuntu using bash
Problem description
I have a very large .csv file, about 3.5 GB, as I am dealing with big data, and I need to convert it to a .prn file, which separates the columns with a space delimiter.
Here are some sample input values from the file:
UNT,Gujarat,84716050,25669.69,UNITS,"QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3402030. FIS-0870-1004G.",INAMD4,M,2015-05-01,Ahmedabad,Import,MALAYSIA,1,274
UNT,Gujarat,84716050,25669.69,UNITS,"QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405176. FIS-0870-1004G.",INAMD4,M,2015-05-01,Ahmedabad,Import,MALAYSIA,1,275
UNT,Gujarat,84716050,25669.69,UNITS,"QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405181. FIS-0870-1004G.",INAMD4,M,2015-05-01,Ahmedabad,Import,MALAYSIA,1,276
KGS,Gujarat,29213090,187897.88,KILOGRAMS,MEMANTINE HYDROCHLORIDE. BATCH NO. 134614003,INAMD4,W,2015-05-01,Ahmedabad,Import,ITALY,5,277
Each line above is one row of the file, and the cells in a row are separated by commas. But in row 1 the field "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3402030. FIS-0870-1004G." itself contains several commas. So if I use the comma (,) as the delimiter, I will end up separating "QX-870" and "IND BARCODE SCANNER" and "SW RSTR" and "LD" and "SRL+ETHNT S/N.:3402030. FIS-0870-1004G.", which I don't want. I browsed the internet and found that Microsoft Excel can change the format of a file by saving it in a different format (I chose .prn, which solved my problem), but Excel cannot convert a file this big (3.5 GB). So I want my output to look something like this, i.e. row no. 1 on line 1, row no. 2 on line 2, and so on.
UNT Gujarat 84716050 25669.69 UNITS "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3402030. FIS-0870-1004G." INAMD4 M 2015-05-01 Ahmedabad Import MALAYSIA 1 274
UNT Gujarat 84716050 25669.69 UNITS "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405176. FIS-0870-1004G." INAMD4 M 2015-05-01 Ahmedabad Import MALAYSIA 1 275
UNT Gujarat 84716050 25669.69 UNITS "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405181. FIS-0870-1004G." INAMD4 M 2015-05-01 Ahmedabad Import MALAYSIA 1 276
KGS Gujarat 29213090 187897.88 KILOGRAMS MEMANTINE HYDROCHLORIDE. BATCH NO. 134614003 INAMD4 W 2015-05-01 Ahmedabad Import ITALY 5 277
It's not clear from your question, as you didn't provide sample input/output we could test against, but it SOUNDS like all you're trying to do is this:
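$ cat tst.awk
BEGIN {
    # sample output widths, one per field
    split("7 10 15 12 4",w)
    # a field is either a run of non-commas or a double-quoted string
    FPAT="[^,]*|\"[^\"]*\""
}
{
    # hide escaped quotes ("") behind RS, which cannot occur inside a record
    gsub(/""/,RS)
    for (i=1;i<=NF;i++) {
        gsub(/"/,"",$i)       # strip the enclosing quotes
        gsub(RS,"\"",$i)      # restore each escaped quote as a literal "
        # left-justify, padding or truncating to the field's width
        printf "<%-*s>", w[i], substr($i,1,w[i])
    }
    print ""
}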
$ cat file
abcde,"ab,c,de","ab ""c"" de","a,""b"",c",ab
abcdefghi,"xyab,c,de","xyzab ""c"" de",abc,abcdefg
$ awk -f tst.awk file
<abcde  ><ab,c,de   ><ab "c" de      ><a,"b",c     ><ab  >
<abcdefg><xyab,c,de ><xyzab "c" de   ><abc         ><abcd>
Obviously I added the < and > around each field just to make it clear where each field starts/ends; you'd remove that for your real application. I'm creating the array w to hold specific widths for each field as idk where you'd get those from otherwise.
The above uses GNU awk for FPAT, with other awks it'd be a while(match()) loop.
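For awks without FPAT, the while(match()) idea could look roughly like the sketch below. It is only an illustration, not the original answer's code: the function name csv_to_prn and the file names in the usage line are made up, it assumes a quoted field never spans more than one line, and it joins the fields with a single space (keeping the quotes, as in the desired output in the question) instead of padding to fixed widths.

```shell
#!/bin/sh
# csv_to_prn: hypothetical helper, not part of the original answer.
# Reads CSV from the named files (or stdin) and prints each record with
# its fields joined by single spaces. Uses only POSIX awk features.
csv_to_prn() {
    awk '
    {
        rest = $0      # part of the record still to be split
        out  = ""      # output line built so far
        n    = 0       # number of fields emitted so far
        while (1) {
            # Take a double-quoted field (which may hold commas and ""
            # escapes) if there is one, otherwise a run of non-commas.
            if (!match(rest, /^"([^"]|"")*"/))
                match(rest, /^[^,]*/)
            field = substr(rest, 1, RLENGTH)
            rest  = substr(rest, RLENGTH + 1)
            out   = (n++ == 0) ? field : out " " field
            if (rest == "")
                break
            # Drop the comma that separated this field from the next one.
            if (substr(rest, 1, 1) == ",")
                rest = substr(rest, 2)
        }
        print out
    }' "$@"
}
```

You'd call it as something like csv_to_prn big.csv > big.prn. awk reads one record at a time, so the file size should not be a memory problem. Note that an empty field just shows up as two adjacent spaces, and unquoted fields that contain spaces are not distinguishable from neighbouring fields in the output, so whether that is acceptable depends on what reads the .prn file.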