Conversion of large .csv file to .prn (around 3.5 GB) in Ubuntu using bash


Problem description


I have a very large .csv file, about 3.5 GB, because I am dealing with big data. I need to convert it to a .prn file, which separates the columns with a space delimiter.

Here are sample input values from the file:

UNT,Gujarat,84716050,25669.69,UNITS,"QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3402030. FIS-0870-1004G.",INAMD4,M,2015-05-01,Ahmedabad,Import,MALAYSIA,1,274

UNT,Gujarat,84716050,25669.69,UNITS,"QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405176. FIS-0870-1004G.",INAMD4,M,2015-05-01,Ahmedabad,Import,MALAYSIA,1,275

UNT,Gujarat,84716050,25669.69,UNITS,"QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405181. FIS-0870-1004G.",INAMD4,M,2015-05-01,Ahmedabad,Import,MALAYSIA,1,276

KGS,Gujarat,29213090,187897.88,KILOGRAMS,MEMANTINE HYDROCHLORIDE. BATCH NO. 134614003,INAMD4,W,2015-05-01,Ahmedabad,Import,ITALY,5,277

Each of the records above is one row of the file, and each cell is separated by a comma. However, in row 1 the field "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3402030. FIS-0870-1004G." itself contains several commas. So if I use the comma (,) as the delimiter, I will end up splitting it into "QX-870", "IND BARCODE SCANNER", "SW RSTR", "LD" and "SRL+ETHNT S/N.:3402030. FIS-0870-1004G.", which I don't want. Browsing the internet, I found that Microsoft Excel can change the format by saving the file in a different format (I chose .prn, which solved my problem), but Excel cannot convert files this large (3.5 GB). So I want my output to look something like this, i.e. row no. 1 on line 1, row no. 2 on line 2, and so on.

UNT Gujarat 84716050 25669.69 UNITS "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3402030. FIS-0870-1004G."
INAMD4 M 2015-05-01 Ahmedabad Import MALAYSIA 1
274

UNT Gujarat 84716050 25669.69 UNITS "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405176. FIS-0870-1004G."
INAMD4 M 2015-05-01 Ahmedabad Import MALAYSIA 1
275

UNT Gujarat 84716050 25669.69 UNITS "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405181. FIS-0870-1004G."
INAMD4 M 2015-05-01 Ahmedabad Import MALAYSIA 1
276

KGS Gujarat 29213090 187897.88 KILOGRAMS MEMANTINE HYDROCHLORIDE. BATCH NO. 134614003 INAMD4 W 2015-05-01
Ahmedabad Import ITALY 5 277

Solution

It's not clear from your question, as you didn't provide sample input/output we could test against, but it sounds like all you're trying to do is this:

$ cat tst.awk
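BEGIN {
    # w[i] holds the output width of field i (example widths)
    split("7 10 15 12 4",w)
    # a field is either a run of non-commas or a double-quoted string
    FPAT="[^,]*|\"[^\"]*\""
}
{
    # temporarily replace escaped quotes ("") with RS, which cannot occur inside a record
    gsub(/""/,RS)
    for (i=1;i<=NF;i++) {
        gsub(/"/,"",$i)       # strip the enclosing quotes
        gsub(RS,"\"",$i)      # restore the escaped quotes as literal "
        # left-justify, pad and truncate the field to w[i] characters
        printf "<%-*s>", w[i], substr($i,1,w[i])
    }
    print ""
}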

$ cat file
abcde,"ab,c,de","ab ""c"" de","a,""b"",c",ab
abcdefghi,"xyab,c,de","xyzab ""c"" de",abc,abcdefg

$ awk -f tst.awk file
<abcde  ><ab,c,de   ><ab "c" de      ><a,"b",c     ><ab  >
<abcdefg><xyab,c,de ><xyzab "c" de   ><abc         ><abcd>

Obviously I added the < and > around each field just to make it clear where each field starts and ends; you'd remove them for your real application. I'm creating the array w to hold specific widths for each field, as I don't know where you'd get those from otherwise.

The above uses GNU awk for FPAT; with other awks it'd be a while(match()) loop.
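For awks without FPAT, the while(match()) loop mentioned above could look roughly like the sketch below. It is only an illustration, not the answerer's code: the file names tst_portable.awk and sample.csv and the variables line, field and fmt are made up for this example, the widths are the same example widths as before, and, like the FPAT version, it assumes that quoted fields never contain newlines and that there are no empty quoted fields ("").

```shell
# Hypothetical file name; the script avoids FPAT and other GNU-only features.
cat > tst_portable.awk <<'EOF'
BEGIN {
    split("7 10 15 12 4", w)       # example widths, one per field
}
{
    line = $0 ","                  # trailing comma so every field ends with one
    gsub(/""/, RS, line)           # placeholder for escaped quotes, as in the FPAT version
    i = 0
    while (line != "") {
        if (match(line, /^"[^"]*"/)) {
            # quoted field: drop the enclosing quotes
            field = substr(line, 2, RLENGTH - 2)
        } else {
            # unquoted (possibly empty) field: everything up to the next comma
            match(line, /^[^,]*/)
            field = substr(line, 1, RLENGTH)
        }
        line = substr(line, RLENGTH + 2)   # skip the field and its comma
        gsub(RS, "\"", field)              # restore escaped quotes as literal "
        i++
        fmt = "<%-" w[i] "s>"              # build the width into the format string
        printf fmt, substr(field, 1, w[i])
    }
    print ""
}
EOF

# Same sample input as in the answer, under a hypothetical file name.
cat > sample.csv <<'EOF'
abcde,"ab,c,de","ab ""c"" de","a,""b"",c",ab
abcdefghi,"xyab,c,de","xyzab ""c"" de",abc,abcdefg
EOF

awk -f tst_portable.awk sample.csv
```

On the sample input this should print the same two lines as the GNU awk version. Because awk reads one record at a time, the same script can be pointed at the 3.5 GB file without loading it into memory; you would only need to supply one width per real column and drop the < and > markers.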
