Is there any faster way to truncate a column in Unix
Question
I want to truncate the 4th column of a TSV file to a given length in Unix. The file has a few million records and is 8 GB in size.
I am trying this, but it seems slow.
awk -F"\t" '{s=substr($4,0,256); print $1"\t"$2"\t"$3"\t"s"\t"$5"\t"$6"\t"$7}' file > newFile
Are there any faster alternatives?
Thanks
Answer
Your command could be written a little more nicely (assuming you are rebuilding the record), which may give some performance increase:
awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,256) } 1' file > newFile
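As a quick sanity check, here's a minimal sketch of the rebuilt command on a tiny 7-column TSV (the sample data and file names are made up for illustration). One detail worth knowing: awk strings are indexed from 1, so `substr($4, 1, 256)` keeps exactly 256 characters, while a start index of 0 effectively keeps only 255:

```shell
# Build a tiny 7-column TSV (hypothetical data) whose 4th field is 300 chars long.
printf 'a\tb\tc\t%s\te\tf\tg\n' "$(printf 'x%.0s' $(seq 1 300))" > sample.tsv

# Rebuild each record with column 4 truncated; the trailing "1" is a
# condition that is always true, so awk prints the (modified) record.
awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4, 1, 256) } 1' sample.tsv > out.tsv

# Field 4 of the output is now exactly 256 characters.
awk -F'\t' '{ print length($4) }' out.tsv
```

Because only `$4` is assigned, awk rebuilds the record using `OFS`, so the tab separators are preserved without spelling out every field as in the original command.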
If you have access to a multi-core machine (which you probably do), you can use GNU parallel. You may want to vary the number of cores you use (I've set 4 here) and the block size that's fed to awk (I've set this to two megabytes)...
< file parallel -j 4 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) } 1' > newFile
Here's some testing I did on my system using a 2.7G file with 100 million lines and a block size of 2M:
time awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) } 1' file >/dev/null
Results:
real 1m59.313s
user 1m57.120s
sys 0m2.190s
Single core:
time < file parallel -j 1 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) } 1' >/dev/null
Results:
real 2m28.270s
user 4m3.070s
sys 0m41.560s
Four cores:
time < file parallel -j 4 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) } 1' >/dev/null
Results:
real 0m54.329s
user 2m41.550s
sys 0m31.460s
Twelve cores:
time < file parallel -j 12 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) } 1' >/dev/null
Results:
real 0m36.581s
user 2m24.370s
sys 0m32.230s