Splitting out a large file

Views: 63 Published: 2021/5/9 20:52:35 Tags: bash awk sed

Question

I would like to process a 200 GB file with lines like the following:

...
{"captureTime": "1534303617.738","ua": "..."}
...

The objective is to split this file into multiple files, one per hour of capture time.

This is my basic script:

#!/bin/sh

echo "Splitting files"

echo "Total lines"
sed -n '$=' $1

echo "First Date"
head -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'

echo "Last Date"
tail -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'

while read p; do
  date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
  echo $p >> split.$date
done <$1

Some facts:

80,000,000 lines to process
jq doesn't work well since some JSON lines are invalid.

Could you help me optimize this bash script?

Thanks

Recommended answer

This awk solution might come to your rescue:

awk -F'"' '{ file = "split." strftime("%Y%m%d%H", $4); print >> file; close(file) }' "$1"

It essentially replaces your while-loop: a single awk process handles every line, instead of spawning sed, xargs and date once per line.
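To see what the one-liner does before pointing it at 200 GB, it can be tried on a tiny sample first. The following is only a sketch: the sample lines are made up, the temporary directory and the `awk_bin` variable are illustration helpers, the output files are given the `split.` prefix used in the question's script, and `TZ=UTC` is set purely so the hour buckets are reproducible (drop it to bucket by local time).

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so the split.* files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir"

# Made-up sample: two lines in one hour, one line in the next hour.
cat > sample.json <<'EOF'
{"captureTime": "1534303617.738","ua": "a"}
{"captureTime": "1534303700.000","ua": "b"}
{"captureTime": "1534307300.125","ua": "c"}
EOF

# Prefer gawk when installed, since strftime() is not part of POSIX awk.
awk_bin=$(command -v gawk || command -v awk)

# With '"' as the field separator, $4 is the captureTime value.
TZ=UTC "$awk_bin" -F'"' '{
    file = "split." strftime("%Y%m%d%H", $4)
    print >> file
    close(file)
}' sample.json

# Show how many lines ended up in each hourly file.
for f in split.*; do
    printf '%s %s\n' "$f" "$(grep -c '' "$f")"
done
```

With this sample and `TZ=UTC`, the first two lines should land in `split.2018081503` and the third in `split.2018081504`.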

Furthermore, you can replace the complete script with:

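# Start AWK file
BEGIN { FS = "\"" }                      # split on double quotes, so $4 is the captureTime value
(NR == 1)   { tmin = tmax = $4 }         # seed min/max with the first timestamp
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
# Append each line to its hourly file; close it so only one file stays open at a time
{ file = "split." strftime("%Y%m%d%H", $4); print >> file; close(file) }
END {
  print "Total lines processed: ", NR
  print "First date: " strftime("%Y%m%d%H", tmin)
  print "Last date:  " strftime("%Y%m%d%H", tmax)
}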

which you can then run as:

awk -f <awk_file.awk> <jq-file>
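Since the question mentions that some lines are invalid JSON, one possible refinement is to route lines without a usable timestamp to a separate file instead of letting them fall into a bogus hourly file. This is a sketch under assumptions, not part of the original answer: the file name `split.invalid` is hypothetical, the sample lines are made up, and the guard only checks that field 2 is `captureTime` and field 4 looks like an epoch number. It does not validate the JSON itself.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Made-up sample: one good line, one non-JSON line, one empty timestamp.
cat > sample.json <<'EOF'
{"captureTime": "1534303617.738","ua": "a"}
this line is not JSON at all
{"captureTime": "","ua": "b"}
EOF

# Prefer gawk when installed, since strftime() is not part of POSIX awk.
awk_bin=$(command -v gawk || command -v awk)

TZ=UTC "$awk_bin" -F'"' '
    # Lines with the expected key and a numeric epoch value go to hourly files.
    $2 == "captureTime" && $4 ~ /^[0-9]+(\.[0-9]+)?$/ {
        file = "split." strftime("%Y%m%d%H", $4)
        print >> file
        close(file)
        next
    }
    # Everything else is kept aside for later inspection (hypothetical file name).
    { print >> "split.invalid" }
' sample.json

ls split.*
```

Keeping the rejected lines rather than discarding them makes it possible to count and inspect them afterwards.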

Note: the use of strftime means that you need GNU awk (gawk).
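A quick way to find out whether the awk on a given machine provides strftime is to call it in a BEGIN block and look at the exit status. The helper name `has_strftime` below is made up for this sketch; an awk without the function is expected to fail with an error, which the helper treats as "not available".

```shell
#!/bin/sh

# Succeeds (exit status 0) if the given awk binary provides strftime().
# Timestamp 86400 is 1970-01-02 in UTC, so the year must be 1970.
has_strftime() {
    TZ=UTC "$1" 'BEGIN { exit (strftime("%Y", 86400) == "1970") ? 0 : 1 }' 2>/dev/null
}

if has_strftime awk; then
    echo "awk: strftime available"
else
    echo "awk: no strftime here; try gawk instead"
fi
```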
