Process multiple files using awk
Question
I've got to process lots of txt files (16 million rows in each file) using awk. For example, I have to read ten files:
File #1:
en sample_1 200
en.n sample_2 10
en sample_3 10
File #2:
en sample_1 10
en sample_3 67
File #3:
en sample_1 1
en.n sample_2 10
en sample_4 20
...
I'd like to get an output like this:
source title f1 f2 f3 sum(f1,f2,f3)
en sample_1 200 10 1 211
en.n sample_2 10 0 10 20
en sample_3 10 67 0 77
en sample_4 0 0 20 20
Here is my first version of the code:
#! /bin/bash
clear
#var declaration
BASEPATH=<path_to_file>
YEAR="2014"
RES_FOLDER="processed"
FINAL_RES="2014_06_01"
#results folder creation
mkdir $RES_FOLDER
#processing
awk 'NF>0{a[$1" "$2]=a[$1" "$2]" "$3}END{for(i in a){print i a[i]}}' $BASEPATH/$YEAR/* > $RES_FOLDER/$FINAL_RES
Here is my output:
en sample_1 200 10 1
en.n sample_2 10 10
en sample_3 10 67
en sample_4 20
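That behavior can be reproduced standalone; the sketch below runs the same one-liner, with file1..file3 as placeholder names for the three sample files. Since the program only appends the third field to whatever key it sees, files that lack a key simply contribute nothing, which is why the zero columns are missing:

```shell
# Recreate the three sample files from the question.
printf 'en sample_1 200\nen.n sample_2 10\nen sample_3 10\n' > file1
printf 'en sample_1 10\nen sample_3 67\n' > file2
printf 'en sample_1 1\nen.n sample_2 10\nen sample_4 20\n' > file3

# Concatenate the third field of every file onto the "source title" key.
# Keys absent from a file leave no placeholder, hence no zero columns.
awk 'NF>0{a[$1" "$2]=a[$1" "$2]" "$3} END{for(i in a) print i a[i]}' file1 file2 file3
```

The four output lines (in arbitrary `for (i in a)` order) match the truncated output shown above, e.g. `en sample_3 10 67` with no third column.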
I'm a little bit confused about how to put a zero in the columns where no occurrence is found, and how to get the sum of all the values. I know I have to use this:
{tot[$1" "$2]+=$3} END{for (key in tot) print key, tot[key]}
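For what it's worth, that accumulator can be run standalone to check the per-key totals; this is just a sketch, with file1..file3 standing in for the real dataset files:

```shell
# Recreate the three sample files from the question.
printf 'en sample_1 200\nen.n sample_2 10\nen sample_3 10\n' > file1
printf 'en sample_1 10\nen sample_3 67\n' > file2
printf 'en sample_1 1\nen.n sample_2 10\nen sample_4 20\n' > file3

# Sum the third field per "source title" key across every input file.
# This yields only the grand total per key, without the per-file columns.
awk '{tot[$1" "$2]+=$3} END{for (key in tot) print key, tot[key]}' file1 file2 file3
```

On the sample data this gives 211, 20, 77 and 20 for the four keys, i.e. the last column of the desired output but not the per-file breakdown.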
Hope someone can help. Thank you.
******** EDITED ********
I'm trying to achieve my result in a different way. I created a bash script like the one below: it produces a sorted file containing all of my keys (it's very large, about 62 million records), slices that file into pieces, and passes each piece to my awk script.

Bash:
#! /bin/bash
clear
FILENAME=<result>
BASEPATH=<base_path>
mkdir processed/slice
cat $BASEPATH/dataset/* | cut -d' ' -f1,2 > $BASEPATH/processed/aggr
sort -u -k2 $BASEPATH/processed/aggr > $BASEPATH/processed/sorted
split -d -l 1000000 processed/sorted processed/slice/slice-
echo $(date "+START PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")
for filename in processed/slice/*; do
awk -v filename="$filename" -f algorithm.awk dataset/* >> processed/$FILENAME
done
echo $(date "+END PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")
rm $BASEPATH/processed/aggr
rm $BASEPATH/processed/sorted
rm -rf $BASEPATH/processed/slice
AWK:
BEGIN {
    while (getline < filename) {
        key = $1" "$2
        sources[key]
        for (i=1; i<11; i++) {
            keys[key"-"i] = "0"
        }
    }
    close(filename)
}
{
    if (FNR==1) {
        ARGIND++
    }
    key = $1" "$2
    keys[key"-"ARGIND] = $3
}
END {
    for (s in sources) {
        sum = 0
        printf "%s", s
        for (j=1; j<11; j++) {
            printf "%s%s", OFS, keys[s"-"j]
            sum += keys[s"-"j]
        }
        print " "sum
    }
}
With awk I preallocate my final array, and by reading the dataset/* folder I populate its content. I've figured out that my bottleneck comes from iterating over the dataset folder via awk's input (10 files with 16,000,000 lines each). Everything works on a small set of data, but with the real data the RAM (30GB) gets saturated. Does anyone have any suggestions or advice? Thank you.
Answer
$ cat tst.awk
{
    key = $1" "$2
    keys[key]
    val[key,ARGIND] = $3
}
END {
    for (key in keys) {
        sum = 0
        printf "%s", key
        for (fileNr=1; fileNr<=ARGIND; fileNr++) {
            printf "%s%s", OFS, val[key,fileNr]+0
            sum += val[key,fileNr]
        }
        print OFS sum
    }
}
$ awk -f tst.awk file1 file2 file3
en sample_4 0 0 20 20
en.n sample_2 10 0 10 20
en sample_1 200 10 1 211
en sample_3 10 67 0 77
The above uses GNU awk for ARGIND; with other awks just add a line FNR==1{ARGIND++} at the start. Pipe the output to sort if necessary.
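A self-contained portable variant might look like the sketch below. It counts files with a lowercase `argind` variable rather than incrementing `ARGIND` itself, since gawk already maintains `ARGIND` automatically and incrementing it there would double-count; file1..file3 are placeholder names:

```shell
# Recreate the three sample files from the question.
printf 'en sample_1 200\nen.n sample_2 10\nen sample_3 10\n' > file1
printf 'en sample_1 10\nen sample_3 67\n' > file2
printf 'en sample_1 1\nen.n sample_2 10\nen sample_4 20\n' > file3

awk '
FNR==1 { argind++ }            # FNR resets on each new file: plain counter, any awk
{
    key = $1" "$2
    keys[key]
    val[key,argind] = $3       # remember this key'\''s value for this file number
}
END {
    for (key in keys) {
        sum = 0
        printf "%s", key
        for (fileNr=1; fileNr<=argind; fileNr++) {
            printf "%s%s", OFS, val[key,fileNr]+0   # +0 turns a missing entry into 0
            sum += val[key,fileNr]
        }
        print OFS sum
    }
}' file1 file2 file3 | sort
```

On the sample files this produces the four lines shown above, with the zero columns filled in and the sum as the last field.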