Make awk efficient (again)


Question

I have the code below, which works successfully (kudos to @EdMorton), and is used to parse and clean log files (very large in size) and output them into smaller files. The output filename is the first 2 characters of each line. However, if either of those 2 characters is a special character, it needs to be replaced with a '_'. This ensures there is no illegal character in the filename.

Next, it checks whether any of the output files is larger than a certain size; if so, that file is sub-split by the 3rd character.

This takes about 10 minutes to process 1 GB worth of logs (on my laptop). Can this be made faster? Any help will be appreciated.

Sample log file

"email1@foo.com:datahere2     
email2@foo.com:datahere2
email3@foo.com datahere2
email5@foo.com;dtat'ah'ere2 
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
email3@foo.com:datahere2

Expected output

# cat em 
email1@foo.com:datahere2     
email2@foo.com:datahere2
email3@foo.com:datahere2
email5@foo.com:dtat'ah'ere2 
email3@foo.com:datahere2

# cat _leftover
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ

Code:

#!/usr/bin/env bash
Func_Clean(){
pushd $1 > /dev/null
    awk '
        {
            gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
            sub(/[,|;: \t]+/, ":")
            if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:/ && /^[\x00-\x7F]*$/) {
                print
            }
            else {
                print >> "_leftover"
            }
        } 
    ' * |
    sort -t':' -k1,1 |
    awk '
        { curr = tolower(substr($0,1,2)) }
        curr != prev {
            close(Fpath)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
            prev = curr
        }
        { 
            print >> Fpath
            # print | "gzip -9 -f >> " Fpath  # Throws an error
        } ' && rm *.txt

    find * -type f -prune -size +1000000c \( ! -iname "_leftover" \) |while read FILE; do
    awk '
        { curr = tolower(substr($0,1,3)) }
        curr != prev {
            close(Fpath)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
            prev = curr
        }
        { 
            print >> Fpath
            # print | "gzip -9 -f >> " Fpath   # Throws an error
        } ' "$FILE" && rm "$FILE"
    done

    #gzip -9 -f -r .    # This would work, but is it efficient?
popd > /dev/null
}

### MAIN - Starting Point ###
BASE_FOLDER="_test2"
for dir in $(find $BASE_FOLDER -type d); 
do
    if [ $dir != $BASE_FOLDER ]; then
        echo $dir
        time Func_Clean "$dir"
    fi
done

Answer

Wrt the subject Make awk efficient (again) - awk is extremely efficient; you're looking for ways to make your particular awk scripts more efficient and to make the shell script that calls awk more efficient.

The only obvious performance improvements I see are:

  1. Change:

find * -type f -prune -size +1000000c \( ! -iname "_leftover" \) |
while read FILE; do
    awk 'script' "$FILE" && rm "$FILE"
done

to something like (untested):

readarray -d '' files < <(find . -type f -prune -size +1000000c \( ! -iname "_leftover" \) -print0) &&
awk 'script' "${files[@]}" &&
rm -f "${files[@]}"

so you call awk once total instead of once per file.

  2. Call Func_Clean() once total for all files in all directories instead of once per directory.

  3. Use GNU parallel or similar to run Func_Clean() on all directories in parallel (a rough sketch follows below).
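
For example, here is a minimal sketch of item 3 (untested, my own illustration, not from the answer), assuming Func_Clean is defined in the same script and BASE_FOLDER is set as in the question:

export -f Func_Clean                      # make the function visible to the bash children parallel spawns
find "$BASE_FOLDER" -mindepth 1 -type d -print0 |
    parallel -0 Func_Clean {}             # one directory per job, NUL-delimited so odd names survive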

I see you're considering piping the output to gzip to save space; that's fine, but just be aware it will cost you something (idk how much) in terms of execution time. Also, if you do that, then you need to close the whole output pipeline, as that is what you're writing to from awk, not just the file at the end of it, so then your code would be something like (untested):

    { curr = tolower(substr($0,1,3)) }
    curr != prev {
        close(Fpath)
        Fpath = "gzip -9 -f >> " gensub(/[^[:alnum:]]/,"_","g",curr)
        prev = curr
    }
    { print | Fpath }
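
For context, here is an untested sketch (my own, not part of the answer) of how that fragment could be folded into the split_out function from the cleaned-up script below; the split_out_gz name and the .gz suffix on the output names are assumptions:

split_out_gz() {
    local n="$1"
    shift
    awk -v n="$n" '
        { curr = tolower(substr($0,1,n)) }
        curr != prev {
            close(Fpath)        # terminate the previous gzip pipeline, if any
            Fpath = "gzip -9 -f >> " gensub(/[^[:alnum:]]/,"_","g",curr) ".gz"
            prev = curr
        }
        { print | Fpath }       # write through the gzip pipe, not to a plain file
    ' "${@:--}"
}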

This isn't intended to speed things up other than the find suggestion above; it's just a cleanup of the code in your question to reduce redundancy and common bugs (UUOC, missing quotes, wrong way to read the output of find, incorrect use of >> vs >, etc.). Start with something like this (untested, and assuming you do need to separate the output files for each directory):

#!/usr/bin/env bash

clean_in() {
    awk '
        {
            gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
            sub(/[,|;: \t]+/, ":")
            if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:/ && /^[\x00-\x7F]*$/) {
                print
            }
            else {
                print > "_leftover"
            }
        } 
    ' "${@:--}"
}
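# Note: "${@:--}" passes the file arguments through to awk, or "-" (stdin) when none are given.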

split_out() {
    local n="$1"
    shift
    awk -v n="$n" '
        { curr = tolower(substr($0,1,n)) }
        curr != prev {
            close(Fpath)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
            prev = curr
        }
        { print > Fpath }
    ' "${@:--}"
}

Func_Clean() {
    local dir="$1"
    printf '%s\n' "$dir" >&2
    pushd "$dir" > /dev/null
    clean_in *.txt |
        sort -t':' -k1,1 |
            split_out 2 &&
    rm -f *.txt &&
    readarray -d '' big_files < <(find . -type f -prune -size +1000000c \( ! -iname "_leftover" \) -print0) &&
    split_out 3 "${big_files[@]}" &&
    rm -f "${big_files[@]}"
    popd > /dev/null
}

### MAIN - Starting Point ###
base_folder="_test2"
while IFS= read -r dir; do
    Func_Clean "$dir"
done < <(find "$base_folder" -mindepth 1 -type d)

If I were you I'd start with that (after any necessary testing/debugging) and THEN look for ways to improve the performance.
