UNIX Shell Script Solution for formatting a pipe-delimited, segmented file


Problem Description


The input file has up to 34 different record types within the same line.


The file is pipe-delimited, and each record type is separated by '~' (except for the originating record type).


Not all 34 record types are contained on each line, and I do not need all of them.


All record types will be sent in a specified order, but not all record types will always be sent. The first record type is mandatory and will always be sent. Out of the 34 types, only 7 are mandatory.


Each record type has a predefined number of fields and should never deviate from that definition without proper lead time between the client and our load.


The Oracle table will be constructed with all of the required columns based upon the needed record types. So one row will contain information from each record type, similar to the input file, but will additionally include nulls for the columns that come from record types not included in the input.


The end result I'm looking for is a way to perform conditional formatting to the input file in order to generate an output that can be simply loaded within a shell script via sqlldr instead of going through PL/SQL (as I want my non-PL/SQL coworkers to be able to troubleshoot/fix any issues encountered during loads).
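
For reference, the sqlldr call inside such a wrapper might look something like the sketch below; the connect string, control file, and data file names here are placeholders, not details from the actual load:

sqlldr userid=loader/secret@ORCL \
       control=record_load.ctl \
       data=formatted_output.dat \
       log=record_load.log \
       bad=record_load.bad

The control file could declare FIELDS TERMINATED BY '|' and could flag the "~BB"-style tag columns as FILLER so they are never loaded into the table.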


Small example with 3 records (data types do not matter in this example):

Record Types:  AA, BB, CC, DD, EE, FF, GG
AA has 5 fields (Mandatory)  
BB has 2 fields (Optional)  
CC has 3 fields (Optional)  
DD has 6 fields (Optional)  
EE has 4 fields (Optional)  
FF has 2 fields (Not needed.  Skipping in output)  
GG has 4 fields (Optional)


AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~FF|P|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~CC|BCDEF|23456|~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~DD|B|C|D||~FF|Q


Line 1 has no issues because it has all available record types, but lines 2 and 3 do not. So they would need to be modified to include the missing record types. The overall output would need to look something like this:

AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~BB||~CC|BCDEF|23456|~DD||||||~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~BB||~CC|||~DD|B|C|D||~EE||||~GG|||


I have started by taking each record, splitting it into its own file, and using:

typeset -i count=0
while read record
do
    # one output file per input line: $file.0.dat, $file.1.dat, ...
    newfile="${file}.${count}.dat"
    # turn every "|~" segment boundary into a newline (one record type per line)
    echo "$record" | sed 's/|~/\n/g' > "$newfile"
    count=$count+1
done < "$file"


to put each record type on its own line within said file, but rolling it back up into one line with all possible fields present is quite tricky. This is obviously not the best way since each file can have several thousand records, which would result in several thousand files, but I was using that as a starting point to get the logic down.
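
For the first input line, that loop produces a file ($file.0.dat, assuming a sed that understands \n in the replacement, as GNU sed does) with one record type per line:

AA|12345|ABCDE|67890|FGHIJ
BB|12345
CC|ABCDE|12345
DD|A|B|C|D|E
EE|1|2|3
FF|P
GG|F|R|T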

Any ideas?

Answer


Here's an executable awk script solution that isn't completely rigorous, but could get you started:

#!/usr/bin/awk -f

BEGIN { FS=OFS="~" }

FNR==NR {
    dflts[$1] = create_empty_field($1,$2)
    if( $3 ~ /req|opt/ ) fld_order[++fld_cnt] = $1
    fld_rule[$1] = $3
    next
}

{
    flds = ""
    j = 1
    for(i=1; i<=fld_cnt; i++) {
        j = skip_flds( j )

        if($j !~ ("^" fld_order[i])) fld = dflts[fld_order[i]]
        else { fld = $j; j++ }
        flds = flds (flds=="" ? "" : OFS) fld
    }
    print flds
}

function create_empty_field(name, cnt,     fld, i) {
    fld = name
    for(i=1; i<=cnt; i++) { fld = fld "|" }
    return( fld )
}

function skip_flds(fnum,     name) {
    name = $fnum
    sub(/\|.*$/, "", name)
    while(fld_rule[name] == "skp") {
        fnum++
        name = $fnum
        sub(/\|.*$/, "", name)
    }
    return( fnum )
}


It takes an additional input file that specifies the defaults for each type of field, which I've called "known_flds"

AA~5~req
BB~2~opt
CC~3~opt
DD~6~opt
EE~4~opt
FF~2~skp
GG~4~opt


which has the same delimiter as the data file because I didn't want to add FS switching in either the script or between the input files. It's an encoding of your field requirements. The final field is shorthand for:


  • req -> mandatory (in the input, the output, or both?)
  • opt -> optional (optional in the input only)
  • skp -> skip (in the output)


When awk.script is made executable and run like ./awk.script known_flds data, I get the following output:

AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~BB||~CC|BCDEF|23456|~DD||||||~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~BB||~CC|||~DD|B|C|D||~EE||||~GG||||


The GG record type in the question's data doesn't appear to have the right number of fields specified, or is missing a trailing pipe in the input data.

I made at least the following assumptions:

  • Each field in the file is correct - the fields themselves don't need padding
  • The fields are in the correct order, including fields that should be skipped.
  • Any line might be missing the optional fields, and any missing, optional field should appear as an empty field in the output.
  • The field order can be designated from the known_flds file. Otherwise, I might have picked the first line of the file to be complete and in the correct field order, as well as containing all of the fields required for the output (a sketch of that follows below). That wouldn't allow fields to be marked as mandatory, though.
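
A minimal sketch of that alternative, assuming the first data line really is complete and in order (the name derive_order.awk is just illustrative):

#!/usr/bin/awk -f
# derive_order.awk - print the record-type order found on the first data line
BEGIN { FS="~" }
FNR==1 {
    for(i=1; i<=NF; i++) {
        name = $i
        sub(/\|.*$/, "", name)    # keep only the leading record-type tag
        printf "%s%s", name, (i<NF ? " " : "\n")
    }
    exit
}

Against the sample data it would print AA BB CC DD EE FF GG, which could then seed fld_order instead of known_flds.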


Here's a simple breakdown of the script:


  • FNR==NR - parse the known_flds file and create default empty fields using the create_empty_field() function, putting the results in dflts by field name. Create a basic field order, store it in the fld_order array. Skipped fields are not put into fld_order, but all field "rules" are added to the fld_rule array.
  • All data lines will be checked. Check the field order and only attempt to print out fld_cnt fields for any record. Any fields past the number of entries in known_flds won't be output.
  • For any record, skip "skp" fields and increment j.
  • Build a flds variable with either the current field by $j or if it appears to be missing a field, with an empty field from dflts.
  • Print out flds with the additional, empty fields but without skipped fields.
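
As a quick check of that flow (assuming an awk that accepts - for standard input), the third sample line can be pushed through on its own:

echo 'AA|34567|CDEFG|89012|HIJKL|~DD|B|C|D||~FF|Q' | ./awk.script known_flds -

which prints

AA|34567|CDEFG|89012|HIJKL|~BB||~CC|||~DD|B|C|D||~EE||||~GG||||

The FF segment is dropped by skip_flds(), and the missing BB, CC, EE and GG segments come back as empty defaults.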

Here's a breakdown of the functions:

create_empty_field()


  • name and cnt are arguments from the first file, while fld and i are local variables set to empty values for use within the function.
  • set fld to name ( $1 from known_flds )
  • Generate pipes up to cnt value ( $2 from known_flds ).
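
For example, the DD entry (DD~6~opt) yields the tag followed by six empty fields; the same loop can be reproduced on the command line:

awk 'BEGIN { fld = "DD"; for(i=1; i<=6; i++) fld = fld "|"; print fld }'

which prints DD||||||, exactly the default segment that appears as ~DD|||||| in the second output line.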

skip_flds()


  • fnum is the argument for the record field number, while name is a local variable
  • Pull the name part from $fnum
  • Check to see if it should be skipped with fld_rule[name] == "skp" test.
  • If it should be skipped, increment fnum and reset the name variable.
  • I think the repeated name = and sub call lines should really be a new function, but I didn't do that here.


Basically, I'm making parsing/transformation rules in known_flds and then interpreting/enforcing them with awk.script against records in a data file. While this is a reasonable start, you could additionally print errors to another file when mandatory fields are not present or would be empty, add missing subfields to fields, etc. You could get as complicated as you like.
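
For instance, a small companion script along these lines (check_req.awk is a hypothetical name, not part of the answer above) could report data lines that are missing a mandatory record type entirely:

#!/usr/bin/awk -f
# check_req.awk - report data lines that have no segment for a "req" record type
BEGIN { FS="~" }
FNR==NR { if($3 == "req") req[$1] = 1; next }   # first file: known_flds
{
    for(r in req) {
        found = 0
        for(i=1; i<=NF; i++)
            if($i ~ ("^" r)) { found = 1; break }
        if(!found)
            printf "line %d: mandatory record type %s missing\n", FNR, r > "load_errors.txt"
    }
}

Run the same way as awk.script (./check_req.awk known_flds data); anything that lands in load_errors.txt could then stop or flag the sqlldr step.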
