UNIX Shell Script Solution for formatting a pipe-delimited, segmented file
Question
The input file has up to 34 different record types within the same line.
The file is pipe-delimited, and each record type is separated by '~' (except for the originating record type).
Not all 34 record types are contained on each line, and I do not need all of them.
All record types will be sent in a specified order, but not all record types will always be sent. The first record type is mandatory and will always be sent. Out of the 34 types, only 7 are mandatory.
Each record type has a predefined number of fields and should never deviate from that definition without proper lead time between the client and our load process.
The Oracle table will be constructed with all of the required columns based upon the needed record types. So one row will contain information from each record type similar to the input file, but will additionally include nulls for the columns which would come from certain record types that were not included in the input.
The end result I'm looking for is a way to perform conditional formatting to the input file in order to generate an output that can be simply loaded within a shell script via sqlldr instead of going through PL/SQL (as I want my non-PL/SQL coworkers to be able to troubleshoot/fix any issues encountered during loads).
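For the sqlldr side, one hedged option (a sketch, not part of the question's own plan) is to strip the ~XX record-type tags once every row has been normalized, so each line becomes one flat pipe-delimited record. The sed pattern below assumes two-letter uppercase record tags:

```shell
# Sketch: collapse "|~XX" record-type markers into a plain field
# separator and drop the leading "AA" tag, so the row becomes one flat
# pipe-delimited record that a simple sqlldr control file can load.
echo 'AA|23456|BCDEF|78901|GHIJK|~BB||~CC|BCDEF|23456|~DD||||||~EE|2|3|4|~GG|R|F|G' |
  sed -e 's/|~[A-Z][A-Z]|/|/g' -e 's/^AA|//'
```

A matching control file would then just list the columns with FIELDS TERMINATED BY '|' and TRAILING NULLCOLS, so the empty optional fields load as NULLs.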
Small example with 3 records (data types do not matter in this example):
Record Types: AA, BB, CC, DD, EE, FF, GG
AA has 5 fields (Mandatory)
BB has 2 fields (Optional)
CC has 3 fields (Optional)
DD has 6 fields (Optional)
EE has 4 fields (Optional)
FF has 2 fields (Not needed. Skipping in output)
GG has 4 fields (Optional)
AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~FF|P|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~CC|BCDEF|23456|~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~DD|B|C|D||~FF|Q
Line 1 has no issues because it has all available record types, but lines 2 and 3 do not. So they would need to be modified to include the missing record types. The overall output would need to look something like this:
AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~BB||~CC|BCDEF|23456|~DD||||||~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~BB||~CC|||~DD|B|C|D||~EE||||~GG|||
I have started by taking each record, splitting it into its own file, and using:
typeset -i count=0
while read record
do
    newfile="$file.$count.dat"               # one output file per input record
    echo "$record" | sed 's/|~/\n/g' > "$newfile"
    count=$count+1
done < "$file"
to put each record type on its own line within said file, but rolling it back up into one line with all possible fields present is quite tricky. This is obviously not the best way since each file can have several thousand records, which would result in several thousand files, but I was using that as a starting point to get the logic down.
Any ideas?
Answer
Here's an executable awk script solution that isn't completely rigorous, but could get you started:
#!/usr/bin/awk -f
BEGIN { FS = OFS = "~" }

FNR == NR {
    dflts[$1] = create_empty_field($1, $2)
    if ($3 ~ /req|opt/) fld_order[++fld_cnt] = $1
    fld_rule[$1] = $3
    next
}

{
    flds = ""
    j = 1
    for (i = 1; i <= fld_cnt; i++) {
        j = skip_flds(j)
        if ($j !~ ("^" fld_order[i])) fld = dflts[fld_order[i]]
        else { fld = $j; j++ }
        flds = flds (flds == "" ? "" : OFS) fld
    }
    print flds
}

function create_empty_field(name, cnt,    fld, i) {
    fld = name
    for (i = 1; i <= cnt; i++) { fld = fld "|" }
    return fld
}

function skip_flds(fnum,    name) {
    name = $fnum
    sub(/\|.*$/, "", name)
    while (fld_rule[name] == "skp") {
        fnum++
        name = $fnum
        sub(/\|.*$/, "", name)
    }
    return fnum
}
It takes an additional input file that specifies the defaults for each record type, which I've called "known_flds":
AA~5~req
BB~2~opt
CC~3~opt
DD~6~opt
EE~4~opt
FF~2~skp
GG~4~opt
which has the same delimiter as the data file because I didn't want to add FS switching in either the script or between the input files. It's an encoding of your field requirements. The final field is shorthand for:
- req -> required (mandatory in the input and output)
- opt -> optional (optional in the input only)
- skp -> skip (skipped in the output)
When awk.script is made executable and run as ./awk.script known_flds data, I get the following output:
AA|12345|ABCDE|67890|FGHIJ|~BB|12345|~CC|ABCDE|12345|~DD|A|B|C|D|E|~EE|1|2|3|~GG|F|R|T
AA|23456|BCDEF|78901|GHIJK|~BB||~CC|BCDEF|23456|~DD||||||~EE|2|3|4|~GG|R|F|G
AA|34567|CDEFG|89012|HIJKL|~BB||~CC|||~DD|B|C|D||~EE||||~GG||||
The GG field in the question's data doesn't appear to have the right number of fields specified, or is missing a trailing pipe in the input data.
I made at least the following assumptions:

- Each field in the file is correct - the fields themselves don't need padding.
- The fields are in the correct order, including fields that should be skipped.
- Any line might be missing the optional fields, and any missing, optional field should appear as an empty field in the output.
- The field order can be designated from the known_flds file. Otherwise, I might have picked the first line of the file to be complete, in the correct field order, and containing all fields required for the output. That wouldn't allow fields to be marked mandatory though.
Here's a simple breakdown of the script:
- FNR==NR - parse the first input file (known_flds) and create default empty fields using the create_empty_field() function, putting the results in dflts by field name. Create a basic field order, stored in the fld_order array. Skipped fields are not put into fld_order, but all field "rules" are added to the fld_rule array.
- All lines of the data file are then checked. The field order is checked, and only fld_cnt fields are attempted for any record. Any fields past the line count in known_flds won't be output.
- For any record, skip over "skp" fields and increment j.
- Build a flds variable with either the current field ($j) or, if the field appears to be missing, an empty field from dflts.
- Print out flds with the additional empty fields but without the skipped fields.
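The FNR==NR test in the first block is the standard two-file awk idiom (FNR equals NR only while the first file is being read). A toy illustration, separate from awk.script:

```shell
# FNR==NR is true only for the first file, so its lines are loaded into
# an array; the second file is then checked against that array.
printf 'BB~2~opt\n' > rules.tmp
printf 'BB\nZZ\n' > recs.tmp
awk -F'~' 'FNR == NR { rule[$1] = $3; next }
           { print $1, (($1 in rule) ? rule[$1] : "unknown") }' rules.tmp recs.tmp
rm -f rules.tmp recs.tmp
```

which prints "BB opt" and then "ZZ unknown".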
Here's a breakdown of the create_empty_field() function:

- name, cnt are arguments taken from the first file, while fld, i are local variables set to empty values for use within the function.
- Set fld to name ($1 from known_flds).
- Generate pipes up to the cnt value ($2 from known_flds).
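So for the DD line of known_flds the function builds the tag plus six pipes, which you can check in isolation:

```shell
# Stand-alone check of create_empty_field(): "DD" with cnt=6 should
# produce the tag followed by six pipes.
awk 'function create_empty_field(name, cnt,    fld, i) {
         fld = name
         for (i = 1; i <= cnt; i++) { fld = fld "|" }
         return fld
     }
     BEGIN { print create_empty_field("DD", 6) }'
```

which prints DD|||||| - exactly the placeholder that appears in the output for line 2's missing DD record.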
And of skip_flds():

- fnum is the argument for the record field number, while name is a local variable.
- Pull the name part from $fnum.
- Check whether the field should be skipped with the fld_rule[name] == "skp" test.
- If it should be skipped, increment fnum and reset the name variable.
- I think the repeated name = and sub() call lines should really be a new function, but I didn't do that here.
Basically, I'm making parsing/transformation rules in known_flds and then interpreting/enforcing them with awk.script against the records in the data file. While this is a reasonable start, you could additionally print errors to another file when mandatory fields are not present or would be empty, add missing subfields to fields, and so on. You can get as complicated as you like.
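For example, the mandatory-field check could be a small follow-up pass. This sketch (not part of awk.script) only validates that the mandatory AA record leads each line, passing good lines through and reporting the rest on stderr:

```shell
# Sketch of the error-reporting idea: print lines that start with the
# mandatory AA record, and report the rest on stderr.
printf 'AA|1|~BB|x\nBB|x\n' | awk '
  /^AA\|/ { print; next }
  { printf "line %d: missing mandatory AA record\n", NR > "/dev/stderr" }'
```

A fuller version would consult the fld_rule array so every "req" type is checked, not just AA.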