How to parse contents of a file using sed/awk?
My input file has its contents in the following format, where each column is separated by a single space:
string1<space>string2<space>string3<space>YYYY-mm-dd<space>hh:mm:ss.SSS<space>string4<space>10:1234567890<space>0e:Apple 1.2.3.4<space><space>string5<space>HEX
There are two spaces after "0e:Apple 1.2.3.4" because this field has no 14th character. The entire "0e:Apple 1.2.3.4<space>" is treated as the single value of that column.
In the 7th column, 10: gives the number of characters in the string that follows.
In the 8th column, 0e: is the hex value of 14. So the prefix gives the number of characters in the string that follows.
For example:
"0e:Apple 1.2.3.4 " is the actual value in the 8th column, without the quotes
(I've added the quotes to show that the 14th character is blank).
It's counted as:
0e:A  p  p  l  e     1  .  2  .  3  .  4
   |  |  |  |  |  |  |  |  |  |  |  |  |  |
   1  2  3  4  5  6  7  8  9  10 11 12 13 14
Let's consider the first row of the input file:
string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 10:1234567890 0e:Apple 1.2.3.4  string5 001e
where:
- string1 is the value in the 1st column
- string2 is the value in the 2nd column
- string3 is the value in the 3rd column
- yyyy-mm-dd is in the 4th
- 23:50:45.999 is in the 5th
- string4 is in the 6th
- 10:1234567890 is in the 7th (there is no space at the end because it has 10 digits)
- "0e:Apple 1.2.3.4 " is in the 8th (with a space at the end)
- string5 is in the 9th
- 001e is in the 10th
Expected output:
string1,string2,string3,yyyy-mm-dd,23:50:45.999,string4,1234567890,Apple_1.2.3.4,string5,30
Requirements:
- Eliminate the counts from the 7th and 8th columns (10: and 0e:).
- The space between Apple and 1.2.3.4 should be replaced by "_".
- The hex value in the last column should be converted to a decimal value.
- Replace the "space" between columns with ",".
- I've used a hex value only in the 10th column here. What if it's in several columns? Is there a way to convert only certain columns?
I've tried using this:
$ cat input.txt |sed 's/[a-z0-9].*://g'
which gives output as:
string1,string2,string3,yyyy-mm-dd,45.999,string4,1234567890,Apple,1.2.3.4,,string5,001e
This will do what you want on your example input:
awk -F "[ ]" '{sub(/.*:/, "", $7); sub(/.*:/, "", $8); printf "%s,%s,%s,%s,%s,%s,%s,%s_%s,%s,%d\n", $1, $2, $3, $4, $5, $6, $7, $8, $9, $11, "0x"$12}' input.txt
Explanation of parts:
awk's printf allows you to specify an output format, so you can choose which fields to delimit with , and which to delimit with _.
-F "[ ]" forces the field separator to be a single space, so that awk knows there is an empty field between two consecutive spaces. The default behavior treats a run of spaces as a single delimiter, which is not what you want according to the question.
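To see the difference between the two splitting modes, here is a small sketch. It assumes the sample row from the question, typed with two spaces before string5, and just prints how many fields awk finds each way:

```shell
# Sample row from the question; note the two spaces before "string5".
line='string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 10:1234567890 0e:Apple 1.2.3.4  string5 001e'

# Default splitting: a run of blanks counts as one separator.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# Bracket expression: every single space is a separator, so the
# doubled space produces one empty field.
single_nf=$(printf '%s\n' "$line" | awk -F '[ ]' '{ print NF }')

echo "default: $default_nf fields, -F '[ ]': $single_nf fields"
```

With the single-space separator the empty field sits at position 10, which is why string5 and the hex value end up in $11 and $12.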
The sub function does a regular expression replacement, in this case removing the ..: prefix from fields 7 and 8.
For field 12, we tell printf to output a number (%d) and give it as input the string prefixed by 0x, so that it is interpreted as hexadecimal.
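Whether a string starting with 0x is read as hexadecimal depends on the awk implementation; GNU awk, for one, does not do this by default. As a hedged alternative, the sketch below converts the digits by hand and also covers the "several columns" case. The function name hex2dec, the variable hexcols, and the three-column sample line are made up for illustration:

```shell
# Hypothetical sample: columns 2 and 3 hold hex values.
line='string5 001e 00ff'

out=$(printf '%s\n' "$line" | awk -F '[ ]' -v hexcols='2,3' '
# hex2dec: convert a string of hex digits to a number without relying
# on implementation-specific handling of "0x" prefixes.
function hex2dec(s,    i, d, n) {
    s = tolower(s)
    n = 0
    for (i = 1; i <= length(s); i++) {
        d = index("0123456789abcdef", substr(s, i, 1))
        if (d == 0) return -1          # not a hex digit
        n = n * 16 + (d - 1)
    }
    return n
}
BEGIN { OFS = ","; ncols = split(hexcols, cols, ",") }
{
    # Convert only the columns listed in hexcols.
    for (k = 1; k <= ncols; k++) $(cols[k]) = hex2dec($(cols[k]))
    $1 = $1                            # rebuild the record with OFS
    print
}')

echo "$out"
```

Changing the list passed in hexcols is enough to pick other columns; for the row in the question that would be column 12 under the single-space separator.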
Note: If it's not always the case that you want the output to be $8_$9, then you actually need to parse the hexadecimal prefix and count off characters in order to determine where the field ends. If that's the case, I would personally prefer to write the whole thing in something else, e.g. Python.
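For what it's worth, counting off characters can still be sketched in awk. This is only a rough sketch under assumptions taken from the sample row: the count in column 7 is decimal, the count in column 8 is hexadecimal, and exactly one space separates fields. The helper names take_plain, take_counted and hex2dec are invented here:

```shell
line='string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 10:1234567890 0e:Apple 1.2.3.4  string5 001e'

out=$(printf '%s\n' "$line" | awk '
function hex2dec(s,    i, d, n) {
    s = tolower(s)
    n = 0
    for (i = 1; i <= length(s); i++) {
        d = index("0123456789abcdef", substr(s, i, 1))
        if (d == 0) return -1
        n = n * 16 + (d - 1)
    }
    return n
}
# take_plain: cut one space-delimited field off the front of "rest".
function take_plain(    p, f) {
    p = index(rest, " ")
    if (p == 0) { f = rest; rest = "" }
    else { f = substr(rest, 1, p - 1); rest = substr(rest, p + 1) }
    return f
}
# take_counted: cut one "count:payload" field off the front of "rest".
# base is 10 or 16 (assumption: both occur in the sample row).
function take_counted(base,    p, n, f) {
    p = index(rest, ":")
    n = substr(rest, 1, p - 1)
    n = (base == 16) ? hex2dec(n) : n + 0
    f = substr(rest, p + 1, n)
    rest = substr(rest, p + 1 + n + 1)   # skip payload plus one separator
    return f
}
{
    rest = $0
    out = take_plain()
    for (i = 2; i <= 6; i++) out = out "," take_plain()
    out = out "," take_counted(10)       # column 7, decimal count
    f8 = take_counted(16)                # column 8, hex count
    sub(/ +$/, "", f8)                   # drop the trailing padding
    gsub(/ /, "_", f8)                   # inner spaces become "_"
    out = out "," f8 "," take_plain()
    out = out "," hex2dec(take_plain())
    print out
}')

echo "$out"
```

Because the payload length comes from the prefix, this does not depend on the 8th column containing exactly one inner space.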