awk: preserve row order and remove duplicate (mirror) strings when generating data
Problem description
I have two text files:
g1.txt
alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;
g2.txt
Jack to ride.zip;http://alfa.org;
JKr.rui.rar;http://gamma.org;
Nofj ogk.png;http://gamma.org;
I use this command to run my awk script:
awk -f ./join2.sh g1.txt g2.txt > "g3.txt"
I get this output:
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;;Jack to ride.zip;http://alfa.org;JKr.rui.rar;http://gamma.org;Nofj ogk.png;http://gamma.org;
alfa beta;www.google.com;
What are the problems?
1. Row order is not preserved. For example, in the output file g3.txt, the line alfa beta;www.google.com; comes after the Light... line, when it should come first, as it does in g1.txt.
2. The Light... line contains many mirrored (duplicate) strings, as you can see in g3.txt:
http://alfa.org
http://gamma.org
http://gamma.org
are repeated in the same row.
What output do I want? Like this:
alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;Jack to ride.zip;JKr.rui.rar;Nofj ogk.png;
First: I tried to implement a function that checks whether a row contains identical strings. For example, in my output row Light Dweller - CR, Technical Metal... the same strings appear more than once, such as http://alfa.org and http://gamma.org. I don't want this. I want each string enclosed between delimiters (;) to appear only once in each row.
This rule should apply only to the output file, g3.txt.
Second: I want the original order of the rows in g1.txt to be preserved in the g3.txt output file. For example, in g1.txt I have
alfa beta ...
Light Dweller ...
but my script returns a different order
Light Dweller ...
alfa beta ...
I want to prevent the rows from being reordered.
My join2.sh script is this:
#! /usr/bin/awk -f
BEGIN {
    OFS=FS=";"
    C=0;
}
{
    if (ARGIND == 1) {
        X = $NF
        T0[$NF] = C++
        $NF = ""
        if (T1[X]) {
            T1[X] = T1[X] $0
        } else {
            T1[X] = $0
        }
    } else {
        X = $NF
        T0[$NF] = C++
        $NF = ""
        if (T2[X]) {
            T2[X] = T2[X] $0
        } else {
            T2[X] = $0
        }
    }
}
END {
    for (X in T0) {
        # concatenate T1[X] and X, since T1[X] ends with ";"
        print T1[X] X, T2[X]
    }
}
Solution:
Recommended answer
You should process g2.txt first, like this:
cat join2.awk
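BEGIN {
    OFS = FS = ";"
}
# First file (g2.txt): map each URL ($2) to the names ($1) that point to it.
# ARGIND is a GNU awk extension.
ARGIND == 1 {
    map[$2] = ($2 in map ? map[$2] OFS : "") $1
    next
}
# Second file (g1.txt): append the mapped names to each row, in input order.
{
    r = $0
    for (i = 1; i <= NF; ++i)
        if ($i in map)
            r = r OFS map[$i]
    $0 = r
}
# Print every row.
1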
Then use it as:
awk -f join2.awk g2.txt g1.txt
alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;;Jack to ride.zip;JKr.rui.rar;Nofj ogk.png
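The output above keeps the row order of g1.txt because rows are printed as they are read, whereas the question's script printed from a for (X in T0) loop, whose traversal order awk leaves unspecified. It still differs slightly from the requested output: it contains ;; where the trailing ; of the g1.txt row meets the appended names, and the final ; is missing. The following is a minimal sketch of a variant, not part of the original answer, that aims to match the requested output exactly and also skips any string already present in the row. It assumes every g2.txt line has the form name;URL; as in the sample data. The names join_rows and emit are hypothetical, and FNR == NR is used in place of ARGIND so the script should also run on awk implementations other than GNU awk.

```shell
#!/bin/sh
# Hypothetical helper: join_rows G2_FILE G1_FILE
# Prints the rows of G1_FILE in their original order, with the names from
# G2_FILE appended, and each ";"-delimited string at most once per row.
join_rows() {
    awk '
    # Append s to the output row unless it was already emitted for this row.
    function emit(s) {
        if (s in seen) return
        seen[s] = 1
        out = (count++ ? out OFS : "") s
    }

    BEGIN { OFS = FS = ";" }

    # First file (g2.txt): map each URL ($2) to the names ($1) that use it.
    FNR == NR {
        map[$2] = ($2 in map ? map[$2] OFS : "") $1
        next
    }

    # Second file (g1.txt): rows are printed as read, so order is preserved.
    {
        trailing = sub(/;$/, "")   # remember and strip a trailing ";"
        split("", seen)            # reset the per-row set of emitted strings
        out = ""; count = 0
        n = NF
        for (i = 1; i <= n; i++)   # original fields first
            emit($i)
        for (i = 1; i <= n; i++)   # then the names mapped to each URL
            if ($i in map) {
                m = split(map[$i], names, OFS)
                for (j = 1; j <= m; j++)
                    emit(names[j])
            }
        print out (trailing ? ";" : "")
    }
    ' "$1" "$2"
}
```

With the function defined, join_rows g2.txt g1.txt > g3.txt would write the result to g3.txt. As in the answer, g2.txt is passed first so that the URL-to-name map is complete before any g1.txt row is processed.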