awk: preserve row order and remove duplicate strings (mirrors) when generating data


Problem description

I have two text files:

g1.txt

alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;

g2.txt

Jack to ride.zip;http://alfa.org;
JKr.rui.rar;http://gamma.org;
Nofj ogk.png;http://gamma.org;

I use this command to run my awk script:

awk -f ./join2.sh g1.txt g2.txt > "g3.txt"

I get this output:

Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;;Jack to ride.zip;http://alfa.org;JKr.rui.rar;http://gamma.org;Nofj ogk.png;http://gamma.org;
alfa beta;www.google.com;

What are the problems?

1. Row order is not preserved. For example, in the output file g3.txt the line alfa beta;www.google.com; comes after the line Light ..., when it should come first, as it does in g1.txt.
2. The Light ... line contains many mirror (duplicate) strings, as you can see in g3.txt. The strings

http://alfa.org
http://gamma.org
http://gamma.org

are repeated in the same line.

What output do I want? Like this:

alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;Jack to ride.zip;JKr.rui.rar;Nofj ogk.png;

First: I tried to implement a function that checks whether a row contains identical strings. For example, in my output row Light Dweller - CR, Technical Metal ... the strings http://alfa.org and http://gamma.org each appear more than once. I don't want this. I want each string enclosed within ; delimiters to appear only once in each row.
This rule should apply only to the output file, g3.txt.

Second: I want the original order of the rows in g1.txt to be maintained in the g3.txt output file. For example, in g1.txt I have

alfa beta ... 
Light Dweller ... 

but my script returns a different ordering

Light Dweller ...
alfa beta ... 

I want to prevent the rows from being reordered.

My join2.sh script is this:

#! /usr/bin/awk  -f

BEGIN {
  OFS=FS=";"
  C=0;
}
{
  if (ARGIND == 1) {
     X = $NF
     T0[$NF] = C++
     $NF = ""
     if (T1[X]) {
        T1[X] = T1[X] $0
     } else {
        T1[X] = $0
     }
  } else {
     X = $NF
     T0[$NF] = C++
     $NF = ""
     if (T2[X]) {
        T2[X] = T2[X] $0
     } else {
        T2[X] = $0
     }
  }
}

END {
  for (X in T0) {
    # concatenate T1[X] and X, since T1[X] ends with ";"
    print T1[X]  X, T2[X]
  }
}

Solution:

Recommended answer

You should process g2.txt first, like this:

cat join2.awk

BEGIN {
  OFS=FS=";"
}
ARGIND == 1 {
   map[$2] = ($2 in map ? map[$2] OFS : "") $1
   next
}
{
   r = $0;
   for (i=1; i<=NF; ++i)
      if ($i in map)
         r = r OFS map[$i]
   $0 = r
}
1

Then use the join2.awk script as:

awk -f join2.awk g2.txt g1.txt

alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;;Jack to ride.zip;JKr.rui.rar;Nofj ogk.png
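
Row order is preserved here because each g1.txt line is printed as it is read, instead of being collected and printed from a `for (X in T0)` loop, whose traversal order awk leaves unspecified. The output above still differs slightly from the requested one: it contains an empty field (`;;`) and has no trailing `;`. The following is a hedged sketch, not part of the original answer, that rebuilds each row from its unique, non-empty fields. The helper `add`, the `seen` array, and the choice to keep a trailing `;` only when the input row had one are assumptions made for illustration; `FNR == NR` is used in place of gawk's `ARGIND == 1` and assumes g2.txt is not empty.

```sh
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'alfa beta;www.google.com' \
  'Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;' > "$tmp/g1.txt"
printf '%s\n' 'Jack to ride.zip;http://alfa.org;' \
  'JKr.rui.rar;http://gamma.org;' \
  'Nofj ogk.png;http://gamma.org;' > "$tmp/g2.txt"

result=$(awk '
BEGIN { FS = OFS = ";" }

# First file (g2.txt): collect the names listed for each URL.
FNR == NR {
    if ($2 != "")
        map[$2] = ($2 in map ? map[$2] OFS : "") $1
    next
}

# Second file (g1.txt): rebuild the row, skipping empty and repeated fields.
{
    split("", seen)              # portable way to empty the per-row set
    trailing = ($0 ~ /;$/)       # remember whether the row ended with ";"
    out = ""
    for (i = 1; i <= NF; i++)    # original fields first, in order
        add($i)
    for (i = 1; i <= NF; i++)    # then the names mapped to each URL
        if ($i in map) {
            m = split(map[$i], names, OFS)
            for (j = 1; j <= m; j++)
                add(names[j])
        }
    if (trailing)
        out = out OFS
    print out
}

# Append s to the output row unless it is empty or already present.
function add(s) {
    if (s == "" || (s in seen))
        return
    seen[s] = 1
    out = (out == "" ? s : out OFS s)
}
' "$tmp/g2.txt" "$tmp/g1.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because uniqueness is tracked per row, a string may still appear in several different rows, which matches the rule stated in the question.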
