Remove duplicates from 2 filtered files with awk


Problem description

I have 2 source files (an English file and an Italian file) with the same number of lines, and I run an awk command to remove all lines from the IT.txt file that have more than 2 words.

EN.txt
Santa Claus
Pigs don't fly
The son of the father
Elf
Santa Claus
Elf
Sabatons
Shoes

IT.txt
Babbo Natale
I maiali non volano
Il figlio del padre
Elfo
Babbo Natale
Elfo
Scarpe
Scarpe

So basically I get this kind of output:

EN.txt
Santa Claus
Pigs don't fly
The son of the father
Elf
Santa Claus
Elf
Sabatons
Shoes

IT.txt
Babbo Natale
Elfo
Babbo Natale
Elfo
Scarpe
Scarpe

But at the same time, I'd like to remove the corresponding strings from the EN.txt file. For a moment I thought I could work on the line numbers (then I found a better solution) rather than running another awk command that removes strings with more than 2 words from the EN file in the same way, because a translation can differ from its source string (for example, by having more words). So the filter must be applied to the IT file only, and the EN file must follow whatever the command does to the IT file. Therefore, my filtered output must look like this:

EN.txt
Santa Claus
Elf
Santa Claus
Elf
Sabatons
Shoes

IT.txt
Babbo Natale
Elfo
Babbo Natale
Elfo
Scarpe
Scarpe

This is the command I tried (suggested in a previous question), and it works perfectly:

awk 'NR==FNR{if(NF>3){a[NR]}else{a[NR]=1;print > "filtered_it.txt"}} NR!=FNR && a[FNR]{print > "filtered_en.txt"}' IT.txt EN.txt

But now I'd like to extend this command so that it also removes duplicates and produces the output below, while being careful with lines that have the same Italian translation but different source strings (like Sabatons and Shoes, both translated as Scarpe). In short, I need to remove duplicates from both files at the same time (somehow), not from each file separately by running a command on each one.

EN.txt
Santa Claus
Elf
Sabatons
Shoes

IT.txt
Babbo Natale
Elfo
Scarpe
Scarpe

Solution

Your spec is very confusing, but I think this is what you wanted. Also, instead of operating on two files, if they are supposed to be matched line by line, it's easier to join them line by line first.

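$ paste EN.txt IT.txt |
    awk -F'\t' '{ n = split($1, _, " ")      # number of words in the English line
                  m = split($2, _, " ") }    # number of words in the Italian line
                # keep pairs where both sides have at most 2 words,
                # and only the first occurrence of each EN/IT pair
                n < 3 && m < 3 && !a[$0]++ { print $1 > "f_EN.txt"
                                             print $2 > "f_IT.txt" }'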

$ cat f_EN.txt 
Santa Claus
Elf
Sabatons
Shoes

$ cat f_IT.txt   
Babbo Natale
Elfo
Scarpe
Scarpe
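
If the word count should be checked on the Italian side only, as the question asks, a variant without paste might look like the sketch below. It is only an illustration under assumptions, not part of the answer above: the array names en and seen are made up here, and the two printf lines simply recreate the sample files from the question so the example runs on its own.

```shell
# Recreate the sample inputs from the question (for illustration only).
printf '%s\n' 'Santa Claus' "Pigs don't fly" 'The son of the father' 'Elf' \
              'Santa Claus' 'Elf' 'Sabatons' 'Shoes' > EN.txt
printf '%s\n' 'Babbo Natale' 'I maiali non volano' 'Il figlio del padre' 'Elfo' \
              'Babbo Natale' 'Elfo' 'Scarpe' 'Scarpe' > IT.txt

awk '
    # First file (EN.txt): remember every English line by its line number.
    NR == FNR { en[FNR] = $0; next }

    # Second file (IT.txt): keep the line if it has at most 2 words and
    # this exact English/Italian pair has not been printed before.
    NF < 3 && !seen[en[FNR] "\t" $0]++ {
        print en[FNR] > "f_EN.txt"
        print         > "f_IT.txt"
    }
' EN.txt IT.txt
```

Because the duplicate check uses the English and Italian lines together as the key, Sabatons/Scarpe and Shoes/Scarpe count as different pairs and both survive, while the repeated Santa Claus/Babbo Natale and Elf/Elfo pairs are written only once. Unlike the paste version, the English word count is never looked at here.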

P.S. Either you believe time travel is possible, or you are using "tomorrow" instead of "yesterday" :)
