Bash: Keep first line with duplicate values in column X

Problem description

I have a file with a few thousand lines and 20+ columns. I now want to identify the lines that have the same e-mail address in column 3 as in other lines BUT only keep the first line with this e-mail address.

file: (First Name; Last Name; E-Mail; ...)

Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Jennifer;Lopez;jennifer@lopez.com
Andre;Agassi;tom@boyden.com
Paul;Walker;paul@walker.com

For every e-mail duplicate in column 3 I only want to keep the FIRST line. I don't want to keep the lines where the e-mail address is unique.

In this case the expected output would be

Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com

If I use

awk -F';' '!seen[$3]++' file

I will lose the first instance of the e-mail address, in this case line 1 and 2 and will keep ONLY the duplicates. What I'm looking for is basically the exact opposite: lose all duplicates but keep only the first instance.

A solution with awk would be great but I can't figure out how to also keep the first line (not ONLY the duplicates). Does anyone know how to do that?

Thanks, Patrick

Recommended answer

Use Perl to print, in input order, the first line for each email that occurs more than once in the input. As per the OP's comment:

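"I'm only looking for the first line that contains a duplicated e-mail. In this case, I want to get rid of all lines with an e-mail address that occurs only once / is unique. So no paul@walker.com and no jennifer@lopez.com."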

# Create the input file:

cat > in.txt <<EOF
Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Jennifer;Lopez;jennifer@lopez.com
Andre;Agassi;tom@boyden.com
Paul;Walker;paul@walker.com
EOF

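perl -F';' -lane '
    my $email = $F[2];
    # First time this email is seen: remember its line and its input order.
    unless ( $seen{$email}++ ) {
        $line_for{$email} = $_;
        push @emails, $email;
    }
    END {
        # Print the stored first line only for emails seen more than once.
        for my $email ( @emails ) {
            print $line_for{$email} if $seen{$email} > 1;
        }
    }
' in.txt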

Prints:

Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com


The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F';' : Split into @F on semicolon, rather than on whitespace.

%seen : hash with keys = emails, and values = number of occurrences.
$seen{ $F[2] }++ : increment by 1 the number of occurrences of the email (3rd field, index of this field is 2). Before the email is seen for the first time, the value is undef and it evaluates to false in boolean context. So the original input line is stored in the hash element: $line_for{$email} = $_;, and the email is stored in the array @emails, in the order of appearance in the input. After the email has been seen, its value is 1 or more, and evaluates to true. So the line is not stored.
END { ... } : Execute the code after all input has been read, before exiting.
print $line_for{$email} if $seen{$email} > 1; : If the number of occurrences of the email is more than one (if it is a duplicate), print the original line for this email, the first one that was found in the input.
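
Since the question asked for awk, the same counting idea can also be sketched there. This is not part of the original answer but one possible approach, assuming the input is a regular file that can be read twice (it will not work on a pipe). The temporary file and the array names count and printed are illustrative. A single-pass `!seen[$3]++` filter is not enough here, because it keeps the first line of every address, unique ones included, so the occurrences have to be counted first:

```shell
# Illustrative sample file with the same data as in the question.
input=$(mktemp)
cat > "$input" <<'EOF'
Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Jennifer;Lopez;jennifer@lopez.com
Andre;Agassi;tom@boyden.com
Paul;Walker;paul@walker.com
EOF

# Pass 1 (NR == FNR): count how often each e-mail in column 3 occurs.
# Pass 2: print a line only if its e-mail occurs more than once
#         and no line for that e-mail has been printed yet.
result=$(awk -F';' '
    NR == FNR { count[$3]++; next }
    count[$3] > 1 && !printed[$3]++
' "$input" "$input")

echo "$result"
rm -f "$input"
```

On the sample data this prints the Mike;Tyson and Tom;Boyden lines, matching the expected output. Reading the file twice keeps memory use to one counter per distinct e-mail, whereas the Perl version reads the input once and stores the first line for each e-mail.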

See also:

perldoc perlrun: how to execute the Perl interpreter: command line switches
