Uniq in awk; removing duplicate values in a column using awk
Question
I have a large data file in the following format:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab-separated. Multiple values within a column are comma-separated. I would like to remove the duplicate values in the second column, to get something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code, but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
    split($2, valueArray,",");
    j=0;
    for (i in valueArray)
    {
        if (!( valueArray[i] in duplicateArray))
        {
            duplicateArray[j] = valueArray[i];
            j++;
        }
    };
    printf $1 "\t";
    for (j in duplicateArray)
    {
        if (duplicateArray[j]) {
            printf duplicateArray[j] ",";
        }
    }
    printf "\t";
    print $3
}' knownGeneFromUCSC.txt
How can I correctly remove the duplicates in column 2?
Recommended answer
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intended. If so, you should put it back.
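As a small illustration of what an NR==2 pattern does, here is a minimal sketch. The function name print_second_record and the sample input are made up for this example and are not part of the original script.

```shell
# Hypothetical helper: the NR == 2 pattern makes the action run
# only for the second input record; all other records are skipped.
print_second_record() {
    awk 'NR == 2 { print "record " NR ": " $0 }'
}

printf 'first\nsecond\nthird\n' | print_second_record
# prints: record 2: second
```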
The in operator checks for the presence of an index, not a value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This avoids having to iterate over both arrays with a loop inside a loop.
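The difference between testing an index and testing a value can be seen in a short sketch. The function name demo_in_operator and the array names byPosition and seen are hypothetical and chosen only for this illustration.

```shell
# Hypothetical demo: "in" looks at array indices, never at stored values.
demo_in_operator() {
    awk 'BEGIN {
        # split() stores the names as VALUES under the indices 1 and 2
        split("WDR78,WDR32", byPosition, ",")
        if ("WDR78" in byPosition)
            print "found in byPosition"
        else
            print "not found in byPosition"

        # storing the name as an INDEX makes the membership test work
        seen["WDR78"] = 1
        if ("WDR78" in seen)
            print "found in seen"
    }'
}

demo_in_operator
```

With the names stored as values, the first test fails; once a name is used as an index, the same test succeeds.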
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three, because the trailing comma produces an empty fourth value. I therefore added an if to keep it from printing that empty value; without the if, the output would be ",WDR78,".
* In reality, all arrays in awk are associative.
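To check the split behaviour described above, a sketch like the following prints the number of pieces split produces. The function name count_fields is an assumption made for this example.

```shell
# Hypothetical helper: report how many pieces split() yields for a
# comma-separated string passed as the first argument.
count_fields() {
    printf '%s\n' "$1" | awk '{ n = split($0, parts, ","); print n }'
}

count_fields "WDR78,WDR78,WDR78,"   # prints 4: the trailing comma adds an empty piece
count_fields "WDR32"                # prints 1
```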
awk '
BEGIN { FS = "\t" }
{
    split($2, valueArray, ",")
    for (i in valueArray)
    {
        # index duplicateArray by value so that "in" can test for it
        if (!(valueArray[i] in duplicateArray))
        {
            duplicateArray[valueArray[i]] = 1
        }
    }
    # use an explicit format so a "%" in the data is not treated as a format
    printf "%s\t", $1
    for (j in duplicateArray)
    {
        if (j) # prevents printing an extra comma
        {
            printf "%s,", j
        }
    }
    printf "\t"
    print $3
    delete duplicateArray # for non-gawk, use split("", duplicateArray)
}'
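One caveat with the script above: awk does not specify the order in which for (j in duplicateArray) visits indices, and the output keeps a trailing comma, whereas the desired output in the question shows the values in their original order without one. The following is a sketch of an alternative, not the answerer's method. It walks the pieces by position and rebuilds column 2. The function name dedupe_col2 and the variable names are hypothetical.

```shell
# Hypothetical variant: de-duplicate column 2, keeping first-seen order
# and omitting empty values and the trailing comma.
dedupe_col2() {
    awk '
    BEGIN { FS = OFS = "\t" }
    NF >= 2 {
        n = split($2, values, ",")
        split("", seen)                 # portable way to clear the array
        out = ""
        for (i = 1; i <= n; i++) {      # numeric loop preserves input order
            v = values[i]
            if (v == "" || (v in seen))
                continue                # skip empty pieces and repeats
            seen[v] = 1
            out = (out == "" ? v : out "," v)
        }
        $2 = out                        # rebuilds the record using OFS (tab)
    }
    { print }
    ' "$@"
}

printf 'ENST00000400908\tRERE,KIAA0458,\tatrophin-1 like protein\n' | dedupe_col2
```

It can be called with a file name, for example dedupe_col2 knownGeneFromUCSC.txt, or fed from a pipe as shown. Lines with fewer than two tab-separated fields are passed through unchanged.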