Uniq in awk; removing duplicate values in a column using awk


Question

I have a large data file in the following format:

ENST00000371026 WDR78,WDR78,WDR78,  WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458,  atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

The columns are tab-separated, and multiple values within a column are comma-separated. I would like to remove the duplicate values in the second column, to get something like this:

ENST00000371026 WDR78   WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458   atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

I tried the following code, but it doesn't seem to remove the duplicate values.

awk ' 
BEGIN { FS="\t" } ;
{
  split($2, valueArray,",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $3

}' knownGeneFromUCSC.txt

How can I remove the duplicates in column 2 correctly?

Recommended answer

Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.

The in operator checks for the presence of an index, not a value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This avoids having to iterate over both arrays in a loop within a loop.
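As a minimal sketch of that point, the snippet below stores the same string once as a value and once as an index, then tests both arrays with in. The function name in_demo and the array names byPosition and byValue are made up for illustration and are not part of the original script.

```shell
# in_demo: show that awk's `in` operator tests array indices, not values.
in_demo() {
  awk 'BEGIN {
    byPosition[1] = "WDR78"      # value stored under a numeric index
    byValue["WDR78"] = 1         # value used as the index itself

    # "WDR78" is a value of byPosition, not an index, so this is false
    print (("WDR78" in byPosition) ? "found" : "not found")
    # "WDR78" is an index of byValue, so this is true
    print (("WDR78" in byValue) ? "found" : "not found")
  }'
}

in_demo
```

This prints "not found" followed by "found", which is why the original membership test never matched anything.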

The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three, because the trailing comma produces an empty final element. I added an if to keep that empty value from being printed; without it, the output would be ",WDR78,".
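A quick way to confirm the field count is to print the return value of split for that exact string. This is only an illustrative check; the function name split_demo and the array name parts are assumptions, not names from the original script.

```shell
# split_demo: count the pieces split() produces for a comma-terminated list.
split_demo() {
  awk 'BEGIN {
    n = split("WDR78,WDR78,WDR78,", parts, ",")
    print n                       # number of pieces, including the empty one
    print (parts[n] == "" ? "last piece is empty" : "last piece is not empty")
  }'
}

split_demo
```

This prints 4 and then "last piece is empty": the text after the final comma counts as a separate, empty element.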

* In reality, all arrays in AWK are associative.

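awk '
BEGIN { FS="\t" } ;
{
  # split column 2 on commas; a trailing comma yields an empty last element
  split($2, valueArray,",");
  for (i in valueArray)
  { 
    if (!(valueArray[i] in duplicateArray))
    { 
      # use the value itself as the index so repeats collapse into one entry
      duplicateArray[valueArray[i]] = 1
    }
  };
  # pass data as an argument, not as the format string, in case it contains %
  printf "%s\t", $1;
  for (j in duplicateArray)
  {
    if (j != "")    # prevents printing an extra comma
    {
      printf "%s,", j;
    }
  }
  printf "\t";
  print $3
  delete duplicateArray    # for non-gawk, use split("", duplicateArray)
}'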
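
One caveat with the script above: awk does not define the order in which for (j in duplicateArray) visits indices, and each value is printed with a trailing comma. If the first-seen order matters and the trailing comma is unwanted, as in the desired output shown in the question, a variant along the following lines may help. It is a sketch rather than the answer's method: the function name dedupe_col2 and the seen, values and out variables are hypothetical, and the sample rows are shortened versions of the question's data.

```shell
# dedupe_col2: remove duplicate comma-separated values from column 2,
# keeping first-seen order. Names here are illustrative only.
dedupe_col2() {
  awk '
  BEGIN { FS = OFS = "\t" }
  {
    n = split($2, values, ",")
    out = ""
    split("", seen)                         # portable way to empty an array
    for (i = 1; i <= n; i++) {              # numeric loop keeps input order
      v = values[i]
      if (v == "" || (v in seen)) continue  # skip empty pieces and repeats
      seen[v] = 1
      out = (out == "" ? v : out "," v)     # join without a trailing comma
    }
    $2 = out                                # rebuild the record using OFS
    print
  }' "$@"
}

# Sample rows abbreviated from the question (tab-separated columns)
printf 'ENST00000371026\tWDR78,WDR78,WDR78,\tWD repeat domain 78 isoform 1\n' | dedupe_col2
printf 'ENST00000400908\tRERE,KIAA0458,\tatrophin-1 like protein isoform a\n' | dedupe_col2
```

Because the function passes "$@" through to awk, it can also be called with a file name, for example dedupe_col2 knownGeneFromUCSC.txt.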
