Uniq in awk; removing duplicate values in a column using awk

Views: 117 | Published: 2016/7/28 14:57:46 | Tags: bash, awk, unique

Problem description

I have a large data file in the following format:

ENST00000371026 WDR78,WDR78,WDR78,  WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458,  atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

The columns are tab-separated. Multiple values within a column are comma-separated. I would like to remove the duplicate values in the second column, to get something like this:

ENST00000371026 WDR78   WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458   atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

I tried the following code, but it doesn't seem to remove the duplicate values.

awk ' 
BEGIN { FS="\t" } ;
{
  split($2, valueArray,",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $3

}' knownGeneFromUCSC.txt

How can I remove the duplicates in column 2 correctly?

Recommended answer

Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.

The in operator checks for the presence of an index, not a value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This avoids having to iterate over both arrays in a loop within a loop.
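As a minimal sketch of that point (the function name in_demo and the array names arr and seen are made up for illustration), the following shows that in looks only at indices, which is why storing a value under a numeric index makes the membership test fail:

```shell
# Hypothetical demo: "in" tests array indices, never the stored values.
in_demo() {
  awk 'BEGIN {
    arr[0] = "WDR78"                                     # value stored under index 0
    print (("WDR78" in arr) ? "found" : "not found")     # "WDR78" is a value here, not an index
    print ((0 in arr) ? "found" : "not found")           # 0 is an index
    seen["WDR78"] = 1                                    # now the value itself is the index
    print (("WDR78" in seen) ? "found" : "not found")
  }'
}

in_demo
```

The first test fails and the other two succeed, which matches the behavior of the original script: its check against duplicateArray never matched because the values were stored under the counter j.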

The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three, so I added an if to keep it from printing a null value. Without that if, ",WDR78," would be printed.
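The field count can be checked directly. This is a small sketch (split_demo and parts are illustrative names, not part of the original answer) that prints how many pieces split produces for a string with a trailing comma, and what the last piece contains:

```shell
# Hypothetical demo: a trailing separator yields one extra, empty element.
split_demo() {
  awk 'BEGIN {
    n = split("WDR78,WDR78,WDR78,", parts, ",")
    print n                                        # number of elements produced
    print (parts[n] == "" ? "empty" : parts[n])    # content of the last element
  }'
}

split_demo
```

It reports 4 elements, the last of which is empty, so any loop over the result needs to skip the empty string.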

* In reality, all arrays in awk are associative.

awk '
BEGIN { FS="\t" }
{
  split($2, valueArray, ",")
  for (i in valueArray)
  {
    # "in" tests indices, so each value is stored as an index
    if (!(valueArray[i] in duplicateArray))
    {
      duplicateArray[valueArray[i]] = 1
    }
  }
  printf "%s\t", $1    # explicit format, so a % in the data is printed literally
  for (j in duplicateArray)
  {
    if (j)    # prevents printing an extra comma
    {
      printf "%s,", j
    }
  }
  printf "\t"
  print $3
  delete duplicateArray    # for non-gawk, use split("", duplicateArray)
}' knownGeneFromUCSC.txt
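Two details of the script above may matter for the output shown in the question: for (j in duplicateArray) visits indices in an unspecified order, and every value is followed by a comma. The following is a sketch of one possible variant, not the answer author's method, that keeps the first-seen order and drops the trailing comma. The function name dedupe_col2 and the variables values, seen, and out are assumptions introduced here.

```shell
# Hypothetical helper: de-duplicate the comma-separated values in column 2,
# keeping first-seen order. Reads the named files, or stdin if none are given.
dedupe_col2() {
  awk '
  BEGIN { FS = OFS = "\t" }
  {
    n = split($2, values, ",")
    split("", seen)                 # portable way to clear the array per record
    out = ""
    for (i = 1; i <= n; i++) {      # numeric loop preserves the original order
      v = values[i]
      if (v == "" || (v in seen))   # skip empty pieces and repeats
        continue
      seen[v] = 1
      out = (out == "" ? v : out "," v)
    }
    $2 = out                        # assigning a field rebuilds the record with OFS
    print
  }' "$@"
}

# Example run on one line shaped like the sample data (tabs written as \t)
printf 'ENST00000371026\tWDR78,WDR78,WDR78,\tWD repeat domain 78 isoform 1\n' | dedupe_col2
```

Called as dedupe_col2 knownGeneFromUCSC.txt, it would process the file from the question. Because it assigns to $2 rather than printing three fields by hand, any additional tab-separated columns pass through unchanged.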
