Awk: get unique elements from an array


Question

file.txt:

INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L

I am trying to create a hyperlink by transforming the content of a file. The hyperlink has the following style:

somelink&gene=<gene>[&gene=<gene>]&mutation=<gene:key>[&mutation=<gene:key>]

where INTS11:P446P, for example, corresponds to gene:key.

The problem is that I loop over each row to build an array whose values are the genes, so the same gene can end up with multiple duplicate entries.

My attempt so far:

  1. Split on & and store the result in a
  2. For each element of a, split on : and store the parts in array b

The problem is that I don't know how to get unique values from my array. I found this question, but it is about files rather than arrays as in my case.

Code:

awk '@include "join"
    {
    split($0,a,"&")
    for ( i = 1; i <= length(a); i++ ) {
        split(a[i], b, ":");
        genes[i] = "&gene="b[1];
        keys[i] = "&mutation="b[1]":"b[2]
    }
    print "somelink"join(genes, 1, length(genes),SUBSEP)join(keys, 1, length(keys),SUBSEP)
    delete genes
    delete keys
}' file.txt

This outputs:

somelink&gene=INTS11&gene=INTS11&gene=INTS11&gene=INTS11&gene=INTS11&mutation=INTS11:P446P&mutation=INTS11:P449P&mutation=INTS11:P518P&mutation=INTS11:P547P&mutation=INTS11:P553P
somelink&gene=PLCH2&gene=PLCH1&gene=PLCH2&mutation=PLCH2:A1007int&mutation=PLCH1:D987int&mutation=PLCH2:P977L

I would like to obtain something like the following (note how many &gene= entries there are):

somelink&gene=INTS11&mutation=INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
somelink&gene=PLCH2&gene=PLCH1&mutation=PLCH2:A1007int&mutation=PLCH1:D987int&mutation=PLCH2:P977L

My problem was partly solved thanks to Pierre Francois's answer, which pointed out SUBSEP. My remaining issue is that I want only the unique elements from my arrays genes and keys.

Thanks.

Recommended answer

Assuming you want to remove the spaces between the fields concatenated by the join function of awk, the fourth argument you pass to join must be the special value SUBSEP, not an empty string "" as you did. Try:

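awk '@include "join"
    {
    # Split the line into gene:key pairs
    split($0,a,"&")
    for ( i = 1; i <= length(a); i++ ) {
        # b[1] is the gene, b[2] is the key
        split(a[i], b, ":");
        genes[i] = "&gene="b[1];
        keys[i] = "&mutation="b[1]":"b[2]
    }
    # SUBSEP as separator makes join concatenate with nothing in between
    print "somelink"join(genes, 1, length(genes),SUBSEP)join(keys, 1, length(keys),SUBSEP)
    # Reset both arrays before the next line
    delete genes
    delete keys
}' file.txt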
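
The snippet above fixes the separator but still prints one &gene= entry per pair. For the uniqueness part of the question, one common awk idiom is to record each value as an index of an associative array and append it only the first time it is seen. The following is a minimal sketch under that assumption, not the answer author's method. The names build_links, seen_gene and seen_mut are hypothetical, and the strings are built by plain concatenation instead of join, so it should not need the join library. It emits &mutation= before every pair, following the link style given in the question.

```shell
# Hypothetical helper: reads lines such as GENE:KEY&GENE:KEY on stdin and
# prints one link per line, listing each gene and each mutation only once.
build_links() {
    awk '{
        n = split($0, pairs, "&")            # pairs[1..n] hold the gene:key items
        split("", seen_gene)                 # empty both lookup arrays for this line
        split("", seen_mut)
        genes = ""; muts = ""
        for (i = 1; i <= n; i++) {
            split(pairs[i], f, ":")          # f[1] is the gene, f[2] is the key
            if (!seen_gene[f[1]]++)          # true only the first time a gene appears
                genes = genes "&gene=" f[1]
            if (!seen_mut[pairs[i]]++)       # same check for the whole gene:key pair
                muts = muts "&mutation=" pairs[i]
        }
        print "somelink" genes muts
    }'
}

# Example with the second line of file.txt from the question
printf '%s\n' 'PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L' | build_links
```

To process the whole file, the function can be fed with build_links < file.txt. Genes and mutations keep the order of their first appearance because the strings are appended to inside the loop.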
