Awk: get unique elements from an array
Question
file.txt:
INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L
I am trying to create a hyperlink by transforming the contents of a file. The hyperlink has the following form:
somelink&gene=<gene>[&gene=<gene>]&mutation=<gene:key>[&mutation=<gene:key>]
where, for example, INTS11:P446P corresponds to gene:key.
The problem is that I loop over each row to build an array containing the genes as values, so the same gene can appear multiple times.
My attempt is as follows:
- Split on & and store the pieces in a
- For each element of a, split a[i] on : into array b
The problem is that I don't know how to get unique values from my array. I found this question but it talks about files and not arrays like in my case.
Code:
awk '@include "join"
{
    split($0,a,"&")
    for ( i = 1; i <= length(a); i++ ) {
        split(a[i], b, ":");
        genes[i] = "&gene="b[1];
        keys[i] = "&mutation="b[1]":"b[2]
    }
    print "somelink"join(genes, 1, length(genes),SUBSEP)join(keys, 1, length(keys),SUBSEP)
    delete genes
    delete keys
}' file.txt
This outputs:
somelink&gene=INTS11&gene=INTS11&gene=INTS11&gene=INTS11&gene=INTS11&mutation=INTS11:P446P&mutation=INTS11:P449P&mutation=INTS11:P518P&mutation=INTS11:P547P&mutation=INTS11:P553P
somelink&gene=PLCH2&gene=PLCH1&gene=PLCH2&mutation=PLCH2:A1007int&mutation=PLCH1:D987int&mutation=PLCH2:P977L
I would like to obtain something like this instead (note how many &gene= entries there are):
somelink&gene=INTS11&mutation=INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
somelink&gene=PLCH2&gene=PLCH1&mutation=PLCH2:A1007int&mutation=PLCH1:D987int&mutation=PLCH2:P977L
My problem was partly solved thanks to Pierre Francois's answer, which was to use SUBSEP. My remaining issue is that I want only the unique elements from my arrays genes and keys.
Thank you.
Recommended answer
Supposing you want to remove the spaces between the fields concatenated with awk's join function, the 4th argument you have to pass to join is the special value SUBSEP, not an empty string "" as you did. Try:
awk '@include "join"
{
    split($0,a,"&")
    for ( i = 1; i <= length(a); i++ ) {
        split(a[i], b, ":");
        genes[i] = "&gene="b[1];
        keys[i] = "&mutation="b[1]":"b[2]
    }
    print "somelink"join(genes, 1, length(genes),SUBSEP)join(keys, 1, length(keys),SUBSEP)
    delete genes
    delete keys
}' file.txt
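The code above removes the separators but still repeats each gene. A common awk idiom for keeping only unique values is an associative array used as a "seen" set, testing !seen[x]++ before appending. The following is only a sketch of that idiom, not part of the original answer: the function name make_links and the arrays seen_gene and seen_key are hypothetical, and it builds the output by plain string concatenation instead of join, so it does not need @include.

```shell
# Hypothetical helper: reads lines such as "GENE:KEY&GENE:KEY" from the
# given files (or stdin) and prints one link per line, listing each gene
# and each gene:key pair only once, in order of first appearance.
make_links() {
  awk '{
    n = split($0, a, "&")            # a[1..n] holds the gene:key pairs
    genes = ""; keys = ""
    delete seen_gene                 # reset the seen-sets for every row
    delete seen_key
    for (i = 1; i <= n; i++) {
      split(a[i], b, ":")            # b[1] = gene, b[2] = key
      if (!seen_gene[b[1]]++)        # true only the first time a gene appears
        genes = genes "&gene=" b[1]
      if (!seen_key[a[i]]++)         # same test for the whole gene:key pair
        keys = keys "&mutation=" a[i]
    }
    print "somelink" genes keys
  }' "$@"
}
```

Called as make_links file.txt, the second sample row should come out as somelink&gene=PLCH2&gene=PLCH1&mutation=PLCH2:A1007int&mutation=PLCH1:D987int&mutation=PLCH2:P977L. The post-increment is what makes the test work: the element is 0 (false) on first access and non-zero afterwards.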