Passing a bash variable as an awk column specifier
Question
There are loads of threads about passing a shell variable to awk, and I've figured that out easily enough, but the variable I want to pass is the column specifier ($1, $2, etc.).
Given that the shell also uses these names for its positional command-line arguments, this is getting confusing.
In this script I'm just sorting and joining 2 files together, but in order to begin generalising the script a little, I want to be able to specify on the command line which field in the key file awk should take as its sort specifier.
What am I doing wrong here? (I'm only just getting to grips with awk, and the one-liner was adapted slightly from here.)
keyfile="$1"
filetosort="$2"
field="$3"
awk -v a="$field"
paste "$keyfile" <(awk 'NR==FNR{o[FNR]=a; next} {t[$1]=$0} END{for(x=1; x<=FNR; x++){y=o[x]; print t[y]}}' $keyfile $filetosort)
EDIT: Added example input/output
Keyfile: (10 random lines from the file)
PVClumt18 PAK_2199 PAK_01997
PVClopt2 PAK_2091 PAK_01895
PVCcif7 PAK_1975 PAK_01793
PVClopT12 PAU_02101 PAU_02063
PVCpnf20 PAK_3524 PAK_03184
PVClopt3 PAK_2090 PAK_01894
PVClopT11 PAU_02102 PAU_02064
PVCunit2_11 plu1698 PLT_01726
PVClumT9 afp10 PAU_02198
PVCunit2_17 plu1692 PLT_01720
File to sort:
PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A
PAK_01997 5ftj 5ftj_A 99.9 1.6e-26 4.2e-31 229.2 >5ftj_A Transitional endoplasmic reticulum ATPase; hydrolase, single-particle, AAA ATPase; HET: ADP OJA; 2.30A {Homo sapiens} PDB: 3cf1_A* 3cf3_A* 3cf2_A* 5ftk_A* 5ftl_A* 5ftm_A* 5ftn_A* 1r7r_A* 5c19_A 5c1b_A* 5c18_A* 3cf0_A*
PAK_01894 3j9q 3j9q_A 99.9 1.8e-29 4.6e-34 215.9 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PAK_03184 1xju 1xju_A 99.4 4.1e-17 1.1e-21 98.8 >1xju_A Lysozyme; secreted inactive conformation, hydrolase; 1.07A {Enterobacteria phage P1} SCOP: d.2.1.3
PAK_01793 5a3a 5a3a_A 50.8 6 0.00016 31.4 >5a3a_A SIR2 family protein; transferase, P-ribosyltransferase, metalloprotein, NAD-depen lipoylation, regulatory enzyme, rossmann fold; 1.54A {Streptococcus pyogenes} PDB: 5a3b_A* 5a3c_A*
PLT_01720 3ggm 3ggm_A 54.2 4.9 0.00013 26.2 >3ggm_A Uncharacterized protein BT9727_2919; bacillus cereus group., structural genomics, PSI-2, protein structure initiative; 2.00A {Bacillus thuringiensis serovarkonkukian}
PLT_01726 3h2t 3h2t_A 96.8 8e-06 2.1e-10 82.6 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A
PAK_01895 3j9q 3j9q_A 100.0 2.5e-35 6.4e-40 248.6 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PAU_02198 4jiv 4jiv_D 69.6 1.6 4.2e-05 27.5 >4jiv_D VCA0105, putative uncharacterized protein; PAAR-repeat motif, membrane piercing, type VI secretion SYST vibrio cholerae VGRG2; HET: PLM STE ELA; 1.90A {Vibrio cholerae o1 biovar eltor}
PAU_02063 4yap 4yap_A 31.1 20 0.00052 29.1 >4yap_A Glutathione S-transferase homolog; GSH-lyase GSH-dependent; 1.11A {Sphingobium SP} PDB: 4g10_A 4yav_A*
Thus I need to sort and match the rows based on column 3 in the keyfile and column 1 in the file to sort.
And the resulting file: (The duplication of columns 3 & 4 was something I was planning to sort out afterwards.)
PVClumt18 PAK_2199 PAK_01997 PAK_01997 5ftj 5ftj_A 99.9 1.6e-26 4.2e-31 229.2 >5ftj_A Transitional endoplasmic reticulum ATPase; hydrolase, single-particle, AAA ATPase; HET: ADP OJA; 2.30A {Homo sapiens} PDB: 3cf1_A* 3cf3_A* 3cf2_A* 5ftk_A* 5ftl_A* 5ftm_A* 5ftn_A* 1r7r_A* 5c19_A 5c1b_A* 5c18_A* 3cf0_A*
PVClopt2 PAK_2091 PAK_01895 PAK_01895 3j9q 3j9q_A 100.0 2.5e-35 6.4e-40 248.6 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVCcif7 PAK_1975 PAK_01793 PAK_01793 5a3a 5a3a_A 50.8 6 0.00016 31.4 >5a3a_A SIR2 family protein; transferase, P-ribosyltransferase, metalloprotein, NAD-depen lipoylation, regulatory enzyme, rossmann fold; 1.54A {Streptococcus pyogenes} PDB: 5a3b_A* 5a3c_A*
PVClopT12 PAU_02101 PAU_02063 PAU_02063 4yap 4yap_A 31.1 20 0.00052 29.1 >4yap_A Glutathione S-transferase homolog; GSH-lyase GSH-dependent; 1.11A {Sphingobium SP} PDB: 4g10_A 4yav_A*
PVCpnf20 PAK_3524 PAK_03184 PAK_03184 1xju 1xju_A 99.4 4.1e-17 1.1e-21 98.8 >1xju_A Lysozyme; secreted inactive conformation, hydrolase; 1.07A {Enterobacteria phage P1} SCOP: d.2.1.3
PVClopt3 PAK_2090 PAK_01894 PAK_01894 3j9q 3j9q_A 99.9 1.8e-29 4.6e-34 215.9 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVClopT11 PAU_02102 PAU_02064 PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A
PVCunit2_11 plu1698 PLT_01726 PLT_01726 3h2t 3h2t_A 96.8 8e-06 2.1e-10 82.6 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A
PVClumT9 afp10 PAU_02198 PAU_02198 4jiv 4jiv_D 69.6 1.6 4.2e-05 27.5 >4jiv_D VCA0105, putative uncharacterized protein; PAAR-repeat motif, membrane piercing, type VI secretion SYST vibrio cholerae VGRG2; HET: PLM STE ELA; 1.90A {Vibrio cholerae o1 biovar eltor}
PVCunit2_17 plu1692 PLT_01720 PLT_01720 3ggm 3ggm_A 54.2 4.9 0.00013 26.2 >3ggm_A Uncharacterized protein BT9727_2919; bacillus cereus group., structural genomics, PSI-2, protein structure initiative; 2.00A {Bacillus thuringiensis serovarkonkukian}
Recommended answer
When you pass awk -v a="$field", the assignment of the awk variable a is only good for that single awk command. You can't expect a to be available in a completely different invocation of awk.
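As a minimal sketch of that scoping rule (the sample input line and the shell variable names with_var and without_var are made up for illustration), the following runs awk twice. Only the first invocation receives -v a=..., so only there does $a pick out the requested field. In the second, a is unset, so $a is evaluated as $0 and the whole line comes back.

```shell
#!/usr/bin/env bash
# Illustrative only: a -v assignment belongs to the awk process it is passed to.
field=2

# First invocation: a is set to 2, so $a resolves to the second field.
with_var=$(echo 'foo bar baz' | awk -v a="$field" '{ print $a }')

# Second, separate invocation: a was never assigned here, so it is empty,
# $a is treated as $0, and the entire line is printed.
without_var=$(echo 'foo bar baz' | awk '{ print $a }')

echo "with -v:    $with_var"
echo "without -v: $without_var"
```

This is why a stand-alone awk -v a="$field" line, followed by a second awk call that uses a, cannot work: the assignment has to be given to the same awk command that runs the program.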
Thus, you need to put it in-place directly:
$ bashvar="2"
$ echo 'foo bar baz' | awk -v awkvar="$bashvar" '{print $awkvar}'
bar
Or, in your case:
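field=1
awk -v a="$field" '
    NR==FNR {
        o[FNR] = $a    # keyfile: store the chosen column, in file order
        next
    }
    { t[$1] = $0 }     # file to sort: index each line by its first column
    END {
        for (x = 1; x <= FNR; x++) {
            y = o[x]
            printf("%s\t%s\n", y, t[y])
        }
    }' "$keyfile" "$filetosort"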
Points of note:
- Our printf here emits both the key and the value, so there's no need to use paste to put the keyfile values back in.
- $a takes the value of the awk variable a (assigned from the shell variable field) and uses it as a field number. This is an indirect reference that looks up the relevant column.
- Always, always quote your shell variables on expansion. Otherwise, you have no way of knowing how many arguments to awk will be generated by the expansion of $keyfile: it could be 0 (if the string contains no characters other than those in IFS); it could be 1; but it could also be a completely unbounded number (input file.txt would become two arguments, input and file.txt; * input * .txt would have each * replaced with a list of files).
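The points above can be checked end to end with a small sketch. The file names, the two-row sample data and the temporary directory are assumptions made up for this illustration and are not the asker's real files. The file names deliberately contain a space, so the run only succeeds because "$keyfile" and "$filetosort" are quoted, and field is set to 3 to match the key column in the question's example.

```shell
#!/usr/bin/env bash
# Sketch only: file names and sample rows are invented for illustration.
tmpdir=$(mktemp -d)
keyfile="$tmpdir/key file.txt"      # space in the name on purpose
filetosort="$tmpdir/to sort.txt"
field=3                             # key column in the keyfile

printf '%s\n' 'k1 x B' 'k2 y A' > "$keyfile"
printf '%s\n' 'A first' 'B second' > "$filetosort"

result=$(awk -v a="$field" '
    NR==FNR { o[FNR] = $a; next }   # keyfile: remember keys in file order
    { t[$1] = $0 }                  # file to sort: index rows by column 1
    END {
        for (x = 1; x <= FNR; x++) {
            y = o[x]
            printf("%s\t%s\n", y, t[y])
        }
    }' "$keyfile" "$filetosort")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The rows of the second file come out in the keyfile's order (B, then A), each prefixed with its key. Note that the END loop relies on FNR from the last file read, so this sketch, like the code it mirrors, assumes both files have the same number of lines.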