如何基于多列中的数据合并两个文件? [英] How to merge two files based on data in multiple columns?
问题描述
我有两个单独的文件,每个文件包含不同数量的列,我想根据多个列中的数据进行合并.
I have two separate files, each containing a different number of columns which I want to merge based on data in multiple columns.
file1
VMNF01000015.1 1769465 1769675 . . - Focub_II5_mimp_1
VMNF01000014.1 3225875 3226081 . . + Focub_II5_mimp_1
VMNF01000014.1 3226046 3226081 . . - Focub_II5_mimp_1
VMNF01000014.1 3585246 3585281 . . - Focub_II5_mimp_1
VMNF01000014.1 3692468 3692503 . . - Focub_II5_mimp_1
VMNF01000014.1 3715380 3715415 . . + Focub_II5_mimp_1
VMNF01000014.1 2872478 2872511 . . - Focub_II5_mimp_1
file2
VMNF01000014.1 3225875-3226081(+) gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3226046-3226081(-) tacacacctgcgaatactttttgcatcccactgta
VMNF01000015.1 1769465-1769675(-) gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3692468-3692503(-) tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1 3715380-3715415(+) gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3585246-3585281(-) tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1 2872478-2872511(-) gtacttcagcctggattcaaacttattgcatcccactgta
首先,我想我需要在file2中创建另外2列,用-"分隔数字,并为(*)"创建新列,但是我不知道如何在不替换((- )"也.到目前为止,我一直在使用以下命令:
First, I think I need to create another 2 columns in file2, separating numbers by "-" and creating a new column for "(*)", but I cannot work out how to separate the numbers without replacing "(-)" too. So far I have been using this command:
awk '{gsub("-","\t",$2);print;}'
完成此操作后,我想将file2中的最后一列添加到file1中.我已经可以使用以下命令执行此操作:
Once this has been done, I would like to add the last column in file2 to file1. I have been able to do this using the following command:
awk 'NR==FNR {a[$1]=$3; next} {print $1,$2,$3,$4,$5,$6,$7,a[$1];}' file2 file1 > file3.
但是,数据不匹配.它是根据第1列中的条目进行匹配的.在许多情况下,第1列中的数据是相同的,因此file3的第8列中的数据仅与其中一项匹配,而与第2或3列中的数据不匹配在文件1中
However, the data does not match. It is matched based on the entry in column 1. The data in column 1 is the same in many instances, so the data in column 8 of file3 only matches one of the entries, and doesn't match the data in column 2 or 3 in file1 e.g.
file3:
VMNF01000015.1 1769465 1769675 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3225875 3226081 . . + Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3226046 3226081 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3585246 3585281 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3692468 3692503 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3715380 3715415 . . + Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 2872478 2872511 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
即使我能够分离file2的第2列中的数据,我仍然会遇到相同的问题,因为在某些情况下,第2列中的数据是相同的.我需要的是这样的代码,其内容大致如下:对第2列中的数据进行分类(请参见下文);
Even if I was able to separate the data in column 2 of file2, I would still have the same problem as the data in column 2 is the same in some instances. What I need is code that says something along the lines of: sperate the data in column 2 (see below);
VMNF01000014.1 3225875 3226081 (+) gtacttcagcctggattcaaacttattgcatcccactgta
然后:
如果文件1中的$ 1,$ 2,$ 3与文件2中的$ 1,$ 2,$ 3相匹配,则从文件1中打印$ 1,$ 2,$ 3,$ 4,$ 5,$ 6,$ 7,并从文件2中添加$ 5.
if $1,$2,$3 in file1 match $1,$2,$3 in file2, print $1,$2,$3,$4,$5,$6,$7 from file1 and add $5 from file2.
我该怎么做?我知道awk可以使用if语句,但是我不知道如何在awk中使用它们.
How can I do this? I know that awk can use if statements, but I don't know how to use them in awk.
有什么建议吗?
推荐答案
能否请您尝试以下操作.
Could you please try following.
awk '
FNR==NR{
split($2,array,"[-(]")
mainarray[$1,array[1],array[2]]=$NF
next
}
(($1,$2,$3) in mainarray){
print $0,mainarray[$1,$2,$3]
}
' Input_file2 Input_file1
第二种解决方案: 由于OP在上述代码中出现错误,因此在上面进行了一些更改.
2nd solution: Since OP is getting an error in above code so made a little change in above.
awk '
FNR==NR{
split($2,array,"[-(]")
key=$1 OFS array[1] OFS array[2]
mainarray[key]=$NF
next
}
{ key = $1 OFS $2 OFS $3 }
(key in mainarray){
print $0,mainarray[key]
}
' Input_file2 Input_file1
说明: :添加了上述代码的详细说明.
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR when Input_file2 is being read.
split($2,array,"[-(]") ##Splitting 2nd field into an array named array where delimiter is - OR (
mainarray[$1,array[1],array[2]]=$NF ##Creating mainarray index of $1,array[1],array[2] and value is current line is last field.
next ##next will skip all further statements from here.
}
(($1,$2,$3) in mainarray){ ##Checking condition if $1,$2,$3 of current line is present in mainaarray.
print $0,mainarray[$1,$2,$3] ##Printing current line with value of mainarray with index of $1,$2,$3
}
' Input_file2 Input_file1 ##Mentioning Input_file names here.
这篇关于如何基于多列中的数据合并两个文件?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!