Properly join two files based on 2 columns in common


Problem description

I have two files that I'm trying to join/merge based on columns 1 and 2. file1 (58210 lines) is much shorter than file2 (815530 lines), and I'd like to find the intersection of the two files using fields 1 and 2 as the key. They look something like this:

file1:

2L      25753   33158
2L      28813   33158
2L      31003   33158
2L      31077   33161
2L      31279   33161
3L      32124   45339
3L      33256   45339
...

file2:

2L      20242   0.5     0.307692307692308
2L      22141   0.32258064516129        0.692307692307692
2L      24439   0.413793103448276       0.625
2L      24710   0.371428571428571       0.631578947368421
2L      25753   0.967741935483871       0.869565217391304
2L      28813   0.181818181818182       0.692307692307692
2L      31003   0.36    0.666666666666667
2L      31077   0.611111111111111       0.931034482758621
2L      31279   0.75    1
3L      32124   0.558823529411765       0.857142857142857
3L      33256   0.769230769230769       0.90625
...

I've been using the following two commands, but they give different line counts:

awk 'FNR==NR{a[$1$2]=$3;next} {if($1$2 in a) print}' file1 file2 | wc -l
awk 'FNR==NR{a[$1$2]=$3;next} {if($1$2 in a) print}' file2 file1 | wc -l

I'm not sure why this happens. I've tried sorting before the comparison, in case either file has duplicate lines (based on columns 1 and 2), but it doesn't seem to help. (Any insight into why this happens would also be appreciated.)

How can I merge the files so that only the lines of file2 whose columns 1 and 2 also appear in file1 get printed, with column 3 of file1 appended, to look something like this:

2L      25753   0.967741935483871       0.869565217391304    33158
2L      28813   0.181818181818182       0.692307692307692    33158
2L      31003   0.36    0.666666666666667    33158
2L      31077   0.611111111111111       0.931034482758621    33161
2L      31279   0.75    1    33161
3L      32124   0.558823529411765       0.857142857142857    45339
3L      33256   0.769230769230769       0.90625    45339

Solution

awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2

Look:

$ cat file1
2L      5753   33158
2L      8813   33158
2L      7885   33159
2L      1279   33159
2L      5095   33158
$
$ cat file2
2L      8813    0.6    1.2
2L      5762    0.4    0.5
2L      1279    0.5    0.9
$
$ awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
2L      8813    0.6    1.2 33158
2L      1279    0.5    0.9 33159
$
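
Note that the appended value in the output above is separated by a single space, because `print $0, x` joins its arguments with the default output field separator. If the input files are tab-delimited (an assumption; the question does not say), a minimal sketch that keeps the output tab-separated could set both `FS` and `OFS` to a tab. The temporary directory and the sample rows below are made up for illustration:

```shell
# Hypothetical tab-delimited sample data in a scratch directory.
tmp=$(mktemp -d)
printf '2L\t5753\t33158\n2L\t8813\t33158\n' > "$tmp/file1"
printf '2L\t8813\t0.6\t1.2\n2L\t5762\t0.4\t0.5\n' > "$tmp/file2"

# Same join as above, but fields are split and joined on tabs only.
out=$(awk 'BEGIN{FS=OFS="\t"}
           NR==FNR{a[$1,$2]=$3; next}
           ($1,$2) in a{print $0, a[$1,$2]}' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Only the `2L 8813` row of file2 has a matching key in file1, so a single tab-separated line with `33158` appended is printed.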

If that's not what you want, please clarify and perhaps post some more representative sample input/output.
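
On the side question of why swapping the file order changes the line count: each command prints lines of whichever file is read second, so the two counts only agree if matching keys occur equally often in both files. Sorting does not remove duplicates, so it would not change this. The sketch below uses made-up data in which one key appears once in the first file and twice in the second; whether the real files contain such duplicates is an assumption that would need checking.

```shell
# Hypothetical sample data: key "2L 100" occurs once in f1 but twice in f2.
tmp=$(mktemp -d)
printf '2L 100 A\n2L 200 B\n' > "$tmp/f1"
printf '2L 100 x\n2L 100 y\n2L 300 z\n' > "$tmp/f2"

# Count the records of the SECOND file whose (field 1, field 2) key
# was seen in the FIRST file.
count_matches() {
    awk 'NR==FNR{a[$1,$2]=1; next} ($1,$2) in a' "$1" "$2" | wc -l | tr -d ' '
}

n12=$(count_matches "$tmp/f1" "$tmp/f2")   # matching lines of f2
n21=$(count_matches "$tmp/f2" "$tmp/f1")   # matching lines of f1

echo "f1 then f2: $n12"   # 2
echo "f2 then f1: $n21"   # 1
rm -rf "$tmp"
```

A quick way to look for such duplicates in a real file is `awk '{print $1, $2}' file2 | sort | uniq -d`, which lists every key that occurs more than once.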

Here is a commented version of the awk one-liner, to provide the requested explanation:

awk ' # START SCRIPT

# IF the number of records read so far across all files is equal
#    to the number of records read so far in the current file, a
#    condition which can only be true for the first file read, THEN 
NR==FNR {

   # populate array "a" such that the value indexed by the first
   # 2 fields from this record in file1 is the value of the third
   # field from the first file.
   a[$1,$2]=$3

   # Move on to the next record so we don't do any processing intended
   # for records from the second file. This is like an "else" for the
   # NR==FNR condition.
   next

} # END THEN

# We only reach this part of the code if the above condition is false,
# i.e. if the current record is from file2, not from file1.

# IF the array index constructed from the first 2 fields of the current
#    record exists in array a, as would occur if these same values existed
#    in file1, THEN
($1,$2) in a {

   # print the current record from file2 followed by the value from file1
   # that occurred at field 3 of the record that had the same values for
   # field 1 and field 2 in file1 as the current record from file2.
   print $0, a[$1,$2]

} # END THEN

' file1 file2 # END SCRIPT
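
One difference from the commands in the question is the array index: `a[$1,$2]` joins the two fields with awk's `SUBSEP` character, whereas `a[$1$2]` simply concatenates them, so two different field pairs can collapse into the same key. The records below are invented to show the effect; whether such collisions actually occur in the real data is not known.

```shell
# Hypothetical records: ("2L","125") and ("2L1","25") are different keys,
# but both concatenate to the string "2L125".
tmp=$(mktemp -d)
printf '2L 125 first\n'  > "$tmp/f1"
printf '2L1 25 second\n' > "$tmp/f2"

# Concatenated key: the f2 record is wrongly reported as a match.
concat=$(awk 'NR==FNR{a[$1$2]=$3; next} {if ($1$2 in a) print}' \
             "$tmp/f1" "$tmp/f2" | wc -l | tr -d ' ')

# Comma-separated subscripts are joined with SUBSEP, so the keys stay distinct.
subsep=$(awk 'NR==FNR{a[$1,$2]=$3; next} ($1,$2) in a' \
             "$tmp/f1" "$tmp/f2" | wc -l | tr -d ' ')

echo "concatenated key matches: $concat"   # 1
echo "SUBSEP key matches:       $subsep"   # 0
rm -rf "$tmp"
```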

Hope that helps.
