Find the closest values: Multiple columns conditions


Problem description

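Following my first question (http://stackoverflow.com/questions/29827630/match-closest-value-from-two-different-files-and-print-specific-columns), I want to extend the condition: find the closest value using both the first and the second column of two different files, and print specific columns.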

File 1

1 2 3 4 a1
1 4 5 6 b1
8 5 9 11 c1

File 2

1 1 3 a
1 2 5 b
1 2.1 4 c
1 4 6 d 
2 4 5 e
9 4 1 f 
9 5 2 g
9 6 2 h
11 10 14 i
11 15 5 j

So, for example, for each $1 in file 1 I need to find the closest value of $1 in file 2, and then, among those rows, also find the closest value of $2.

Output:

1 2 a1*
1 2 b*
1 4 b1 
1 4 d 
8 5 c1 
9 5 g 

* The first line comes from file 1 and the second from file 2: for the first column of file 1, the closest value in the first column of file 2 is 1, and the second condition is that the second column must also be the closest value, which in this case is 2. I print $1, $2, $5 from file 1 and $1, $2, $4 from file 2.

The other output lines follow the same procedure.

The solution for finding the closest value is in my other post and was given by @Tensibai, but any solution will work. Thanks!

Solution

Sounds a little convoluted, but it works:

function closest(array, searched,    skeys, mkeys, x, tmp, distance, found1, found2) {
  # the names after the extra spaces are local variables (usual awk idiom)
  distance = 999999 # must be higher than the largest possible difference, otherwise nothing is found
  split(searched, skeys, OFS)
  # First pass: find the closest first part of the key
  for (x in array) { # loop over the array to get its keys
    split(x, mkeys, OFS) # split the array key into its two parts
    # +0 forces a numeric comparison; tmp is the absolute difference between the key and the target
    tmp = (mkeys[1]+0 > skeys[1]+0) ? mkeys[1] - skeys[1] : skeys[1] - mkeys[1]
    if (tmp < distance) { # if the distance is less than the best so far, update
      distance = tmp
      found1 = mkeys[1] # and save the key part actually found closest
    }
  }
  # At this point we have the first part of the key; redo the work for the second part
  distance = 999999
  for (x in array) {
    split(x, mkeys, OFS)
    if (mkeys[1] == found1) { # only consider keys whose first part matches
      # same absolute difference as above, on the second part of the key
      tmp = (mkeys[2]+0 > skeys[2]+0) ? mkeys[2] - skeys[2] : skeys[2] - mkeys[2]
      if (tmp < distance) { # if the distance is less than the best so far, update
        distance = tmp
        found2 = mkeys[2] # and save the key part actually found closest
      }
    }
  }
  # Now we have the second field as well
  return (found1 OFS found2)  # return the combined key from our two searches
}

{
   if (NR > FNR) { # second file: FNR (record number within the current file) is now smaller than NR (total record number)
     b[($1 OFS $2)] = $4 # build an array with "$1 $2" as key and $4 as value
   } else {
     key = ($1 OFS $2) # build the key once to avoid recomputing it later
     akeys[max++] = key # store the keys under a numeric index to keep the file order, as for (x in array) does not guarantee any order
     a[key] = $5 # build an array with the key stored previously and $5 as value
   }
}

END { # both files have been parsed, print the result
  for (i = 0; i < max; i++) { # walk the numeric indexes in order; for (i in akeys) would not guarantee the file order
    print akeys[i],a[akeys[i]] # print the entry from the first file (key, then value)
    if (akeys[i] in b) { # if the same key exists in the second file
      print akeys[i],b[akeys[i]] # then print it
    } else {
      bindex = closest(b,akeys[i]) # call the function to find the closest key from the second file
      print bindex,b[bindex] # print what we found
    }
  }
}

Note that I'm using OFS to combine the fields into keys, so if you change OFS for the output, the keys are still built and split consistently.

WARNING: This should be fine for relatively short files, but since the array from the second file is now traversed twice, each search takes twice as long. END OF WARNING
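One possible way to avoid the double traversal is to compare both distances in a single pass. The following is only a sketch, not part of the original answer: the shell function name closest_pair and the awk helper abs are hypothetical, it reads file 2 from standard input, and it assumes the same layout as the sample ($1 and $2 are the keys, $4 is the label). Ties are also handled slightly differently from the script above: when two different first-column values are equally close, this version considers the rows of both.

```sh
# Hypothetical helper: print "$1 $2 $4" of the row closest to the pair (s1, s2).
# Closest means: smallest |$1 - s1| first, then smallest |$2 - s2| among those rows.
closest_pair() {
  awk -v s1="$1" -v s2="$2" '
    function abs(v) { return v < 0 ? -v : v }
    {
      d1 = abs($1 - s1)   # distance on the first column
      d2 = abs($2 - s2)   # distance on the second column
      # keep the row if it is the first one, or strictly better on (d1, d2)
      if (NR == 1 || d1 < best1 || (d1 == best1 && d2 < best2)) {
        best1 = d1
        best2 = d2
        line = $1 OFS $2 OFS $4
      }
    }
    END { if (NR) print line }   # print nothing for empty input
  '
}
```

With the sample file 2 from the question, `closest_pair 8 5 < f2` prints `9 5 g`, which matches the last line of the expected output.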

There's room for a better search algorithm if your files are sorted (but that was not the case in the previous question, and you wished to keep the order from the file). A first improvement in that case would be to break out of the for loop as soon as the distance starts to be greater than the preceding one.
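As an illustration of that early exit, here is a minimal sketch. It assumes file 2 is sorted numerically in ascending order on its first column and is read from standard input; the function name closest_sorted and the helper abs are hypothetical and not part of the original answer. Because the input is sorted, the first-column distance can only grow once it has started growing, so the scan can stop at that point.

```sh
# Hypothetical helper for input sorted ascending on $1:
# stop reading as soon as the first-column distance gets worse than the best seen.
closest_sorted() {
  awk -v s1="$1" -v s2="$2" '
    function abs(v) { return v < 0 ? -v : v }
    {
      d1 = abs($1 - s1)
      d2 = abs($2 - s2)
      if (seen && d1 > best1) exit   # later rows are even farther: jump to END
      if (!seen || d1 < best1 || (d1 == best1 && d2 < best2)) {
        best1 = d1
        best2 = d2
        line = $1 OFS $2 OFS $4
        seen = 1
      }
    }
    END { if (seen) print line }
  '
}
```

On the sample file 2, which happens to be sorted on its first column, `closest_sorted 1 2 < f2` stops at the row starting with 2 and prints `1 2 b`.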

Output from your sample files:

$ mawk -f closest2.awk f1 f2
1 2 a1
1 2 b
1 4 b1
1 4 d
8 5 c1
9 5 g
