One nearest neighbour using awk

Question

This is what I am trying to do in AWK. My problem is mainly with step 2. I have shown a sample dataset, but the original dataset consists of 100 fields and 2000 records.

Algorithm

1) initialize accuracy = 0

2) for each record r

     Find the closest other record, o, in the dataset using the distance formula

To find the nearest neighbour of r0, I need to compare r0 with each of r1 to r9 and compute: square(abs(r0.c1 - r1.c1)) + square(abs(r0.c2 - r1.c2)) + ... + square(abs(r0.c5 - r1.c5)), then store those distances.

3) Take the record with the minimum distance and compare its c6 value with that of r. If the c6 values are equal, increment accuracy by 1.

Repeat the process for all the records.

4) Finally, get the 1NN accuracy percentage as (accuracy/total_records) * 100.

Sample Dataset

        c1   c2   c3   c4   c5   c6  --> Columns
  r0  0.19 0.33 0.02 0.90 0.12 0.17  --> row1 & row7 nearest neighbour in c1
  r1  0.34 0.47 0.29 0.32 0.20 1.00      and same values in c6(0.3) so ++accuracy
  r2  0.37 0.72 0.34 0.60 0.29 0.15 
  r3  0.43 0.39 0.40 0.39 0.32 0.27 
  r4  0.27 0.41 0.08 0.19 0.10 0.18 
  r5  0.48 0.27 0.68 0.23 0.41 0.25 
  r6  0.52 0.68 0.40 0.75 0.75 0.35 
  r7  0.55 0.59 0.61 0.56 0.74 0.76 
  r8  0.04 0.14 0.03 0.24 0.27 0.37 
  r9  0.39 0.07 0.07 0.08 0.08 0.89

Code

BEGIN   {
            #initialize accuracy and total_records
            accuracy = 0;
            total_records = 10;
        }


NR==FNR {    # Loop through each record and store it in an array
                for (i=1; i<=NF; i++) 
                {
                     records[i]=$i;
                }
            next             
        }

        {   # Re-Loop through the file and compare each record from the array with each record in a file    
              for(i=1; i <= length(records); i++)
              {
                   for (j=1; j<=NF; j++) 
                   {      # here I need to get the difference of each field of the record[i] with each all the records, square them and sum it up. 
                          distance[j] = (records[i] - $j)^2;
                   }
               #Once I have all the distance, I can simply compare the values of field_6 for the record with least distance.
              if(min(distance[j]))
              {
                  if(records[$6] == $6)
                  {
                        ++accuracy;
                  } 
              }
       }
END{
     percentage = 100 * (accuracy/total_records); 
     print percentage;
}

Solution

Here is one approach:

$ cat -n file > nfile
$ join nfile{,} -j99 | 
  awk 'function abs(x) {return x>0?x:-x}  
           $1<$8 {minc=999;for(i=2;i<7;i++) 
                 {d=abs($i-$(i+7)); 
                  if(d<minc)minc=d} 
                  print $1,minc,$7==$14}' | 
  sort -u -k1,2 -k3r | 
  awk '!a[$1]++{sum+=$3} END{print sum}'

7

Due to symmetry you only need to compare n*(n-1)/2 pairs of records. It is easier to set this up with join, which prepares all the pairings, and then filter out the redundant ones with $1<$8. The awk script finds the minimum column distance for each pair and records whether the last fields match ($7==$14). Sorting by first record number and distance brings the minimum distance for each record to the top, and the final awk sums the matched entries.

With your formulation I guess the result would be 100*2*7/10 = 140%, since you are double counting (R1~R7 and R7~R1); otherwise it is 70%.

UPDATE
With the new distance function, the script can be rewritten as:

$ join nfile{,} -j999 | 
  awk '$1<$8 {d=0; 
              for(i=2;i<7;i++) d+=($i-$(i+7))^2; 
              print $1,d,$7==$14}' | 
  sort -k1,1n -k2,2g -k3,3nr | 
  awk '!a[$1]++{sum+=$3;count++} 
            END{print 100*sum/(count+1)"%"}'

70%

Explanation

cat -n file > nfile: creates a new file with a record number prepended to each line. join cannot read both of its inputs from stdin, so a temporary file is needed.

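join nfile{,} -j999: builds the cross product of the records. In bash, nfile{,} expands to nfile nfile, and because field 999 does not exist every line gets the same empty join key, so each record is joined with every record (the same effect as two nested loops).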
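
The cross-product trick is easy to check on a tiny input. The following is a small sketch, assuming GNU join; the file names demo.txt and ndemo.txt and their two records are made up for illustration, and the two file arguments are written out instead of using brace expansion.

```shell
# Hypothetical two-record file, numbered with cat -n as in the answer.
printf '0.1 A\n0.2 B\n' > demo.txt
cat -n demo.txt > ndemo.txt

# Field 99 does not exist, so every line has the same (empty) join key
# and join pairs each line with every line: 2 x 2 = 4 output rows.
pairs=$(join -j 99 ndemo.txt ndemo.txt)
echo "$pairs"
```

Each output row holds the record number and fields of the first record followed by those of the second, so here the two record numbers are in fields 1 and 4.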

$1<$8: keeps only the upper triangular part of the cross product (if you imagine it as a 2D matrix).

for(i=2;i<7;i++) d+=($i-$(i+7))^2: computes the squared distance between the two records of each pair.
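
As a worked example of that loop, here are rows r0 and r1 of the sample dataset written as a single joined line. This is only a sketch: the record numbers 1 and 2 are what cat -n would assign, and the line is typed by hand rather than produced by join.

```shell
# Record number, c1..c6 of r0, then record number, c1..c6 of r1.
line='1 0.19 0.33 0.02 0.90 0.12 0.17 2 0.34 0.47 0.29 0.32 0.20 1.00'

out=$(echo "$line" | awk '{ d = 0
    for (i = 2; i < 7; i++) d += ($i - $(i+7))^2   # c1..c5 only
    print $1, d, $7 == $14 }')
echo "$out"   # prints: 1 0.4578 0
```

The five squared differences are 0.0225, 0.0196, 0.0729, 0.3364 and 0.0064, which sum to 0.4578; the final 0 says that the c6 values (0.17 and 1.00) differ.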

print $1,d,$7==$14: prints the "from" record number, the squared distance, and an indicator of whether the last fields match.

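sort -k1,1n -k2,2g -k3,3nr: sorts by record number and then by distance, so that the minimum for each record comes first; the third field is sorted in reverse so that a 1 comes before a 0 when distances tie. Each key needs its own field range, because a key written as -k1,2n spans both fields and only its leading number is compared numerically.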
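
To see how the sort keys interact, here is a small sketch on made-up rows in the record/distance/indicator layout. It assumes GNU sort (for -g) and gives each key its own field range, so that the distance is compared as a number before the match indicator is considered.

```shell
# Made-up rows: record number, squared distance, match indicator.
rows='1 10.5 0
1 9.2 0
1 9.2 1
2 0.3 0'

# One key per field: record number (numeric), distance (general numeric),
# then the indicator in reverse so that 1 precedes 0 on equal distances.
nearest=$(printf '%s\n' "$rows" |
    sort -k1,1n -k2,2g -k3,3nr |
    awk '!seen[$1]++')
echo "$nearest"
```

The awk filter keeps the first row per record number, so record 1 is represented by the row with distance 9.2 and indicator 1, not by the 10.5 row that a plain text comparison would place first.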

!a[$1]++{sum+=$3;count++}: takes only the first row for each "from" record, counting those rows and summing their indicators.

END{print 100*sum/(count+1)"%"}: the number of records is one more than the number of "from" records (the last record never appears in the first field), hence count+1; the result is printed as a percentage.

To understand what is going on, I suggest running each stage of the pipeline separately and verifying the intermediate results.

For your real data you have to change the hard-coded reference values (the loop bound and the field offsets 7, 8 and 14). The join field number must be larger than your field count.
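
The pipeline above keeps only pairs with $1<$8, so each record is compared only with the records that follow it. If what you need is each record's nearest neighbour among all other records, as step 2 of the question describes, a single awk program that holds the records in memory is one alternative. The following is a minimal sketch, not a drop-in replacement: the file name knn_sample.txt and its four records are made up for illustration, the last field is assumed to be the class label, and ties go to the first record found.

```shell
# Hypothetical sample: two features and a class label per line.
cat > knn_sample.txt <<'EOF'
0.0 0.0 A
0.1 0.0 A
1.0 1.0 B
0.9 1.0 A
EOF

result=$(awk '
{   # store every record; fields 1..NF-1 are features, field NF is the label
    n = NR
    nf = NF
    for (k = 1; k <= NF; k++) rec[NR, k] = $k
}
END {
    for (i = 1; i <= n; i++) {
        best = 0                      # closest other record so far (0 = none)
        for (j = 1; j <= n; j++) {
            if (j == i) continue      # a record is not its own neighbour
            d = 0
            for (k = 1; k < nf; k++) d += (rec[i, k] - rec[j, k]) ^ 2
            if (best == 0 || d < bestd) { best = j; bestd = d }
        }
        # count a hit when the nearest neighbour has the same label
        if (best && rec[i, nf] == rec[best, nf]) accuracy++
    }
    if (n > 0) printf "%g%%\n", 100 * accuracy / n
}' knn_sample.txt)
echo "$result"   # prints: 50%
```

In this sample the first two records are each other's nearest neighbour and share label A, while the last two are nearest to each other but carry different labels, which gives 2 hits out of 4. No field numbers are hard coded, since the script relies on NF, so the same sketch should apply to wider files.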
