用匹配的键合并行 [英] Combine lines with matching keys

查看:57
本文介绍了用匹配的键合并行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个具有以下结构的文本文件

I have a text file with the following structure

ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1

每个ID对应于操作员多次分析的数据集.即J1J2是运算符J的第一次和第二次尝试.度量abcd使用4种略有不同的算法来度量一个值,其真实值位于列true

Each ID corresponds to a dataset which is analysed by an operator several times. i.e J1 and J2 are the first and second attempt by operator J. The measures a, b, c and d use 4 slightly different algorithms to measure a value whose true value lies in the column true

我想做的是创建3个新的文本文件,比较J1 vs J2S1 vs S2J1 vs S1的结果. J1J2的示例输出:

What I would like to do is to create 3 new text files comparing the results for J1 vs J2, S1 vs S2 and J1 vs S1. Example output for J1 vs J2:

ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3

其中a1J1等的测量a.

另一个示例是S1S2的对比:

Another example is for S1 vs S2:

ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1
WCBP12234,81.8,66.6,82.7,67.9,67.0,53,87.5,70.7,75.3

ID不会按字母数字顺序排列,也不会为相同的ID聚集运算符.我不确定如何最好地完成此任务-使用linux工具或诸如perl/python之类的脚本语言.

The IDs will not be in alphanumerical order nor will the operators be clustered for the same ID. I'm not certain how best to approach this task - using linux tools or a scripting language like perl/python.

我最初使用linux的尝试很快遇到了障碍

My initial attempt using linux quickly hit a brick wall

首先找到所有唯一ID(已排序)

First find all unique IDs (sorted)

awk -F, '/^WCBP/ {print $1}' file | uniq | sort -k 1.5n > unique_ids

浏览这些ID,然后对J1J2进行排序:

Loop through these IDs and sort J1, J2:

foreach i (`more unique_ids`)
    grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2
end

这给了我排序后的数据

WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,80.4
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12238,J1,78.6,79.0,56.2,82.1,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,75.3
WCBP12239,J2,66.6,72.9,79.5,76.6,75.3

我不确定如何重新排列此数据以获得所需的结构.我尝试在foreach循环awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'

I'm not sure how to rearrange this data to get the desired structure. I tried adding an additional pipe to awk in the foreach loop awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'

有什么想法吗?我敢肯定,使用awk可以用不太麻烦的方式完成此操作,尽管使用适当的脚本语言可能会更好.

Any ideas? I'm sure this can be done in a less cumbersome manner using awk, although it may be better using a proper scripting language.

推荐答案

您可以使用Perl csv模块

You can use the Perl csv module Text::CSV to extract the fields, and then store them in a hash, where ID is the main key, the second field is the secondary key and all the fields are stored as the value. It should then be trivial to do whatever comparisons you want. If you want to retain the original order of your lines, you can use an array inside the first loop.

use strict;
use warnings;
use Text::CSV;

my %data;
my $csv = Text::CSV->new({
            binary => 1,      # safety precaution
            eol    => $/,     # important when using $csv->print()
    });
while ( my $row = $csv->getline(*ARGV) ) {
    my ($id, $J) = @$row;   # first two fields
    $data{$id}{$J} = $row;  # store line
}

这篇关于用匹配的键合并行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆