用匹配的键合并行 [英] Combine lines with matching keys
问题描述
我有一个具有以下结构的文本文件
I have a text file with the following structure
ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
每个ID
对应于操作员多次分析的数据集.即J1
和J2
是运算符J的第一次和第二次尝试.度量a
,b
,c
和d
使用4种略有不同的算法来度量一个值,其真实值位于列true
Each ID
corresponds to a dataset which is analysed by an operator several times. i.e J1
and J2
are the first and second attempt by operator J. The measures a
, b
, c
and d
use 4 slightly different algorithms to measure a value whose true value lies in the column true
我想做的是创建3个新的文本文件,比较J1
vs J2
,S1
vs S2
和J1
vs S1
的结果. J1
与J2
的示例输出:
What I would like to do is to create 3 new text files comparing the results for J1
vs J2
, S1
vs S2
and J1
vs S1
. Example output for J1
vs J2
:
ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3
其中a1
是J1
等的测量a
.
另一个示例是S1
与S2
的对比:
Another example is for S1
vs S2
:
ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1
WCBP12234,81.8,66.6,82.7,67.9,67.0,53,87.5,70.7,75.3
ID不会按字母数字顺序排列,也不会为相同的ID聚集运算符.我不确定如何最好地完成此任务-使用linux工具或诸如perl/python之类的脚本语言.
The IDs will not be in alphanumerical order nor will the operators be clustered for the same ID. I'm not certain how best to approach this task - using linux tools or a scripting language like perl/python.
我最初使用linux的尝试很快遇到了障碍
My initial attempt using linux quickly hit a brick wall
首先找到所有唯一ID(已排序)
First find all unique IDs (sorted)
awk -F, '/^WCBP/ {print $1}' file | uniq | sort -k 1.5n > unique_ids
浏览这些ID,然后对J1
,J2
进行排序:
Loop through these IDs and sort J1
, J2
:
foreach i (`more unique_ids`)
grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2
end
这给了我排序后的数据
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,80.4
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12238,J1,78.6,79.0,56.2,82.1,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,75.3
WCBP12239,J2,66.6,72.9,79.5,76.6,75.3
我不确定如何重新排列此数据以获得所需的结构.我尝试在foreach
循环awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'
I'm not sure how to rearrange this data to get the desired structure. I tried adding an additional pipe to awk
in the foreach
loop awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'
有什么想法吗?我敢肯定,使用awk
可以用不太麻烦的方式完成此操作,尽管使用适当的脚本语言可能会更好.
Any ideas? I'm sure this can be done in a less cumbersome manner using awk
, although it may be better using a proper scripting language.
推荐答案
You can use the Perl csv module Text::CSV to extract the fields, and then store them in a hash, where ID is the main key, the second field is the secondary key and all the fields are stored as the value. It should then be trivial to do whatever comparisons you want. If you want to retain the original order of your lines, you can use an array inside the first loop.
use strict;
use warnings;
use Text::CSV;
my %data;
my $csv = Text::CSV->new({
binary => 1, # safety precaution
eol => $/, # important when using $csv->print()
});
while ( my $row = $csv->getline(*ARGV) ) {
my ($id, $J) = @$row; # first two fields
$data{$id}{$J} = $row; # store line
}
这篇关于用匹配的键合并行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!