Extracting column ranges and reconstituting a matrix via awk


Problem description

Assume a text file (file1) that contains m lines, each holding an alphabetic string S (S_1, S_2, ..., S_m). Each S is preceded by a short alphanumeric string that acts as a barcode (here: foo1, bar7, baz3). All strings S have the same length. Each S is separated from its barcode by a single space.

$ cat file1
foo1 abcdefghijklmnopqrstuvwxyz
bar7 abcdefghijklmnopqrstuvwxyz
baz3 abcdefghijklmnopqrstuvwxyz

Assume a second file (file2) that contains n column-range specifications R (R_1, R_2, ..., R_n). The ranges are on a single line and separated by spaces. Each range R_x is shorter than S, and the combined length of the ranges (i.e., R_1 + R_2 + ... + R_n) is also shorter than S. No two ranges overlap, and none is a subset of another.

$ cat file2
2-11 14-19 23-24

Following this excellent answer, I understand that I can extract the first range (i.e., R_1) of all S via the following awk command, while keeping the barcodes correctly assigned:

awk 'NR==FNR{start=$1;lgth=$2;next} {print $1, substr($2,start,lgth)}' FS='-' file2 FS=' ' file1
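
One detail worth checking in the command above: awk's substr(s, m, n) takes a length as its third argument, not an end position, so a range written as start-end needs the length end-start+1. The sketch below contrasts the two calls for the range 2-11; the variable names s, as_length and as_range are made up for illustration.

```shell
s=abcdefghijklmnopqrstuvwxyz

# Third argument used directly as a length: 11 characters from position 2.
as_length=$(awk -v s="$s" 'BEGIN { print substr(s, 2, 11) }')

# Third argument computed as end-start+1: positions 2 through 11.
as_range=$(awk -v s="$s" 'BEGIN { print substr(s, 2, 11 - 2 + 1) }')

echo "$as_length"   # bcdefghijkl
echo "$as_range"    # bcdefghijk
```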

However, I am unsure how to extend the awk code so that it loops over all remaining ranges (here: R_2 and R_3) and appends each extracted piece to the growing matrix.

$ sought_outcome
foo1 bcdefghijknopqrswx
bar7 bcdefghijknopqrswx
baz3 bcdefghijknopqrswx

For clarity, here is the sought output again, with spaces inserted to mark the concatenation points:

     2-11       14-19  23-24
foo1 bcdefghijk nopqrs wx
bar7 bcdefghijk nopqrs wx
baz3 bcdefghijk nopqrs wx

Recommended answer

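awk to the rescue! Note that this does no validation checks. In the commands below, range is the file holding the ranges (file2 in the question) and file is the data file (file1).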

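$ awk 'NR==FNR {printf "%s", "key";      # first file: ranges, also printed as header
                n=NF;
                for(i=1;i<=n;i++)
                  {split($i,x,"-");
                   start[i]=x[1];
                   end[i]  =x[2];
                   printf "%s", FS $i};
                print "";
                next}

               {printf "%s", $1;         # second file: barcode, then each range
                # indexed loop: "for (i in start)" has no guaranteed order
                for(i=1;i<=n;i++) printf "%s", FS substr($2,start[i],end[i]-start[i]+1);
                print ""}' range file |
  column -t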


key   2-11        14-19   23-24
foo1  bcdefghijk  nopqrs  wx
bar7  bcdefghijk  nopqrs  wx
baz3  bcdefghijk  nopqrs  wx
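
Since the command above skips validation, here is a minimal sketch of how basic checks could be added. It is only one possible approach: the array names first and last, the error messages, and the scratch directory are my own choices, and it assumes an awk and a system where "/dev/stderr" is writable.

```shell
# Recreate the sample inputs from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'foo1 abcdefghijklmnopqrstuvwxyz' \
              'bar7 abcdefghijklmnopqrstuvwxyz' \
              'baz3 abcdefghijklmnopqrstuvwxyz' > "$workdir/file"
printf '%s\n' '2-11 14-19 23-24' > "$workdir/range"

result=$(awk '
  NR==FNR {                                  # first file: the ranges
    n = NF
    for (i = 1; i <= n; i++) {
      if ($i !~ /^[0-9]+-[0-9]+$/) {         # must look like start-end
        print "malformed range: " $i > "/dev/stderr"; exit 1
      }
      split($i, x, "-")
      first[i] = x[1] + 0; last[i] = x[2] + 0   # force numeric values
      if (first[i] < 1 || last[i] < first[i]) {
        print "invalid range: " $i > "/dev/stderr"; exit 1
      }
    }
    next
  }
  {                                          # second file: the data lines
    out = $1 FS
    for (i = 1; i <= n; i++) {
      if (last[i] > length($2)) {            # range must fit inside the string
        print "range " first[i] "-" last[i] " exceeds line " FNR > "/dev/stderr"
        exit 1
      }
      out = out substr($2, first[i], last[i] - first[i] + 1)
    }
    print out
  }
' "$workdir/range" "$workdir/file")

echo "$result"
rm -rf "$workdir"
```

The sketch stops at the first problem it finds and exits with a non-zero status, so a calling script can detect the failure. It does not check for overlapping ranges, which the question rules out by assumption.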

Or, without the header and without splitting the ranges into separate columns:

$ awk 'NR==FNR{for(i=1;i<=NF;i++)
                 {split($i,x,"-"); start[i]=x[1]; end[i]=x[2]};
               n=NF; next}
              {printf "%s", $1 FS;
               for(i=1;i<=n;i++) printf "%s", substr($2,start[i],end[i]-start[i]+1);
               print ""}' range file

foo1 bcdefghijknopqrswx
bar7 bcdefghijknopqrswx
baz3 bcdefghijknopqrswx

UPDATE: However, this is perhaps easier with cut and paste:

$ paste -d' ' <(cut -d' ' -f1 file) <(cut -d' ' -f2 file | cut -c$(tr ' ' ',' <range))
foo1 bcdefghijknopqrswx
bar7 bcdefghijknopqrswx
baz3 bcdefghijknopqrswx
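
The command works because cut -c accepts a comma-separated list of character ranges, and tr turns the space-separated ranges into exactly that list. A minimal sketch of the two steps on a single sample string follows; the variable names list and picked are made up for illustration.

```shell
# Turn the space-separated ranges into the comma-separated list cut expects.
list=$(printf '%s\n' '2-11 14-19 23-24' | tr ' ' ',')
echo "$list"      # 2-11,14-19,23-24

# Select those character positions from one sample string.
picked=$(printf '%s\n' 'abcdefghijklmnopqrstuvwxyz' | cut -c"$list")
echo "$picked"    # bcdefghijknopqrswx
```

Note that cut always emits the selected characters in their input order, regardless of the order in the list, so this approach fits the question only because the ranges are given in ascending order. The <(...) process substitution in the one-liner needs a shell such as bash, ksh or zsh.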
