Handling a text file in awk and making a new file


Problem description

I have a text file like this small example:

chr10:102721669-102724893   3217    3218    5
chr10:102721669-102724893   3218    3219    1
chr10:102721669-102724893   3219    3220    5
chr10:102721669-102724893   421 422 1
chr10:102721669-102724893   858 859 2
chr10:102539319-102568941   13921   13922   1
chr10:102587299-102589074   1560    1561    1
chr10:102587299-102589074   1565    1566    1
chr10:102587299-102589074   1595    1596    1
chr10:102587299-102589074   944 945 1

The expected output looks like this:

chr10:102721669-102724893   3217    3218    5   CA
chr10:102721669-102724893   3218    3219    1   CA
chr10:102721669-102724893   3219    3220    5   CA
chr10:102721669-102724893   421 422 1   BA
chr10:102721669-102724893   858 859 2   BA
chr10:102539319-102568941   13921   13922   1   NON
chr10:102587299-102589074   1560    1561    1   CA  
chr10:102587299-102589074   1565    1566    1   CA
chr10:102587299-102589074   1595    1596    1   CA
chr10:102587299-102589074   944 945 1   BA

The input has 4 tab-separated columns. The output has one more column holding one of 3 classes (CA, NON or BA):

1- If the value in the 1st column is not repeated in the input, the line is classified as NON in the 5th column of the output.
2- If (the number just after ":" in the 1st column + the 2nd column) - the number just after "-" in the 1st column is smaller than -30 (meaning -31 or smaller), the line is classified as BA. For example, in the last line: (102587299 + 944) - 102589074 = -831, so this line is classified as BA.
3- If that same value is equal to or bigger than -30 (for example -30 or -29), the line is classified as CA. For example, the 1st line:

(102721669 + 3217) - 102724893 = -7
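
To make the arithmetic in rules 2 and 3 concrete, here is a minimal sketch that prints the computed value next to the 2nd column for three of the sample lines. The variable name `offsets` and the array name `p` are chosen only for this illustration, and the input is piped in with `printf` instead of being read from file.txt.

```shell
# Feed three tab-separated sample lines (taken from the question) to awk.
offsets=$(printf '%s\t%s\t%s\t%s\n' \
    chr10:102721669-102724893 3217 3218 5 \
    chr10:102721669-102724893 421 422 1 \
    chr10:102587299-102589074 944 945 1 |
  awk -F '\t' '{
      # Split column 1 on ":" and "-": p[2] is the start, p[3] the end.
      split($1, p, /[:-]/)
      # Print the 2nd column and (start + 2nd column) - end.
      print $2, p[2] + $2 - p[3]
  }')
echo "$offsets"
```

Values smaller than -30 correspond to BA and the others to CA, before the NON rule for unrepeated 1st columns is taken into account.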

I am trying to do that in awk.

awk -F "\t"":""-" '{if($2+$4-$3 < -30) ; print $7 = BA,  if($2+$4-$3 >= -30) ; print $7 = CA}' file.txt > out.txt

But it does not return what I expect. Do you know how to fix it?

Recommended answer

Try:

$ awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1]++; next}
       { split($1, b, /[\t:-]/);
         $5 = a[$1]==1 ? "NON" : (b[2]+$2-b[3]) < -30 ? "BA" : "CA" }
       1' file.txt file.txt
chr10:102721669-102724893   3217    3218    5   CA
chr10:102721669-102724893   3218    3219    1   CA
chr10:102721669-102724893   3219    3220    5   CA
chr10:102721669-102724893   421 422 1   BA
chr10:102721669-102724893   858 859 2   BA
chr10:102539319-102568941   13921   13922   1   NON
chr10:102587299-102589074   1560    1561    1   BA
chr10:102587299-102589074   1565    1566    1   BA
chr10:102587299-102589074   1595    1596    1   BA
chr10:102587299-102589074   944 945 1   BA

Note that the chr10:102587299-102589074 lines with 1560, 1565 and 1595 in the 2nd column come out as BA here, not CA as in the question's expected output: the stated formula gives -215, -210 and -180 for them, all smaller than -30.

  • BEGIN{FS=OFS="\t"} sets both the input and output field separators to tab.
  • NR==FNR{a[$1]++; next} counts how many times each first field occurs in the file. The input file is passed twice so that, on the second pass, the decision can be based on that count.
  • split($1, b, /[\t:-]/) splits the first column further and saves the pieces in the array b (b[2] is the number after ":", b[3] the number after "-").
  • The rest of the code assigns the 5th field according to the given conditions, and the trailing 1 prints the modified line.

  • Further reading

    • Idiomatic awk
    • split function
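
For comparison, here is a sketch of the same two-pass classification written closer to the attempt in the question. It assumes the field separator is given as the bracket expression `[\t:-]`, so that `$2`, `$3` and `$4` hold the start, the end and the original 2nd column. The names `tmp`, `classes`, `count` and `cls` are hypothetical, and the sample is three lines from the question written to a temporary file, because the two passes need a real file rather than a pipe.

```shell
# Sample input: a temporary file holding three lines from the question.
tmp=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    chr10:102721669-102724893 3217 3218 5 \
    chr10:102721669-102724893 421 422 1 \
    chr10:102539319-102568941 13921 13922 1 > "$tmp"

# With FS set to [\t:-], $1 is "chr10", $2 the start, $3 the end and $4 the
# original 2nd column. $0 is never modified, so each line is printed as it
# was read, with the class appended after a tab.
classes=$(awk -F '[\t:-]' '
    NR == FNR { count[$1, $2, $3]++; next }   # pass 1: occurrences per region
    {
        if (count[$1, $2, $3] == 1)  cls = "NON"
        else if ($2 + $4 - $3 < -30) cls = "BA"
        else                         cls = "CA"
        print $0 "\t" cls
    }' "$tmp" "$tmp")
rm -f "$tmp"
echo "$classes"
```

This variant only works while the part of the 1st column before ":" contains no ":" or "-" of its own. The split() version above is less sensitive to that because it leaves the field layout of the line untouched.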
