Parsing from a column in Perl or bash


Problem description


A file I am working with looks like this:

NAMES   n0  n1  n2  n3  n4  n5  n6  n7
REGION  chr 1   100000
404 AAAAAAGA
992 TTTTTTTA
1146    CCCCGGCC
1727    CCCCCACC
1778    GCCCCCCC

I need to split the file based on the number in the first column, creating a new file for every 1000 units, so the output would be:

file1
NAMES   n0  n1  n2  n3  n4  n5  n6  n7
REGION  chr 404 992
404 AAAAAAGA
992 TTTTTTTA

file2
NAMES   n0  n1  n2  n3  n4  n5  n6  n7
REGION  chr 1146    1778
1146    CCCCGGCC
1727    CCCCCACC
1778    GCCCCCCC

So the file is split on the first column every 1000 units: the first file covers 1 to 1000, file2 covers 1000 to 2000, and so on. The start and end positions on the line starting with REGION also need to change in every file: the first number is the position on the first data line of that file, and the second number is the position on its last data line. The header needs to be present in all files. Is there a way to name the files systematically (file1, file2, ...)? Tabs (\t) are used as separators throughout all files.

I tried:

awk '
NR==1 {
   h = $0
   k = 1000
   f = "file"k/1000
   print > f
   getline
   print "REGION chr",k-999,k > f
   next
} 
$1 <=k {
   print > f
   next
} 
{
   k=1000*int(1+$1/1000)
   f="file"k/1000
   print h > f
   print "REGION chr",k-999,k > f
   print > f
}' file

Solution

You have an awk answer, but as this question is tagged perl I'll chip in a perl one too.

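#!/usr/bin/env perl
use strict;
use warnings;

# Sequence numbers whose output file already contains the header.
my %seen;

# The first two input lines (NAMES and REGION) form the header.
my $header = <> . <>;
print $header;

my $last_sequence_number = 0;

open( my $output, ">", "output.$last_sequence_number.out" ) or die $!;
print {$output} $header;
$seen{$last_sequence_number}++;

while (<>) {
    # The first whitespace-separated field is the position.
    my ($key) = split;
    next unless defined $key && $key =~ m/^\d+$/;
    my $sequence_number = int( $key / 1000 );
    if ( $sequence_number != $last_sequence_number ) {
        print "Opening new file for $sequence_number\n";
        close($output);
        # Append when returning to a file written earlier, so that
        # unsorted input does not truncate it.
        my $mode = $seen{$sequence_number} ? ">>" : ">";
        open( $output, $mode, "output.$sequence_number.out" ) or die $!;
        print {$output} $header unless $seen{$sequence_number}++;
        $last_sequence_number = $sequence_number;
    }
    print {$output} $_;
}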

What this does is:

  • Reads two lines from your input to capture the headers.
  • Runs through the rest of the input, extracting the number at the start of each line.
  • Divides that number by 1000 and truncates the result to get a 'file number' to write to.
  • Opens a new file whenever the file number changes (and writes the headers the first time that file is opened).
  • Prints the current line to the currently open file.
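
The 'file number' arithmetic is worth checking on a few values. The script's int($key / 1000) sends positions 1 to 999 to file 0 and position 1000 to file 1, whereas the question's "1 to 1000" wording suggests that 1000 belongs in the first file. The sketch below compares the two mappings; the helper names bucket_floor and bucket_range are made up for illustration, and positions are assumed to be positive integers.

```shell
# Mapping used by the Perl script: truncate position / 1000.
bucket_floor() { awk -v n="$1" 'BEGIN { print int(n / 1000) }'; }

# Alternative that matches "1 to 1000 goes to file1" (assumes n >= 1).
bucket_range() { awk -v n="$1" 'BEGIN { print int((n - 1) / 1000) + 1 }'; }

# Print position, floor-style bucket, range-style bucket.
for n in 404 992 1000 1146 1778 2000; do
    printf '%s\t%s\t%s\n' "$n" "$(bucket_floor "$n")" "$(bucket_range "$n")"
done
```

Only the boundary values (1000, 2000, ...) land in different groups relative to their neighbours; everything else is shifted by one, which affects the file names but not the grouping.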

Invoke it either through a pipe or as myscript.pl <filename>.
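
The Perl script copies the original REGION line unchanged into every output file, while the question also asks for its start and end to reflect the first and last position in each file, and for names of the form file1, file2. The following is a minimal awk sketch of one way to do that, not the answer's method. It assumes the input is sorted by position, that the chromosome name is the second field of the REGION line, and that each group fits in memory; the function name split_by_position and its prefix argument are hypothetical.

```shell
split_by_position() {
    # $1: input file; $2: output name prefix (defaults to "file")
    awk -v OFS='\t' -v prefix="${2:-file}" '
    # Write the buffered group: NAMES line, rebuilt REGION line, data lines.
    function write_bucket(   out) {
        if (n == 0) return
        out = prefix bucket
        print names > out
        print "REGION", chr, first, last > out
        printf "%s", body > out
        close(out)
        n = 0; body = ""
    }
    NR == 1 { names = $0; next }          # NAMES header, copied as-is
    NR == 2 { chr = $2; next }            # keep only the chromosome name
    $1 !~ /^[0-9]+$/ { next }             # skip lines without a position
    {
        b = int(($1 - 1) / 1000) + 1      # 1..1000 -> 1, 1001..2000 -> 2
        if (n > 0 && b != bucket) write_bucket()
        if (n == 0) { bucket = b; first = $1 }
        last = $1
        body = body $0 "\n"
        n++
    }
    END { write_bucket() }
    ' "$1"
}

# Demo on the sample from the question, in a scratch directory.
demo_dir=$(mktemp -d)
printf 'NAMES\tn0\tn1\tn2\tn3\tn4\tn5\tn6\tn7\nREGION\tchr\t1\t100000\n404\tAAAAAAGA\n992\tTTTTTTTA\n1146\tCCCCGGCC\n1727\tCCCCCACC\n1778\tGCCCCCCC\n' > "$demo_dir/input.txt"
split_by_position "$demo_dir/input.txt" "$demo_dir/file"
head "$demo_dir"/file[0-9]*
```

Buffering each group before writing is what makes the REGION line possible, since the last position is only known once the group ends. If the input were not sorted, a group revisited later would overwrite its earlier file, so sort by the first column beforehand in that case.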
