Parsing from a column in Perl or bash
Problem description
A file I am working with looks like this:
NAMES n0 n1 n2 n3 n4 n5 n6 n7
REGION chr 1 100000
404 AAAAAAGA
992 TTTTTTTA
1146 CCCCGGCC
1727 CCCCCACC
1778 GCCCCCCC
I need to split the file based on the number in the first column, creating a new file for every 1000 units, so the output would be:
file1
NAMES n0 n1 n2 n3 n4 n5 n6 n7
REGION chr 404 992
404 AAAAAAGA
992 TTTTTTTA
file2
NAMES n0 n1 n2 n3 n4 n5 n6 n7
REGION chr 1146 1778
1146 CCCCGGCC
1727 CCCCCACC
1778 GCCCCCCC
So the file is split on the first column every 1000 units: the first file covers 1 to 1000, file 2 covers 1000 to 2000, and so on. The start and end positions (on the line starting with REGION) also change in every file: the first number is the position on the first data line of that file, and the second number is the position on its last data line. The header needs to be present in all files. Is there a way to name the files systematically as file1, file2, ...? Tabs (\t) are used as the separator throughout all files.
I tried:
awk '
NR==1 {
    h = $0
    k = 1000
    f = "file"k/1000
    print > f
    getline
    print "REGION chr",k-999,k > f
    next
}
$1 <=k {
    print > f
    next
}
{
    k=1000*int(1+$1/1000)
    f="file"k/1000
    print h > f
    print "REGION chr",k-999,k > f
    print > f
}' file
You have an awk answer, but as this question is tagged perl, I'll chip in a Perl one too.
#!/usr/bin/env perl
use strict;
use warnings;

my %seen;

# The first two input lines (NAMES and REGION) form the header.
my $header = <> . <>;
print $header;

my $last_sequence_number = 0;
open( my $output, ">", "output.$last_sequence_number.out" ) or die $!;
print {$output} $header;
$seen{$last_sequence_number}++;

while (<>) {
    my ($key) = split;
    # Skip blank lines and lines that do not start with a number.
    next unless defined $key and $key =~ m/^\d+$/;
    my $sequence_number = int( $key / 1000 );
    if ( $sequence_number != $last_sequence_number ) {
        print "Opening new file for $sequence_number\n";
        close($output);
        # Append if this file was written before, so earlier rows are not truncated.
        my $mode = $seen{$sequence_number} ? ">>" : ">";
        open( $output, $mode, "output.$sequence_number.out" ) or die $!;
        print {$output} $header unless $seen{$sequence_number}++;
        $last_sequence_number = $sequence_number;
    }
    print {$output} $_;
}
close($output);
What this does is:
- reads the first two lines of your input as the header.
- runs through the rest of the input, extracting the leading number from each line.
- divides it by 1000 and truncates the result to get a 'file number' to write to (so 0-999 go to output.0.out, 1000-1999 to output.1.out, and so on).
- opens a new file for that number whenever it changes (and, the first time it does so, writes the header).
- prints the current line to the currently open file.
Invoke it either through a pipe or as myscript.pl <filename>.
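The Perl script above copies the original REGION line into every file unchanged and names the files output.N.out. If the REGION line should instead carry the first and last position of each file, and the files should be named file1, file2, ..., the rows of one window can be buffered until the window ends. The following is only a rough awk sketch of that idea, not a tested drop-in. It assumes the input is sorted by the first column, that positions 1 to 1000 belong to file1 and 1001 to 2000 to file2, and that the chromosome name is the second field of the original REGION line. The function name split_by_window and the helper write_window are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: split the file given as $1 into file1, file2, ...
# by 1000-unit windows of the first column (assumes sorted input).
split_by_window() {
    awk '
    # Write the buffered rows of the current window to "file<N>".
    function write_window(   out) {
        if (n == 0) return
        out = "file" bucket
        print names > out
        print "REGION\t" chrom "\t" first "\t" last > out
        printf "%s", rows > out
        close(out)
        rows = ""; n = 0
    }
    NR == 1 { names = $0; next }      # NAMES line, repeated in every file
    NR == 2 { chrom = $2; next }      # REGION line is regenerated per file
    $1 !~ /^[0-9]+$/ { next }         # ignore anything that is not a data row
    {
        b = int(($1 - 1) / 1000) + 1  # 1..1000 -> 1, 1001..2000 -> 2, ...
        if (b != bucket) { write_window(); bucket = b; first = $1 }
        last = $1
        rows = rows $0 "\n"
        n++
    }
    END { write_window() }
    ' "$1"
}
```

Calling split_by_window input.txt (input.txt being a placeholder name) on the sample data should produce file1 with the REGION positions 404 and 992, and file2 with 1146 and 1778. Buffering one window in memory is what makes it possible to write the last position into the header before the rows; for very large windows a two-pass approach may be preferable.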