Subset a file by row and column numbers
Problem description
We want to subset a text file on rows and columns, where the row and column numbers are read from files. The numbering excludes the header (row 1) and the row names (column 1).
inputFile.txt: tab-delimited text file
header 62 9 3 54 6 1
25 1 2 3 4 5 6
96 1 1 1 1 0 1
72 3 3 3 3 3 3
18 0 1 0 1 1 0
82 1 0 0 0 0 1
77 1 0 1 0 1 1
15 7 7 7 7 7 7
82 0 0 1 1 1 0
37 0 1 0 0 1 0
18 0 1 0 0 1 0
53 0 0 1 0 0 0
57 1 1 1 1 1 1
subsetCols.txt: comma separated with no spaces, one row, numbers in ascending order. In the real data we have 500K columns and need to subset ~10K of them.
1,4,6
subsetRows.txt: comma separated with no spaces, one row, numbers in ascending order. In the real data we have 20K rows and need to subset about 300 of them.
1,3,7
Current solution using cut and an awk loop (related post: Select rows using awk):
# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt
# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput
Output file: result.txt
1 4 6
3 3 3
7 7 7
Question:
This solution works fine for small files. For bigger files (50K rows and 200K columns) it takes too long: more than 15 minutes and still running. I think cutting the columns works fine; selecting the rows is the slow part.
Is there a better way?
Real input file info:
# $fileInput:
# Rows = 20127
# Cols = 533633
# Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers
More information about the file: it contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns) to make the data small enough to be manageable as input for other statistical software such as R.
System:
$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
Update: The solution provided below by @JamesBrown mixed up the order of the columns on my system, because I am using a different version of awk. My version is GNU Awk 3.1.7.
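One plausible cause of reordered columns, stated here as an assumption because that solution is not reproduced in this post: awk does not specify the order in which for (i in array) visits the elements, and the order can differ between awk versions. A minimal sketch of an order-preserving pattern uses the element count that split returns and walks the indices from 1 to n. The variable names cols and ordered are made up for the illustration.

```shell
# Assumption: the column list is comma separated, as in subsetCols.txt.
cols="1,4,6"

ordered=$(awk -v cols="$cols" 'BEGIN {
    n = split(cols, c, ",")            # split returns the number of elements
    for (k = 1; k <= n; k++)           # indices 1..n follow the input order
        out = out (k > 1 ? "," : "") c[k]
    print out
}')
echo "$ordered"
```

Walking 1..n instead of using for (k in c) keeps the output in the same order as the input list, whatever awk version is used.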
Recommended answer
Awk has even been described like this:
Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.
Whenever you see yourself piping sed, cut, grep, awk, etc., stop and say to yourself: awk can do it alone!
So in this case it is a matter of extracting the rows and columns (adjusting them to exclude the header and the first column) and then building each output line before printing it.
awk -v cols="1 4 6" -v rows="1 3 7" '
BEGIN{
   split(cols,c); for (i in c) col[c[i]]   # extract cols to print
   split(rows,r); for (i in r) row[r[i]]   # extract rows to print
}
(NR-1 in row){
   for (i=2;i<=NF;i++)
      if ((i-1) in col) line=(line=="" ? $i : line OFS $i)   # pick columns; compare with "" so a leading "0" field is kept
   print line; line=""                                        # print them
}' file
With your sample file:
$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) if ((i-1) in col) line=(line=="" ? $i : line OFS $i); print line; line=""}' file
1 4 6
3 3 3
7 7 7
With your sample file, and inputs as variables, split on comma:
awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) if ((i-1) in col) line=(line=="" ? $i : line OFS $i); print line; line=""}' "$fileInput"
I am quite sure this will be way faster. You can, for example, check "Remove duplicates from text file based on second text file" for some benchmarks comparing the performance of awk over grep and others.
Best,
Kim Jong-un
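Since the real file has about 533K columns and only about 12K of them are wanted, a variant could loop over the requested columns instead of over every field, and stop reading once the last requested row has been printed. The following is only a sketch under assumptions, not a benchmarked solution: it assumes the input is tab-delimited and that both lists are sorted ascending, as the question states. The sample_* file names are made up for the demo.

```shell
# Demo data: part of the sample from the question, converted to tab-delimited.
# The sample_* file names are illustrative only.
tr ' ' '\t' > sample_input.txt <<'EOF'
header 62 9 3 54 6 1
25 1 2 3 4 5 6
96 1 1 1 1 0 1
72 3 3 3 3 3 3
18 0 1 0 1 1 0
82 1 0 0 0 0 1
77 1 0 1 0 1 1
15 7 7 7 7 7 7
EOF
printf '1,4,6\n' > sample_cols.txt
printf '1,3,4,7\n' > sample_rows.txt

awk -F'\t' -v OFS='\t' \
    -v cols="$(cat sample_cols.txt)" -v rows="$(cat sample_rows.txt)" '
BEGIN {
    nc = split(cols, c, ",")                 # requested columns, in list order
    nr = split(rows, r, ",")
    for (k = 1; k <= nr; k++) want[r[k]]     # set of requested rows
    last = r[nr] + 1                         # +1 accounts for the header line
}
(NR - 1) in want {
    line = $(c[1] + 1)                       # +1 skips the row-name column
    for (k = 2; k <= nc; k++)
        line = line OFS $(c[k] + 1)
    print line
}
NR >= last { exit }                          # assumes rows are sorted ascending
' sample_input.txt > sample_result.txt

cat sample_result.txt
```

For the demo data this prints rows 1, 3, 4 and 7 restricted to columns 1, 4 and 6, tab-separated. Row 4 starts with a 0 in the first selected column, and it is kept because the line is built by plain concatenation rather than by testing whether line is "true".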