Subset a file by row and column numbers


Question


We want to subset a text file on rows and columns, where the row and column numbers are read from files, excluding the header (row 1) and rownames (col 1).


inputFile.txt: tab-delimited text file

header  62  9   3   54  6   1
25  1   2   3   4   5   6
96  1   1   1   1   0   1
72  3   3   3   3   3   3
18  0   1   0   1   1   0
82  1   0   0   0   0   1
77  1   0   1   0   1   1
15  7   7   7   7   7   7
82  0   0   1   1   1   0
37  0   1   0   0   1   0
18  0   1   0   0   1   0
53  0   0   1   0   0   0
57  1   1   1   1   1   1


subsetCols.txt: comma-separated, no spaces, one row, numbers ordered. In the real data we have 500K columns and need to subset ~10K of them.

1,4,6


subsetRows.txt: comma-separated, no spaces, one row, numbers ordered. In the real data we have 20K rows and need to subset ~300 of them.

1,3,7


Current solution using cut and an awk loop (related post: Select rows using awk):

# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt

# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput

Output file result.txt:

1   4   6
3   3   3
7   7   7


Question:
This solution works fine for small files, but for a bigger file with 50K rows and 200K columns it takes too long: 15 minutes plus and still running. I think cutting the columns works fine; selecting the rows is the slow part.

Is there a better way?

Real input file info:

# $fileInput:
#        Rows = 20127
#        Cols = 533633
#        Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers


More information about the file: it contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns) to make the data more manageable (smaller) as input for other statistical software such as R.
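As a rough illustration of that downstream step (Python stands in here for whatever statistical software is used; the function name and path are placeholders), the tab-delimited subset can be read back as a numeric matrix:

```python
import csv

def load_matrix(path):
    """Read a tab-delimited subset (no header, no rownames) as integers."""
    with open(path, newline="") as fh:
        return [[int(v) for v in row] for row in csv.reader(fh, delimiter="\t")]
```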

System:

$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux




Update: The solution provided below by @JamesBrown was mixing the order of the columns on my system; it turned out I was using a different awk version. Mine is GNU Awk 3.1.7.

Answer

Even though they say that:


Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.


... whenever you see yourself piping sed, cut, grep, awk, etc., stop and say to yourself: awk can make it alone!


So in this case it is just a matter of extracting the rows and columns (shifting the indices to account for the excluded header and first column) and then buffering each output line before printing it.

awk -v cols="1 4 6" -v rows="1 3 7" '
    BEGIN{
       split(cols,c); for (i in c) col[c[i]]  # extract cols to print
       split(rows,r); for (i in r) row[r[i]]  # extract rows to print
    }
    (NR-1 in row){
       for (i=2;i<=NF;i++)
           ((i-1) in col) && (line=(line ? line OFS $i : $i))  # pick columns
       print line; line=""                                     # print them
    }' file

With your sample file:

$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) ((i-1) in col) && (line=(line ? line OFS $i : $i)); print line; line=""}' file
1 4 6
3 3 3
7 7 7


With your sample file, and inputs as variables, split on comma:

awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c,/,/); for (i in c) col[c[i]]; split(rows,r,/,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) ((i-1) in col) && (line=(line ? line OFS $i : $i)); print line; line=""}' $fileInput
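The same technique (hash the wanted row numbers, stream the file once, emit only the wanted columns) can also be sketched in Python for comparison; the function name, paths, and the 1-based index convention mirroring the awk version are all assumptions for illustration:

```python
import csv

def subset(in_path, out_path, rows, cols):
    """Stream in_path once, writing only the requested data rows/columns.

    rows and cols are 1-based indices into the data area, i.e. they do
    not count the header line or the rowname column (same convention as
    the awk solution above).
    """
    want_rows = set(rows)                  # O(1) row membership test
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        next(reader)                       # skip header (row 1)
        for nr, fields in enumerate(reader, start=1):
            if nr in want_rows:
                data = fields[1:]          # drop rowname (col 1)
                writer.writerow(data[c - 1] for c in cols)
```

The comma-separated index files can be parsed with, e.g., rows = [int(x) for x in open("subsetRows.txt").read().split(",")].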


I am quite sure this will be way faster. You can, for example, check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk with grep and others.


Best,
Kim Jong‑un
