Reading specific rows of a large matrix data file


Question

Suppose I have a gigantic m*n matrix X (too big to read into memory) and a binary numeric vector V of length m. My objective is to read only the rows of X for which V[i] == 1 (and not those for which V[i] == 0) into a dedicated data table/matrix through a package such as (but not necessarily) bigmemory or ff.

This can be done by hacking nrows and skip and so on in read.table, but because of insufficient RAM I'm looking for a bigmemory, ff et al. type of solution.
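For reference, the nrows/skip idea can be approximated in base R by reading the file in chunks from an open connection and keeping only the wanted lines of each chunk. The following is a minimal sketch, not part of the original post: the helper name read_selected_rows and the chunk_size parameter are made up for illustration, and it assumes V has exactly one entry per data row and that no CSV field contains an embedded newline.

```r
# Hypothetical helper (not from the original post): read a CSV chunk by chunk
# and keep only the data rows i for which keep[i] == 1.
read_selected_rows <- function(path, keep, chunk_size = 10L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)       # first line of the file is the header
  pieces <- list()
  row_offset <- 0L                       # number of data rows consumed so far
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break       # end of file
    idx <- row_offset + seq_along(lines) # data-row indices of this chunk
    wanted <- lines[which(keep[idx] == 1)]
    if (length(wanted) > 0L) {
      # Parse only the wanted lines, re-using the header for column names
      pieces[[length(pieces) + 1L]] <-
        read.csv(text = c(header, wanted), header = TRUE)
    }
    row_offset <- row_offset + length(lines)
  }
  do.call(rbind, pieces)
}

# Small demonstration on a temporary file (same shape as a 100 x 5 example)
set.seed(1)
X <- array(rnorm(100 * 5), dim = c(100, 5))
path <- tempfile(fileext = ".csv")
write.csv(X, path)
V <- sample(c(rep(1, 50), rep(0, 50)))
res <- read_selected_rows(path, V)
```

Only one chunk of raw text is held at a time, but the selected rows themselves still accumulate in RAM, so this only helps when the V == 1 subset fits in memory.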

Here's an MWE that does not reflect the true size of my X.

X <- array(rnorm(100*5),dim=c(100,5))
write.csv(X,"target.csv")
V <- sample(c(rep(1,50),rep(0,50))) #Only want to read in half the rows corresponding to 1's
rm(X)

#Now ... How to read "target.csv"?


Recommended answer

How about using the command-line tool sed, constructing a command that lists the lines you want to read? I am not sure whether there is a command length limit on this...

#  Check the data
head( X )
#           [,1]        [,2]       [,3]       [,4]        [,5]
#[1,]  0.2588798  0.42229528  0.4469073  1.0684309  1.35519389
#[2,]  1.0267562  0.80299223 -0.2768111 -0.7017247 -0.06575137
#[3,]  1.0110365 -0.36998260 -0.8543176  1.6237827 -1.33320291
#[4,]  1.5968757  2.13831188  0.6978655 -0.5697239 -1.53799156
#[5,]  0.1284392  0.55596342  0.6919573  0.6558735 -1.69494827
#[6,] -0.2406540 -0.04807381 -1.1265165 -0.9917737  0.31186670

#  Check V, note row 6 above should be skipped according to this....
head(V)
# [1] 1 1 1 1 1 0

#  Get line numbers we want to read
head( which( V == 1 ) )
# [1] 1 2 3 4 5 7

#  Read the first 6 rows where V == 1 (the header is line 1 of the file, hence the leading 1L and the + 1L added to which());
#  integer line numbers avoid scientific notation such as "1e+05" appearing in the pasted command
read.csv( pipe( paste0("sed -n '" , paste0( c( 1L , which( V == 1 )[1:6] + 1L ) , collapse = "p; " ) , "p' C:/Data/target.csv" , collapse = "" ) ) , header = TRUE )

#  X        V1         V2         V3         V4          V5
#1 1 0.2588798  0.4222953  0.4469073  1.0684309  1.35519389
#2 2 1.0267562  0.8029922 -0.2768111 -0.7017247 -0.06575137
#3 3 1.0110365 -0.3699826 -0.8543176  1.6237827 -1.33320291
#4 4 1.5968757  2.1383119  0.6978655 -0.5697239 -1.53799156
#5 5 0.1284392  0.5559634  0.6919573  0.6558735 -1.69494827
#6 7 0.6856038  0.1082029  0.1523561 -1.4147429 -0.64041290

The command we are actually passing to sed is...

 "sed -n '1p; 2p; 3p; 4p; 5p; 6p; 8p' C:/Data/target.csv"

We use -n to turn off sed's default printing of every line, then pass a semicolon-separated list of the line numbers we do want to print, given to us by which( V == 1 ), and finally the target filename. Remember that these line numbers have been offset by +1 to account for the header row on line 1.
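If the command length does turn out to be a problem with many selected rows, one possible workaround is to write the sed program to a script file and pass it with sed's -f option, so the command line itself stays short. This is a sketch under assumptions rather than a tested recipe for every platform: the helper name sed_read_rows is hypothetical, and it assumes a Unix-like shell with sed on the PATH.

```r
# Hypothetical helper (not from the original answer): put one "Np" command
# per line in a temporary sed script instead of on the command line.
sed_read_rows <- function(path, keep) {
  # Line 1 is the header; data row i sits on file line i + 1.
  # Integers are used so large line numbers are never written as "1e+05".
  line_numbers <- c(1L, which(keep == 1) + 1L)
  script <- tempfile(fileext = ".sed")
  on.exit(unlink(script))
  writeLines(paste0(line_numbers, "p"), script)
  cmd <- paste("sed -n -f", shQuote(script), shQuote(path))
  read.csv(pipe(cmd), header = TRUE)
}

# Small demonstration, run only when sed is available
if (nzchar(Sys.which("sed"))) {
  set.seed(1)
  X <- array(rnorm(100 * 5), dim = c(100, 5))
  path <- tempfile(fileext = ".csv")
  write.csv(X, path)
  V <- sample(c(rep(1, 50), rep(0, 50)))
  res <- sed_read_rows(path, V)
}
```

As with the one-line command, the rows that sed prints are still read into an ordinary data frame, so the selected subset has to fit in memory.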
