R subset functions, including '[' not working on middle range of large dataframe/matrix

Problem description

I'm having a strange issue where I am looping over a large data frame to create a 3D barplot from the data in 2 columns, where the Z axis is the frequency. The original data frame looks like this (please excuse excess columns):

> head(MergedBH)
                   Row.names           V1.x            V2.x V3.x  V4.x V5.x
RFL_Contig1       RFL_Contig1    RFL_Contig1 Scaffold3494078 1.00 1.000  470
RFL_Contig100   RFL_Contig100  RFL_Contig100 Scaffold2661063 0.61 0.975  236
RFL_Contig1000 RFL_Contig1000 RFL_Contig1000  Scaffold861300 0.96 0.995  451
RFL_Contig1001 RFL_Contig1001 RFL_Contig1001 Scaffold4753307 0.67 0.982  568
RFL_Contig1002 RFL_Contig1002 RFL_Contig1002  Scaffold317096 1.00 0.996 1513
RFL_Contig1003 RFL_Contig1003 RFL_Contig1003   Scaffold60619 0.90 1.000  698
                     V1.y                  V2.y V3.y  V4.y V5.y
RFL_Contig1       RFL_Contig1 ta_contig_5DS_2768763 1.00 1.000  572
RFL_Contig100   RFL_Contig100  ta_contig_4DS_482537 0.56 0.966  737
RFL_Contig1000 RFL_Contig1000 ta_contig_2AL_5829507 0.83 0.944 1573
RFL_Contig1001 RFL_Contig1001 ta_contig_7BS_3161139 1.00 0.999  910
RFL_Contig1002 RFL_Contig1002 ta_contig_3B_10401908 1.00 0.997 2681
RFL_Contig1003 RFL_Contig1003 ta_contig_2AL_6424276 0.70 1.000 1004

I want to create a 3d barplot where the x axis is $V4.x and the y axis is $V4.y. I don't use the typical hist2d function since so much weight is at the 1,1 position, and we want to visualize the weight at that position against the others as well. To do this I created a 3 column matrix with columns 1-2 containing all pairwise combinations in the range of V4.x and y respectively (.8-1 by .001), and the final column being the frequency. I do this with the lines below:

> for3d.mat <- matrix(ncol=3,nrow=0)
> for(i in seq(.8,1,by=.001)){for(j in seq(.8,1,by=.001)){iter.mat <- matrix(ncol=3,c(i,j,length(subset(MergedBH,MergedBH$V4.x==i & MergedBH$V4.y==j)$V4.x)));for3d.mat <- rbind(for3d.mat,iter.mat)}}
> subset(for3d.mat,for3d.mat[,1] == .975 & for3d.mat[,2] == .966)
 [,1] [,2] [,3]
> for3d.mat[35350:35325,]
   [,1]  [,2] [,3]
 [1,] 0.975 0.974    0
 [2,] 0.975 0.973    0
 [3,] 0.975 0.972    0
 [4,] 0.975 0.971    0
 [5,] 0.975 0.970    0
 [6,] 0.975 0.969    0
 [7,] 0.975 0.968    0
 [8,] 0.975 0.967    0
 [9,] 0.975 0.966    0
[10,] 0.975 0.965    0
[11,] 0.975 0.964    0
[12,] 0.975 0.963    0
[13,] 0.975 0.962    0
[14,] 0.975 0.961    0
[15,] 0.975 0.960    0
[16,] 0.975 0.959    0
[17,] 0.975 0.958    0
[18,] 0.975 0.957    0

Somehow the value for RFL_Contig100 (.975, .966) is not picked up by subset when working on the large matrix, and when I find the correct row it has a frequency of 0. But if I take that one line out of the for loop and run it on its own, it makes the correct entry:

> matrix(ncol=3,c(i,j,length(subset(MergedBH,MergedBH$V4.x==i & MergedBH$V4.y==j)$V4.x)))
     [,1]  [,2] [,3]
[1,] 0.975 0.966    1

Any suggestions on what the issue is? I've tried a few different ways of doing this but can't get around the subset function. Is there another way to compute the depth for each bin, so that a 3D barplot can visualize all points at once?

Thanks in advance

Update:

I get the same problem with '[', where a large part of the matrix, between .92 and .98, is not getting processed:

> for3d.mat <- matrix(ncol=3,nrow=0)
> for(i in seq(.8,1,by=.001)){for(j in seq(.8,1,by=.001)){iter.mat <- matrix(ncol=3,c(i,j,length(MergedBH[MergedBH$V4.x ==i & MergedBH$V4.y ==j,]$V4.x)));for3d.mat <- rbind(for3d.mat,iter.mat)}}
> for3d.mat[for3d.mat[,1] == .975 & for3d.mat[,2] == .966,]
 [,1] [,2] [,3]

I am able to use '[' or subset on most of the matrix, but there is a specific range, whether in the original data frame or in for3d.mat, that is not accessible by either subsetting method. Example below:

> for3d.mat[for3d.mat[,1] == .976 & for3d.mat[,2] == .937,]
[1] 0.976 0.937    NA
> for3d.mat[for3d.mat[,1] == .975 & for3d.mat[,2] == .937,]
 [,1] [,2] [,3]

Solution

From ?subset:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

In other words, use [ directly when inside a loop or apply-style function.
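
To make the warning concrete, here is a minimal sketch of one kind of surprise that non-standard evaluation can cause, next to the same selection written with [. The data frame toy and the helpers pick_subset and pick_bracket are hypothetical names made up for this illustration; this is not claimed to be the cause of the behaviour in the question.

```r
# Toy data (hypothetical, not the author's MergedBH)
toy <- data.frame(x = c(0.90, 0.95, 1.00), y = c(1, 2, 3))

# Hypothetical helper using subset(): inside subset(), the name `x` is looked
# up in the data frame first, so `x == x` compares the column with itself
# and every row is kept, whatever value was passed in.
pick_subset <- function(df, x) subset(df, x == x)

# Same intent with `[`: the column (df$x) and the argument (x) are distinct.
pick_bracket <- function(df, x) df[df$x == x, , drop = FALSE]

nrow(pick_subset(toy, 0.95))   # all 3 rows come back
nrow(pick_bracket(toy, 0.95))  # only the 1 matching row
```

With [, what is a column and what is a local variable is spelled out explicitly, which is why it tends to behave more predictably inside loops and functions.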

I think there's a convenience function somewhat like subset in the new dplyr package that you might want to look into if [ becomes too onerous, but [ in conjunction with with() usually works fine.
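
As a hedged sketch of combining [ and with, the following counts every (V4.x, V4.y) bin in one pass instead of looping. The data frame toy, the integer keys kx and ky, and the result name for3d are illustrative assumptions, not the author's data or code. Scaling the values to integer thousandths before matching is an extra design choice made here: exact == comparison against values generated by seq(.8, 1, by=.001) can miss matches because of floating-point rounding (for example, 0.1 + 0.2 == 0.3 is FALSE in R). Whether that is what happened in the question is not confirmed by the source.

```r
# Toy stand-in for MergedBH (hypothetical values, not the author's data)
toy <- data.frame(
  V4.x = c(1.000, 0.975, 0.995, 0.975),
  V4.y = c(1.000, 0.966, 0.944, 0.966)
)

# `with` evaluates the condition using the data frame's columns,
# and `[` does the actual row selection
in_range <- toy[with(toy, V4.x >= 0.8 & V4.y >= 0.8), , drop = FALSE]

# Integer bin keys (thousandths) avoid comparing doubles with `==`
kx <- as.integer(round(in_range$V4.x * 1000))
ky <- as.integer(round(in_range$V4.y * 1000))

# Tabulate all 201 x 201 bins at once, including the empty ones
grid <- 800:1000
counts <- table(x = factor(kx, levels = grid), y = factor(ky, levels = grid))

# Long format: one row per bin, similar to the three-column for3d.mat
for3d <- as.data.frame(counts, responseName = "freq", stringsAsFactors = FALSE)
for3d$x <- as.integer(for3d$x) / 1000
for3d$y <- as.integer(for3d$y) / 1000

counts["975", "966"]  # frequency of the (0.975, 0.966) bin
```

Looking bins up by their integer key, as in counts["975", "966"], sidesteps the equality test on doubles altogether; when a numeric comparison is unavoidable, a tolerance such as abs(a - b) < 1e-9 is a common alternative.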
