R unique columns or rows incomparables with NA
Question
Does anyone know whether the incomparables argument of unique() or duplicated() has ever been implemented beyond incomparables=FALSE?
Maybe I don't understand how it is supposed to work...
Anyway, I'm looking for a slick solution that keeps only the unique columns (or rows), where a column counts as a duplicate if it is identical to another column apart from extra NAs. I can brute-force it using cor(), for example, but for tens of thousands of columns this is intractable.
Here's an example; sorry if it's a little messy, but I think it illustrates the point. Make some matrix z:
z <- matrix(sample(c(1:3, NA), 100, replace=TRUE), 10, 10)
colnames(z) <- paste("c", 1:10, sep="")
rownames(z) <- paste("r",1:10, sep="")
Let's add a couple of duplicate columns with extra NAs, and randomize the column order (that way they aren't always at the end).
c3.1 <- z[, 3]
c3.1[sample(1:10, 3)] <- NA
c8.1 <- z[, 8]
c8.1[sample(1:10, 5)] <- NA
z <- cbind(z, c3.1, c8.1)
z <- z[, sample(1:ncol(z))]
So I could sort by the number of missing values, and then it would seem as though duplicated() or unique() should work, but neither of them will ignore missing values.
missing <- apply(z, 2, function(x) {length(which(is.na(x)))})
z.sorted <- z[, order(missing)]
z.sorted[,!duplicated(z.sorted,MARGIN=2)]
unique(z.sorted,MARGIN=2)
I figured this is what the incomparables argument was specifically for, but it doesn't appear to be implemented yet:
z.sorted[,!duplicated(z.sorted,MARGIN=2,incomparables=NA)]
unique(z.sorted,MARGIN=2,incomparables=NA)
I know I will likely find a less elegant solution soon enough; I guess I'm really asking why this hasn't been implemented yet, or whether I'm just using it wrong. I seem to run into this quite often, yet I searched for quite a while without finding an answer. Any thoughts?
Recommended answer
As you suspect, incomparables != FALSE is not yet implemented for the data.frame and matrix methods of unique. It is implemented in the default method, which is used for vectors without dims. E.g.:
unique(c(1, 2, 2, 3, 3, 3, NA, NA, NA), incomparables=2)
## [1] 1 2 2 3 NA
unique(c(1, 2, 2, 3, 3, 3, NA, NA, NA), incomparables=NA)
## [1] 1 2 3 NA NA NA
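The same argument is accepted by the default method of duplicated(), which may be easier to reason about because it returns one logical flag per element. A minimal sketch, reusing the vector from the example above (the variable names x, dup_default and dup_na_incomparable are only illustrative):

```r
# Same vector as in the unique() example above.
x <- c(1, 2, 2, 3, 3, 3, NA, NA, NA)

# Default behaviour: NA is compared like any other value,
# so the second and third NA are flagged as duplicates.
dup_default <- duplicated(x)

# With incomparables = NA, NA values are never flagged as duplicates.
dup_na_incomparable <- duplicated(x, incomparables = NA)

# Dropping the flagged elements gives the same result as
# unique(x, incomparables = NA).
x[!dup_na_incomparable]
```

Note that this only applies to plain vectors; it does not help with whole columns of a matrix, which is what the question asks about.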
Take a look at the source of unique.matrix versus unique.default (just type the function names into the console and hit Enter, or press F2 in RStudio to open the source in a new pane).
In your case, you could use outer to create a matrix indicating whether particular pairs of rows/columns are the same or not, disregarding NAs.
same <- outer(seq_len(ncol(z)), seq_len(ncol(z)),
              Vectorize(function(x, y) all(z[, x] == z[, y], na.rm=TRUE)))
same
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
##  [1,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [2,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [4,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [5,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [6,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [8,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [10,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
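To see what this comparison does on something small enough to check by hand, here is a sketch on a made-up three-column matrix (m and same_m are hypothetical names, not part of the question's data). It also illustrates a caveat: "equal wherever both are non-NA" is not transitive, so a column with NAs can match two columns that do not match each other.

```r
# Tiny illustrative matrix: column b has an NA exactly where a and c differ.
m <- cbind(a = c(1, 2, 3),
           b = c(1, NA, 3),
           c = c(1, 5, 3))

# Pairwise comparison of columns, ignoring positions where either value is NA.
same_m <- outer(seq_len(ncol(m)), seq_len(ncol(m)),
                Vectorize(function(i, j) all(m[, i] == m[, j], na.rm = TRUE)))
dimnames(same_m) <- list(colnames(m), colnames(m))

same_m["a", "b"]  # a and b agree on rows 1 and 3
same_m["b", "c"]  # b and c agree on rows 1 and 3
same_m["a", "c"]  # a and c differ on row 2
```

In the same way, a column that is entirely NA would match every other column, so it may be worth checking for such columns before relying on the result.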
Then, if you want to keep only those columns that are the same as, e.g., the second column (which is column c8.1 for me; see the bottom of this post for the full z matrix I used), you can do:
z[, same[2, ]] # or, equivalently, z[, same[, 2]]
##     c8.1 c8
## r1     2  2
## r2     1  1
## r3    NA  3
## r4    NA  1
## r5     3  3
## r6    NA  1
## r7     2  2
## r8    NA  1
## r9     3  3
## r10   NA  1
To reduce the matrix to the set of columns that is unique (ignoring NAs), keeping from each group of matching columns the one with the fewest NAs, you can then do:
z[, unique(sapply(apply(same, 2, which), function(x)
  x[which.min(colSums(is.na(z))[x])]))]
##     c7 c8 c3 c1 c6 c10 c2 c9 c4
## r1   2  2  1  2  1   1  1  2 NA
## r2   3  1  3  1  3  NA  1  2  2
## r3   2  3  2  3  1  NA  2  1 NA
## r4   2  1  1  2  2   1  3 NA  2
## r5  NA  3  2  1  3   2 NA NA  3
## r6   2  1  2  2  1   1  2  1 NA
## r7   2  2  2  2 NA   3  1  2  2
## r8  NA  1  1  3  2  NA  1 NA  1
## r9   1  3  3  2 NA   2  1 NA  2
## r10 NA  1  1 NA  1   1  1  2  3
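One thing to watch with the one-liner above: apply() simplifies its result to a matrix when every column happens to have the same number of matches, in which case sapply() would iterate over single indices instead of groups. The following is a sketch of a more defensive variant wrapped in a function; keep_least_na is a hypothetical helper name, and the small matrix m is made up to trigger exactly that case.

```r
# Hypothetical helper: keep one column per group of columns that are equal
# ignoring NAs, choosing the column with the fewest NAs in each group.
keep_least_na <- function(z) {
  n <- ncol(z)
  same <- outer(seq_len(n), seq_len(n),
                Vectorize(function(i, j) all(z[, i] == z[, j], na.rm = TRUE)))
  n_missing <- colSums(is.na(z))
  # lapply always returns a list, even if all groups have the same size.
  groups <- lapply(seq_len(n), function(j) which(same[, j]))
  keep <- unique(vapply(groups,
                        function(g) g[which.min(n_missing[g])],
                        integer(1)))
  # drop = FALSE keeps a matrix even if only one column survives.
  z[, keep, drop = FALSE]
}

# Every column here matches exactly one other column, so each group has size 2.
m <- cbind(a  = c(1, 2, 3),
           a2 = c(1, NA, 3),
           b  = c(4, 5, 6),
           b2 = c(NA, 5, 6))
res <- keep_least_na(m)
colnames(res)
```

Like the original, this still builds the full pairwise matrix, so it does not by itself address the cost for tens of thousands of columns.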
For reference, here is the z I'm using:
    c7 c8.1 c3 c1 c5 c10 c8 c6 c2 c3.1 c9 c4
r1   2    2  1  2  1   1  2  1  1    1  2 NA
r2   3    1  3  1  3  NA  1  3  1    3  2  2
r3   2   NA  2  3  1  NA  3  1  2    2  1 NA
r4   2   NA  1  2 NA   1  1  2  3   NA NA  2
r5  NA    3  2  1  3   2  3  3 NA    2 NA  3
r6   2   NA  2  2  1   1  1  1  2    2  1 NA
r7   2    2  2  2  1   3  2 NA  1    2  2  2
r8  NA   NA  1  3 NA  NA  1  2  1   NA NA  1
r9   1    3  3  2  1   2  3 NA  1   NA NA  2
r10 NA   NA  1 NA NA   1  1  1  1    1  2  3