Error when making a sparse matrix

Problem description

I am facing a problem I do not understand. It is a follow-up on the answers suggested here and here.

I have two identically structured datasets: one I created as a reproducible example, for which the code works, and a real one, for which it does not. After staring at it for hours, I cannot find what is causing the error. The following example works:

library(data.table)
# 25 groups in column a (4 rows each), items 1..40 recycled in column b
df <- data.table(a = rep(seq(1, 25), each = 4),
                 b = rep(seq(1, 40), length.out = 100))
setkey(df, a, b)

This only creates a reproducible example. When I apply the (slightly adjusted) code suggested in the SO answers mentioned above, I get what I am looking for: a sparse matrix that indicates when two elements of column b occur together for a value of column a.

library(Matrix)
s <- sparseMatrix(
  df$a,
  df$b,
  dimnames = list(unique(df$a), unique(df$b)),
  x = 1)
v <- t(s) %*% s
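
To make the meaning of `v` concrete, here is a minimal sketch on made-up toy data (the vectors `a` and `b` below are hypothetical and not part of the question's data). It assumes the Matrix package is installed and uses `crossprod`, which computes the same product as `t(s) %*% s`.

```r
library(Matrix)

# Toy data (hypothetical): three groups in `a`, items 1..3 in `b`
a <- c(1, 1, 2, 2, 3)
b <- c(1, 2, 2, 3, 1)

# Incidence matrix: rows are values of `a`, columns are values of `b`
s <- sparseMatrix(i = a, j = b, x = 1)

# Entry [p, q] counts how many values of `a` contain both item p and item q;
# the diagonal counts how often each item occurs at all
v <- crossprod(s)   # same product as t(s) %*% s
as.matrix(v)
```

Here items 1 and 2 share one group, so `v[1, 2]` is 1, while items 1 and 3 never appear together, so `v[1, 3]` is 0.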

Now I am doing what is, in my eyes, exactly the same thing on my real dataset, which is much longer.

A sample of it, produced with dput, looks like this:

test <- dput(dk[1:50,])
structure(list(pid = c(204L, 204L, 207L, 254L, 254L, 258L, 258L, 
258L, 258L, 258L, 265L, 265L, 269L, 269L, 269L, 269L, 1520L, 
1520L, 1520L, 1520L, 1532L, 1532L, 1534L, 1534L, 1534L, 1534L, 
1539L, 1539L, 1543L, 1543L, 1546L, 1546L, 1546L, 1546L, 1546L, 
1546L, 1546L, 1549L, 1549L, 1549L, 1559L, 1559L, 1559L, 1559L, 
1559L, 1559L, 1559L, 1561L, 1561L, 1561L), cid = c(11023L, 11787L, 
14232L, 14470L, 14480L, 1290L, 1637L, 4452L, 13964L, 14590L, 
17814L, 23453L, 6658L, 10952L, 17259L, 27549L, 11034L, 22748L, 
23345L, 23347L, 10487L, 11162L, 15570L, 15629L, 17983L, 17999L, 
17531L, 22497L, 14425L, 14521L, 11495L, 24948L, 24962L, 24969L, 
24972L, 24973L, 30627L, 17886L, 18428L, 23972L, 13890L, 13936L, 
14432L, 21230L, 21271L, 21384L, 21437L, 341L, 354L, 6302L)), .Names = c("pid", 
"cid"), sorted = c("pid", "cid"), class = c("data.table", "data.frame"
), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x0000000000100788>)

Then, when running the same call, I get an error:

s <- sparseMatrix(test$pid,test$cid,dimnames = list(unique(test$pid), unique(test$cid)),x = 1)

The error (which also occurs with the test dataset) reads as follows:

Error in validObject(r) : 
  invalid class "dgTMatrix" object: length(Dimnames[[1]])' must match Dim[1]

The problem disappears when I remove the dimnames, but I really need them to make sense of the results. I am sure I am missing something obvious. Can someone please tell me what it is?

Recommended answer

sparseMatrix treats 'pid' and 'cid' as row and column positions, so by default the matrix has max(pid) rows and max(cid) columns, whereas unique() returns only one name per distinct value; hence the length mismatch. We can convert the 'pid' and 'cid' columns to factor and coerce back to numeric, or use match with the unique values of each column, to get consecutive row/column indices. These work when creating the sparseMatrix.

test1 <- test[, lapply(.SD, function(x) 
                 as.numeric(factor(x, levels=unique(x))))]

Or we use match:

test1 <- test[, lapply(.SD, function(x) match(x, unique(x)))]

s1 <- sparseMatrix(test1$pid, test1$cid,
                   dimnames = list(unique(test$pid), unique(test$cid)),
                   x = 1)
dim(s1)
#[1] 15 50

s1[1:3, 1:3]
#3 x 3 sparse Matrix of class "dgCMatrix"
#    11023 11787 14232
#204     1     1     .
#207     .     .     1
#254     .     .     .

head(test)
#   pid   cid
#1: 204 11023
#2: 204 11787
#3: 207 14232
#4: 254 14470
#5: 254 14480
#6: 258  1290
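
The following sketch, on hypothetical ids chosen only for illustration (not the question's data), shows why the original call fails and why re-indexing helps. It assumes the Matrix package; the dimnames are passed as character vectors here, which is an extra precaution on my part rather than something the answer requires.

```r
library(Matrix)

# Hypothetical ids with gaps, mimicking 'pid' and 'cid'
pid <- c(204L, 204L, 207L, 254L)
cid <- c(11L, 17L, 11L, 25L)

# Without explicit dims, the matrix spans max(i) rows and max(j) columns ...
dim(sparseMatrix(pid, cid, x = 1))            # 254 x 25
# ... but unique() yields only one name per distinct id
c(length(unique(pid)), length(unique(cid)))   # 3 and 3

# Re-index to 1..k in order of first appearance, so the names line up
i <- match(pid, unique(pid))
j <- match(cid, unique(cid))
s <- sparseMatrix(i, j, x = 1,
                  dimnames = list(as.character(unique(pid)),
                                  as.character(unique(cid))))
s["204", "17"]   # look up a cell by its original ids
```

Because `match(x, unique(x))` numbers values in order of first appearance, the k-th name in `unique(pid)` labels row k, which is what keeps the labels and the indices consistent.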

If we instead want the matrix to span the full row/column index range given in 'test', the dimnames must be as long as the maximum of 'pid' and of 'cid' respectively:

rnm <- seq(max(test$pid))
cnm <- seq(max(test$cid))
s2 <- sparseMatrix(test$pid, test$cid, dimnames=list(rnm, cnm))
dim(s2)
#[1]  1561 30627
s2[1:3, 1:3]
#3 x 3 sparse Matrix of class "ngCMatrix"
# 1 2 3
#1 . . .
#2 . . .
#3 . . .
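
As a small illustration of this full-index variant, the sketch below uses hypothetical ids (not the question's data) and additionally passes `x = 1`, so the result holds numeric values rather than the pattern-only "ngCMatrix" shown above. Ids that never occur simply produce empty rows or columns.

```r
library(Matrix)

# Hypothetical ids with gaps
pid <- c(2L, 2L, 5L)
cid <- c(3L, 7L, 3L)

# One name per possible index, from 1 up to the largest id
s2 <- sparseMatrix(pid, cid, x = 1,
                   dimnames = list(as.character(seq_len(max(pid))),
                                   as.character(seq_len(max(cid)))))
dim(s2)          # 5 x 7
s2["5", "3"]     # cell addressed by the original ids
sum(s2["1", ])   # id 1 never occurs, so its row is empty
```

This keeps the original ids usable as names at the cost of a much larger, mostly empty matrix; with sparse storage the empty cells take essentially no memory, but downstream results such as the cross-product will also carry the unused rows and columns.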
