R - convert BIG table into matrix by column names

Question

This is an extension to an existing question: Convert table into matrix by column names

I am using the final answer: https://stackoverflow.com/a/2133898/1287275

The original CSV file matrix has about 1.5M rows with three columns ... row index, column index, and a value. All numbers are long integers. The underlying matrix is a sparse matrix about 220K x 220K in size with an average of about 7 values per row.

The original read.table works just fine.

  x <- read.table("/users/wallace/Hadoop_Local/reference/DiscoveryData6Mo.csv", header=TRUE);

My problem comes when I do the reshape command.

  reshape(x, idvar="page_id", timevar="reco", direction="wide")

The CPU hits 100% and there it sits forever. The machine (a Mac) has more memory than R is using, so I don't see why it should take so long to construct a sparse matrix.
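A rough back-of-the-envelope check may help explain the stall. `reshape(..., direction="wide")` builds an ordinary dense data frame with one column per distinct `reco` value, not a sparse matrix. The sketch below assumes 8-byte numeric cells for the dense case, and for the sparse case roughly 12 bytes per stored value plus a 4-byte pointer per column (approximately how a compressed sparse column layout stores its data). The variable names are illustrative only.

```r
# Dimensions and non-zero count as stated in the question.
n_rows <- 220000
n_cols <- 220000
n_nonzero <- 1500000

# Dense storage: every cell is materialised as an 8-byte double.
dense_gb <- n_rows * n_cols * 8 / 1e9

# Sparse column-compressed storage (approximate): one 4-byte row index and
# one 8-byte value per non-zero, plus one 4-byte pointer per column (+1).
sparse_mb <- (n_nonzero * 12 + (n_cols + 1) * 4) / 1e6

dense_gb   # about 387 GB
sparse_mb  # about 19 MB
```

Under these assumptions the wide data frame would need hundreds of gigabytes, while the same data in sparse form fits in a few tens of megabytes.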

I am using the default matrix package. I haven't installed anything extra. I just downloaded R a few days ago, so I should have the latest version.

Suggestions?

Thanks, Wallace

Recommended answer

I would use the sparseMatrix function from the Matrix package. The typical usage is sparseMatrix(i, j, x), where i, j, and x are three vectors of the same length: respectively, the row indices, column indices, and values of the non-zero elements in the matrix. Here is an example where I have tried to match variable names and dimensions to your specifications:

num.pages <- 220000
num.recos <- 230000
N         <- 1500000

df <- data.frame(page_id = sample.int(num.pages, N, replace=TRUE),
                 reco    = sample.int(num.recos, N, replace=TRUE),
                 value   = runif(N))
head(df)
#   page_id   reco     value
# 1   33688  48648 0.3141030
# 2   78750 188489 0.5591290
# 3  158870  13157 0.2249552
# 4   38492  56856 0.1664589
# 5   70338 138006 0.7575681
# 6  160827  68844 0.8375410

library("Matrix")
mat <- sparseMatrix(i = df$page_id,
                    j = df$reco,
                    x = df$value,
                    dims = c(num.pages, num.recos))
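To make the result easier to check by eye, here is a minimal sketch on a toy table with the same three columns. The values are made up for illustration. It also shows one behaviour worth knowing: when the same (i, j) pair appears more than once, `sparseMatrix` by default adds the values together, which can happen in the random example above because the indices are sampled with replacement.

```r
library(Matrix)

# Toy triplet table shaped like the question's data (values are invented).
# The pair (page_id = 1, reco = 2) appears twice on purpose.
x <- data.frame(page_id = c(1, 1, 3, 4, 1),
                reco    = c(2, 5, 1, 5, 2),
                value   = c(10, 20, 30, 40, 5))

m <- sparseMatrix(i = x$page_id,
                  j = x$reco,
                  x = x$value,
                  dims = c(4, 5))

dim(m)      # 4 rows, 5 columns
nnzero(m)   # 4 stored non-zero entries (the duplicate pair was merged)
m[1, 2]     # 15, the sum of the two values given for cell (1, 2)
m[1, 5]     # 20
```

If the real file should never contain repeated (page_id, reco) pairs, comparing `nnzero(m)` with `nrow(x)` is a quick way to spot unexpected duplicates.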
