doing PCA on very large data set in R


Question

I have a very large training set (~2Gb) in a CSV file. The file is too large to read directly into memory (read.csv() brings the computer to a halt) and I would like to reduce the size of the data file using PCA. The problem is that (as far as I can tell) I need to read the file into memory in order to run a PCA algorithm (e.g., princomp()).

I have tried the bigmemory package to read the file in as a big.matrix, but princomp doesn't work on big.matrix objects, and it doesn't seem that a big.matrix can be converted into something like a data.frame.

Is there a way of running princomp on a large data file that I'm missing?

I'm a relative novice at R, so some of this may be obvious to more seasoned users (apologies in advance).

Thanks for any info.

Answer

The way I solved it was by calculating the sample covariance matrix iteratively. That way you only need a subset of the data at any point in time. Reading in just a subset can be done with readLines: open a connection to the file and read it chunk by chunk. The algorithm looks roughly like this (it is a two-pass algorithm):

Calculate the mean values per column (assuming that the columns are the variables):

  1. Open a file connection (con = file(..., open = "r"))
  2. Read 1000 lines (readLines(con, n = 1000))
  3. Calculate the sums per column
  4. Add those sums to an accumulator variable (col_sums = col_sums + new_sums)
  5. Repeat 2-4 until end of file.
  6. Divide by the number of rows to get the means.
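The first pass above can be sketched as follows, assuming a headerless, comma-separated numeric file; the function name and the chunk size are illustrative:

```r
# First pass: per-column means of a large CSV, read in chunks via readLines.
# Assumes a headerless file of purely numeric, comma-separated values.
chunked_col_means <- function(path, chunk_size = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  col_sums <- NULL
  n_rows <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break            # end of file
    chunk <- do.call(rbind, lapply(strsplit(lines, ","), as.numeric))
    col_sums <- if (is.null(col_sums)) colSums(chunk)
                else col_sums + colSums(chunk)
    n_rows <- n_rows + nrow(chunk)
  }
  col_sums / n_rows                          # divide sums by row count
}
```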

Calculate the covariance matrix:

  1. Open a file connection (con = file(..., open = "r"))
  2. Read 1000 lines (readLines(con, n = 1000))
  3. Centre each chunk by subtracting the column means, then calculate all cross-products using crossprod
  4. Add those cross-products to an accumulator variable
  5. Repeat 2-4 until end of file.
  6. Divide by the number of rows minus 1 to get the covariance.
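The second pass can be sketched like this, assuming the same headerless numeric CSV and a vector of column means obtained in the first pass (again, names and chunk size are illustrative):

```r
# Second pass: sample covariance of a large CSV, read in chunks.
# `col_means` is assumed to come from the first (mean-computation) pass.
chunked_cov <- function(path, col_means, chunk_size = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  xtx <- 0
  n_rows <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    chunk <- do.call(rbind, lapply(strsplit(lines, ","), as.numeric))
    centred <- sweep(chunk, 2, col_means)    # subtract the global column means
    xtx <- xtx + crossprod(centred)          # accumulate t(X) %*% X
    n_rows <- n_rows + nrow(chunk)
  }
  xtx / (n_rows - 1)                         # sample covariance matrix
}
```

Because the cross-products are accumulated chunk by chunk, only one chunk plus the nvar-by-nvar accumulator ever lives in memory.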

When you have the covariance matrix, just call princomp with covmat = your_covmat and princomp will skip calculating the covariance matrix itself.
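As a quick illustration on a small in-memory matrix: princomp accepts a precomputed covariance via its covmat argument, so no raw data is needed at that point. Note that princomp on raw data uses divisor n for the covariance while cov() uses n - 1, so the reported standard deviations differ by a factor of sqrt((n - 1)/n).

```r
set.seed(42)
X <- matrix(rnorm(200), ncol = 4)      # 50 rows, 4 variables

pc_cov  <- princomp(covmat = cov(X))   # PCA from the covariance matrix alone
pc_data <- princomp(X)                 # PCA from the raw data, for comparison

# pc_cov$sdev * sqrt((n - 1) / n) matches pc_data$sdev, since princomp(X)
# divides the covariance by n rather than n - 1.
```

With only covmat supplied, princomp cannot return scores, but the loadings and standard deviations are all you need to project new chunks of data onto the components.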

In this way the datasets you can process can be much, much larger than your available RAM. During the iterations, memory usage is roughly what one chunk takes (e.g., 1000 rows); after that, it is limited to the covariance matrix (nvar * nvar doubles).
