Big data read subsamples in R


Problem description

I'm most grateful for your time to read this.

I have an uber-size 30 GB file of 6 million records and 3000 columns (mostly categorical data) in csv format. I want to bootstrap subsamples for multinomial regression, but it's proving difficult even with the 64 GB of RAM in my machine and twice that in swap; the process becomes super slow and halts.

I'm thinking about generating subsample indices in R and feeding them into a system command using sed or awk, but I don't know how to do this. If someone knows of a clean way to do this using just R commands, I would be really grateful.

One problem is that I need to pick complete observations for the subsamples, that is, I need to have all the rows of a particular multinomial observation - observations are not all the same length. I plan to use glmnet and then some fancy transforms to get an approximation to the multinomial case. Another point is that I don't know how to choose the sample size to fit within memory limits.
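The "generate indices in R, feed them to awk" idea from the question can be sketched as follows. This is only an illustration under invented names: the big file is assumed to have the observation id in its first column, and `ids.txt` stands for a file of sampled ids written out from R (e.g. with `writeLines`).

```shell
# Toy stand-ins: big.csv with an id column, ids.txt with sampled ids.
printf 'id,x,y\n1,a,10\n1,b,11\n2,c,12\n3,d,13\n3,e,14\n' > big.csv
printf '1\n3\n' > ids.txt

# Load the sampled ids into an awk hash, then stream the big file once,
# keeping the header plus every row whose first field is a sampled id.
# This keeps each observation complete, however many rows it spans.
awk -F',' 'NR==FNR { keep[$1]; next }
           FNR==1 || ($1 in keep)' ids.txt big.csv > subsample.csv
```

Because awk streams the file, memory use depends on the number of sampled ids, not on the 30 GB input.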

Appreciate your thoughts greatly.

R.version
platform       x86_64-pc-linux-gnu          
arch           x86_64                       
os             linux-gnu                    
system         x86_64, linux-gnu            
status                                      
major          2                            
minor          15.1                         
year           2012                         
month          06                           
day            22                           
svn rev        59600                        
language       R                            
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows   

Yoda

Answer

As themel has pointed out, R is very slow at reading csv files.
If you have sqlite, it really is the best approach, since it appears the data mining is not a one-time task but will happen over multiple sessions, in multiple ways.

Let's look at the options we have.

Doing this in R is about 20 times slower than a tool written in C (on my machine).

This is very slow:

read.csv(file = "filename.csv", header = TRUE, sep = ",")

Convert to a Stata dta file beforehand and load it from there

Not that great, but it should work (I have never tried it on a 30 GB file, so I cannot say for sure).

Using the resource from http://www.stata.com/help.cgi?dta and code from https://svn.r-project.org/R-packages/trunk/foreign/src/stataread.c to read and write, plus http://sourceforge.net/projects/libcsv/
(This has been done in the past; however, I have not used it, so I do not know how well it performs.)

Then using the foreign package (http://cran.r-project.org/web/packages/foreign/index.html), a simple

library(foreign)
whatever <- read.dta("file.dta")

will load the data.

From the MySQL console:

LOAD DATA LOCAL INFILE 'file.csv' INTO TABLE my_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- if the csv file contains a header row

Or:

mysql -e "LOAD DATA INFILE 'ls.dat' INTO TABLE mytable1" mydatabase

Then play from the R console, using the RMySQL R interface to the MySQL database: http://cran.r-project.org/web/packages/RMySQL/index.html

install.packages('RMySQL')

Then play around with something like:

mydb = dbConnect(MySQL(), user=username, password=userpass, dbname=databasename, host=host)
dbListTables(mydb)
record <- dbSendQuery(mydb, "select * from whatever")
dbClearResult(record)
dbDisconnect(mydb)

Using R to do all the sqlite/PostgreSQL/MySQL backend SQL stuff to import the csv (Recommended)

Download from https://code.google.com/p/sqldf/ if you do not have the package,
or svn checkout http://sqldf.googlecode.com/svn/trunk/ sqldf-read-only

From the R console:

install.packages("sqldf")
# shows built in data frames
data() 

# load sqldf into workspace
library(sqldf)
MyCsvFile <- file("file.csv")
Mydataframe <- sqldf("select * from MyCsvFile",
                     dbname = "MyDatabase",
                     file.format = list(header = TRUE, row.names = FALSE))

And away you go!

Personally, I would recommend the library(sqldf) option :-)
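Whichever backend you pick, the "complete observations" requirement from the question maps more naturally onto sampling observation ids than rows. A sketch with the sqlite3 CLI, where the `obs`/`choice` columns and file names are invented:

```shell
# Toy long-format table: one observation (obs) spans a varying number of rows.
printf 'obs,choice\n1,a\n1,b\n2,a\n3,c\n3,a\n3,b\n' > obs.csv
sqlite3 obs.db <<'EOF'
.mode csv
.import obs.csv long_data
EOF

# Bootstrap draw: sample 2 observation ids at random, then keep every
# row belonging to those ids, so each observation stays complete even
# though observations differ in length.
sqlite3 -csv obs.db "
  SELECT * FROM long_data
  WHERE obs IN (SELECT obs FROM long_data
                GROUP BY obs ORDER BY RANDOM() LIMIT 2);"
```

The same query can of course be issued from R through sqldf or RMySQL instead of the shell.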
