Big data read subsamples in R


Problem description

I'm most grateful for your time to read this.

I have an uber-size 30 GB file of 6 million records and 3000 columns (mostly categorical data) in csv format. I want to bootstrap subsamples for multinomial regression, but it's proving difficult even with the 64 GB of RAM in my machine and twice that in swap; the process becomes super slow and halts.

I'm thinking about generating subsample indices in R and feeding them into a system command using sed or awk, but I don't know how to do this. If someone knows of a clean way to do this using just R commands, I would be really grateful.

One problem is that I need to pick complete observations for the subsamples, that is, I need to have all the rows of a particular multinomial observation - observations are not all the same length. I plan to use glmnet and then some fancy transforms to get an approximation to the multinomial case. Another point is that I don't know how to choose the sample size to fit within memory limits.
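The "generate indices in R, feed them to awk" idea from the question can be sketched as follows. This is only an illustration under invented names: the big file is assumed to have the observation id in its first column, and `ids.txt` stands for a file of sampled ids written out from R (e.g. with `writeLines`).

```shell
# Toy stand-ins: big.csv with an id column, ids.txt with sampled ids.
printf 'id,x,y\n1,a,10\n1,b,11\n2,c,12\n3,d,13\n3,e,14\n' > big.csv
printf '1\n3\n' > ids.txt

# Load the sampled ids into an awk hash, then stream the big file once,
# keeping the header plus every row whose first field is a sampled id.
# This keeps each observation complete, however many rows it spans.
awk -F',' 'NR==FNR { keep[$1]; next }
           FNR==1 || ($1 in keep)' ids.txt big.csv > subsample.csv
```

Because awk streams the file, memory use depends on the number of sampled ids, not on the 30 GB input.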

Appreciate your thoughts greatly.

R.version
platform       x86_64-pc-linux-gnu          
arch           x86_64                       
os             linux-gnu                    
system         x86_64, linux-gnu            
status                                      
major          2                            
minor          15.1                         
year           2012                         
month          06                           
day            22                           
svn rev        59600                        
language       R                            
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows   

Yoda

Answer

As themel has pointed out, R is very slow at reading csv files.
If you have sqlite, it really is the best approach, since it appears the data mining is not a one-time task but will happen over multiple sessions, in multiple ways.

Let's look at the options we have.

Doing this in R is about 20 times slower than a tool written in C (on my machine).

This is very slow:

read.csv(file = "filename.csv", header = TRUE, sep = ",")

Convert to a Stata dta file beforehand and load it from there

Not that great, but it should work (I have never tried it on a 30 GB file, so I cannot say for sure).

Using the resource from http://www.stata.com/help.cgi?dta and code from https://svn.r-project.org/R-packages/trunk/foreign/src/stataread.c to read and write, plus http://sourceforge.net/projects/libcsv/
(This has been done in the past; however, I have not used it, so I do not know how well it performs.)

Then using the foreign package (http://cran.r-project.org/web/packages/foreign/index.html), a simple

library(foreign)
whatever <- read.dta("file.dta")

will load the data.

From the MySQL console:

LOAD DATA LOCAL INFILE 'file.csv' INTO TABLE my_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- if the csv file contains a header row

Or:

mysql -e "LOAD DATA INFILE 'ls.dat' INTO TABLE mytable1" mydatabase

Then play from the R console, using the RMySQL R interface to the MySQL database: http://cran.r-project.org/web/packages/RMySQL/index.html

install.packages('RMySQL')

Then play around with something like:

mydb = dbConnect(MySQL(), user=username, password=userpass, dbname=databasename, host=host)
dbListTables(mydb)
record <- dbSendQuery(mydb, "select * from whatever")
dbClearResult(record)
dbDisconnect(mydb)

Using R to do all the sqlite/PostgreSQL/MySQL backend SQL stuff to import the csv (Recommended)

Download from https://code.google.com/p/sqldf/ if you do not have the package,
or svn checkout http://sqldf.googlecode.com/svn/trunk/ sqldf-read-only

From the R console:

install.packages("sqldf")
# shows built in data frames
data() 

# load sqldf into workspace
library(sqldf)
MyCsvFile <- file("file.csv")
Mydataframe <- sqldf("select * from MyCsvFile",
                     dbname = "MyDatabase",
                     file.format = list(header = TRUE, row.names = FALSE))

And away you go!

Personally, I would recommend the library(sqldf) option :-)
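Whichever backend you pick, the "complete observations" requirement from the question maps more naturally onto sampling observation ids than rows. A sketch with the sqlite3 CLI, where the `obs`/`choice` columns and file names are invented:

```shell
# Toy long-format table: one observation (obs) spans a varying number of rows.
printf 'obs,choice\n1,a\n1,b\n2,a\n3,c\n3,a\n3,b\n' > obs.csv
sqlite3 obs.db <<'EOF'
.mode csv
.import obs.csv long_data
EOF

# Bootstrap draw: sample 2 observation ids at random, then keep every
# row belonging to those ids, so each observation stays complete even
# though observations differ in length.
sqlite3 -csv obs.db "
  SELECT * FROM long_data
  WHERE obs IN (SELECT obs FROM long_data
                GROUP BY obs ORDER BY RANDOM() LIMIT 2);"
```

The same query can of course be issued from R through sqldf or RMySQL instead of the shell.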
