在R中并行化SQL查询 [英] Parallelizing SQL queries in R

查看:60
本文介绍了在R中并行化SQL查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有六个SQL查询,我通过R编写脚本,每个查询都花费很长时间(每个〜30分钟).每个查询返回后,我便会处理一些标准报告的数据.

I have six SQL queries that I script though R that each take a very long time (~30 minutes each). Once each query returns I then manipulate the data for some standard reports.

我想做的是使用我的多核计算机从R并行运行这些SQL请求.

What I'd like to do is use my multicore machine to run these SQL requests in parallel from R.

我在带有Oracle DB的Windows机器上.我正在关注博客帖子以使用doSNOW和foreach尝试拆分这些请求,这是我可以在

I'm on a Windows machine with a Oracle DB. I was following a blog post to use doSNOW and foreach to try and split these requests and this is the best thing I can find on stackoverflow.

我已经能够使该过程适用于foreach的非并行%do%版本,但不适用于%dopar%.与%dopar%一起,它仅返回一个空集.下面是设置表并运行查询的代码,因此您可以看到发生了什么.如果基本代码过多,请提前道歉.

I've been able to get the process to work for the non-parallel %do% version of foreach but not the %dopar%. With %dopar% it just returns an empty set. Below is code that sets up tables and runs the queries so you can see what happens. Apologies in advance if there's too much basic code.

我看过其他一些R软件包,但没有找到明显的解决方案.另外,如果您有更好的方法来管理这种过程,我也很想听听它-请记住,我是一名分析师,而不是计算机科学家.谢谢!

I've looked at some of the other R packages but didn't see an obvious solution. Also if you have a better way to manage this kind of process I'd be interested to hear it - just keep in mind I'm an analyst not a computer scientist. Thanks!

#Creating a cluster
library(doSNOW)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl)

#Connecting to database through RODBC
ch=odbcConnect("",pwd = "xxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)

#Creating database tables for example purposes
qryA1 <- "create table temptable(test int)" 
qryA2 <- "insert into temptable(test) values((1))" 
qryA3 <- "select * from temptable" 
qryA4 <- "drop table temptable" 
qryB1 <- "create table temptable2(test int)" 
qryB2 <- "insert into temptable2(test) values((2))" 
qryB3 <- "select * from temptable2" 
qryB4 <- "drop table temptable2"  

sqlQuery(ch, qryA1) 
sqlQuery(ch, qryA2) 
doesItWork <- sqlQuery(ch, qryA3) 
doesItWork
sqlQuery(ch, qryB1) 
sqlQuery(ch, qryB2) 
doesItWork <- sqlQuery(ch, qryB3) 
doesItWork

result = c()
output = c()
databases <- list('temptable','temptable2')


#Non-parallel version of foreach
system.time(
foreach(i = 1:2)%do%{
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
output[i] = result
}
) 

output

#Parallel version of foreach

outputPar = c()

system.time(
foreach(i = 1:2)%dopar%{
#Connecting to database through RODBC
ch=odbcConnect(dsn ,pwd = "xxxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
outputPar[i] = result
}
) 

outputPar

sqlQuery(ch, qryA4)
sqlQuery(ch, qryB4) 

推荐答案

在串行foreach循环中进行赋值outputPar[i] = result时,这是可以的(但实际上不是foreach的预期用途).当您在并行循环中进行此分配时,这是不正常的.参见 http://tolstoy.newcastle.edu.au/R/e10/help/10/04/3237.html ,针对大卫·史密斯(David Smith)在《革命》杂志上回答的类似问题.

When you make the assignment outputPar[i] = result inside the serial foreach loop, this is OK (but not really the intended use of foreach). When you make this assignment in the parallel loop, it is not OK. See http://tolstoy.newcastle.edu.au/R/e10/help/10/04/3237.html for a similar question answered by David Smith at Revolution.

作为解决方案,

system.time(
  outputPar <- foreach(i = 1:2, .packages="RODBC")%dopar%{
#Connecting to database through RODBC
  ch=odbcConnect(dsn ,pwd = "xxxxxx", believeNRows=FALSE)
#Test connection
  odbcGetInfo(ch)
  result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
  result
}
)

这篇关于在R中并行化SQL查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆