Optimize/Vectorize Database Query with R

Problem description

I am attempting to use R to query a large database. Due to the size of the database, I have written the query to fetch 100 rows at a time. My code looks something like:

library(RJDBC)
library(DBI)
library(tidyverse)

options(java.parameters = "-Xmx8000m")

drv <- JDBC("driver name", "driver path.jar")

conn <-
  dbConnect(
    drv,
    "database info",
    "username",
    "password"
  )

query <- "SELECT * FROM some_table"

hc  <- tibble()
res <- dbSendQuery(conn, query)
repeat {
  chunk <- dbFetch(res, 100)        # fetch the next 100 rows of the result set
  if (nrow(chunk) == 0) { break }   # stop once the result set is exhausted
  hc <- bind_rows(hc, chunk)        # append the new chunk to the accumulated rows
  print(nrow(hc))
}

Basically, I would like to write something that does the same thing, but via the combination of a function and lapply. In theory, given the way R processes data via loops, using lapply should speed up the query. Some understanding of the dbFetch function may help, specifically how, in the repeat loop, it doesn't just keep selecting the initial 100 rows over and over.

I have tried the following, but nothing works:

df_list <- lapply(query , function(x) dbGetQuery(conn, x)) 

hc<-tibble()
res<-dbSendQuery(conn,query)
test_query<-function(x){
  chunk<-dbFetch(res,100)
  if(nrow(chunk)==0){break}
  print(nrow(hc))
}
bind_rows(lapply(test_query,res))

Recommended answer

The following works well, as it allows the user to customize the size and number of chunks. Ideally, the function would be vectorized somehow.
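
On the dbFetch point raised in the question: each call to dbFetch on an open result set continues from where the previous fetch stopped, which is why neither the repeat loop nor the lapply version keeps re-reading the first 100 rows. A minimal sketch of that behaviour, reusing the conn and some_table from above:

res <- dbSendQuery(conn, "SELECT * FROM some_table")
first_chunk  <- dbFetch(res, 100)   # rows 1-100 of the result set
second_chunk <- dbFetch(res, 100)   # rows 101-200, not rows 1-100 again
dbClearResult(res)                  # release the result set when done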

I explored getting the number of rows in order to set the chunk number automatically, but I couldn't find a way to do it without actually running the query first. Adding a large number of chunks doesn't add much extra processing time. The performance improvement over the repeat approach depends on the size of the data: the bigger the data, the bigger the improvement.
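
If running one extra query up front is acceptable, a separate COUNT(*) can supply the row count needed to set chunk_number automatically. This is only a sketch of that idea, not part of the original answer, and it assumes the same conn and some_table:

## Hypothetical helper query: count the rows, then size the chunks from it
n_rows       <- dbGetQuery(conn, "SELECT COUNT(*) FROM some_table")[1, 1]
chunk_size   <- 1000
chunk_number <- ceiling(n_rows / chunk_size)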

Chunks of n = 1000 seem to consistently produce the best results. Any suggestions on these points would be much appreciated.

Solution:

library(RJDBC)
library(DBI)
library(dplyr)
library(tidyr)

res <- dbSendQuery(conn, "SELECT * FROM some_table")

## chunk_size * chunk_number must be greater than the total number of rows
chunk_size   <- 1000
chunk_number <- 150

## Fetch one chunk of chunk_size rows; if dbFetch errors (e.g. past the end
## of the result set), return NULL, which bind_rows() simply ignores
run_chunks <-
  function(chunk_number, res, chunk_size) {
    chunk <-
      tryCatch(
        dbFetch(res, chunk_size),
        error = function(e) NULL
      )

    if (!is.null(chunk)) {
      return(chunk)
    }
  }

## Run the fetch chunk_number times and bind the pieces into one data frame
dat <-
  bind_rows(
    lapply(
      1:chunk_number,
      run_chunks,
      res,
      chunk_size
    )
  )
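
Once dat has been assembled, the result set and the connection can be released with the standard DBI calls (not shown in the original answer):

dbClearResult(res)   # free the open result set
dbDisconnect(conn)   # close the database connection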
