Efficient way to insert data frame from R to SQL

Question

I have a data frame with 10 million rows and 5 columns that I want to insert into an existing SQL table. Note that I do not have permission to create a table; I can only insert values into an existing table. I'm currently using RODBCext:

query_ch <- "insert into [blah].[dbo].[blahblah] 
               (col1, col2, col3, col4, col5)
               values (?,?,?,?,?)"

sqlExecute(channel, query_ch, my_data) 

This takes way too long (more than 10 hours). Is there a way to accomplish this faster?

Answer

TL;DR: LOAD DATA INFILE is one order of magnitude faster than multiple INSERT statements, which are themselves one order of magnitude faster than single INSERT statements.

I benchmark below the three main strategies for importing data from R into MySQL:

  1. single insert statements, as in the question:

INSERT INTO test (col1,col2,col3) VALUES (1,2,3)

  2. multiple insert statements, formatted like so:

INSERT INTO test (col1,col2,col3) VALUES (1,2,3),(4,5,6),(7,8,9)

  3. load data infile statement, i.e. loading a previously written CSV file into MySQL:

LOAD DATA INFILE 'the_dump.csv' INTO TABLE test


I use RMySQL here, but any other MySQL driver should lead to similar results. The SQL table was instantiated with:

CREATE TABLE `test` (
  `col1` double, `col2` double, `col3` double, `col4` double, `col5` double
) ENGINE=MyISAM;

The connection and test data were created in R with:

library(RMySQL)
con = dbConnect(MySQL(),
                user = 'the_user',
                password = 'the_password',
                host = '127.0.0.1',
                dbname='test')

n_rows = 1000000 # number of tuples
n_cols = 5 # number of fields
dump = matrix(runif(n_rows*n_cols), ncol=n_cols, nrow=n_rows)
colnames(dump) = paste0('col',1:n_cols)


Benchmarking single insert statements:

before = Sys.time()
for (i in 1:nrow(dump)) {
  query = paste0('INSERT INTO test (',paste0(colnames(dump),collapse = ','),') VALUES (',paste0(dump[i,],collapse = ','),');')
  dbExecute(con, query)
}
time_naive = Sys.time() - before 

=> this takes about 4 minutes on my computer

Benchmarking multiple insert statements:

before = Sys.time()
chunksize = 10000 # arbitrary chunk size
for (i in 1:ceiling(nrow(dump)/chunksize)) {
  query = paste0('INSERT INTO test (',paste0(colnames(dump),collapse = ','),') VALUES ')
  vals = NULL
  for (j in 1:chunksize) {
    k = (i-1)*chunksize+j
    if (k <= nrow(dump)) {
      vals[j] = paste0('(', paste0(dump[k,],collapse = ','), ')')
    }
  }
  query = paste0(query, paste0(vals,collapse=','))
  dbExecute(con, query)
}
time_chunked = Sys.time() - before 

=> this takes about 40 seconds on my computer

Benchmarking the load data infile statement:

before = Sys.time()
write.table(dump, 'the_dump.csv',
          row.names = F, col.names=F, sep='\t')
query = "LOAD DATA INFILE 'the_dump.csv' INTO TABLE test"
dbSendStatement(con, query)
time_infile = Sys.time() - before 

=> this takes about 4 seconds on my computer

Crafting your SQL query to handle many insert values is the simplest way to improve performance. Transitioning to LOAD DATA INFILE will lead to optimal results. Good performance tips can be found in the MySQL documentation.
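Back in the asker's setting (SQL Server reached over an ODBC channel, with no permission to create tables, so a server-side bulk load may not be available), strategy 2 can be adapted directly. A minimal sketch, assuming the `channel` and `my_data` from the question, all-numeric columns, and SQL Server's limit of 1000 rows per VALUES list:

```r
# Sketch: multi-row INSERTs against the asker's existing SQL Server table,
# sent over the RODBC channel from the question.
library(RODBC)

chunksize <- 1000  # SQL Server allows at most 1000 rows per VALUES list
n <- nrow(my_data)
for (start in seq(1, n, by = chunksize)) {
  chunk <- my_data[start:min(start + chunksize - 1, n), , drop = FALSE]
  # Build "(v1,v2,v3,v4,v5)" for each row; this assumes numeric values --
  # character columns would need quoting and escaping.
  vals <- apply(chunk, 1, function(r) paste0("(", paste(r, collapse = ","), ")"))
  query <- paste0("INSERT INTO [blah].[dbo].[blahblah] (col1,col2,col3,col4,col5) VALUES ",
                  paste(vals, collapse = ","))
  sqlQuery(channel, query)
}
```

This trades the parameterized safety of `sqlExecute` for far fewer round-trips; for trusted, numeric-only data that is usually an acceptable trade.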
