Efficient way to insert data frame from R to SQL
Question
I have a data frame with 10 million rows and 5 columns that I want to insert into an existing SQL table. Note that I do not have permission to create a table; I can only insert values into an existing table. I'm currently using RODBCext:
query_ch <- "insert into [blah].[dbo].[blahblah]
(col1, col2, col3, col4, col5)
values (?,?,?,?,?)"
sqlExecute(channel, query_ch, my_data)
This takes way too long (more than 10 hours). Is there a way to accomplish this faster?
Accepted answer
TL;DR: LOAD DATA INFILE is one order of magnitude faster than multiple INSERT statements, which are themselves one order of magnitude faster than single INSERT statements.
I benchmark below the three main strategies for importing data from R into MySQL:
- single INSERT statements, as in the question:
INSERT INTO test (col1,col2,col3) VALUES (1,2,3)
- multiple INSERT statements, formatted like so:
INSERT INTO test (col1,col2,col3) VALUES (1,2,3),(4,5,6),(7,8,9)
- a LOAD DATA INFILE statement, i.e. loading a previously written CSV file into MySQL:
LOAD DATA INFILE 'the_dump.csv' INTO TABLE test
I use RMySQL here, but any other MySQL driver should lead to similar results. The SQL table was instantiated with:
CREATE TABLE `test` (
`col1` double, `col2` double, `col3` double, `col4` double, `col5` double
) ENGINE=MyISAM;
The connection and test data were created in R with:
library(RMySQL)
con = dbConnect(MySQL(),
user = 'the_user',
password = 'the_password',
host = '127.0.0.1',
dbname='test')
n_rows = 1000000 # number of tuples
n_cols = 5 # number of fields
dump = matrix(runif(n_rows*n_cols), ncol=n_cols, nrow=n_rows)
colnames(dump) = paste0('col',1:n_cols)
Benchmarking single INSERT statements:
before = Sys.time()
for (i in 1:nrow(dump)) {
  query = paste0('INSERT INTO test (', paste0(colnames(dump), collapse = ','), ') VALUES (', paste0(dump[i,], collapse = ','), ');')
  dbExecute(con, query)
}
time_naive = Sys.time() - before
=> this takes about 4 minutes on my computer
Benchmarking multiple INSERT statements:
before = Sys.time()
chunksize = 10000 # arbitrary chunk size
for (i in 1:ceiling(nrow(dump)/chunksize)) {
  query = paste0('INSERT INTO test (', paste0(colnames(dump), collapse = ','), ') VALUES ')
  vals = NULL
  for (j in 1:chunksize) {
    k = (i-1)*chunksize + j
    if (k <= nrow(dump)) {
      vals[j] = paste0('(', paste0(dump[k,], collapse = ','), ')')
    }
  }
  query = paste0(query, paste0(vals, collapse = ','))
  dbExecute(con, query)
}
time_chunked = Sys.time() - before
=> this takes about 40 seconds on my computer
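As a side note (a sketch of my own, not part of the original answer): the inner loop that assembles the VALUES list can also be written with a single apply over the chunk; the helper name build_values is illustrative:

```r
# Build a multi-row VALUES clause from a numeric matrix chunk.
# build_values() is an illustrative helper, not from the original answer.
build_values <- function(chunk) {
  rows <- apply(chunk, 1, function(r) paste0("(", paste0(r, collapse = ","), ")"))
  paste0(rows, collapse = ",")
}

chunk <- matrix(1:6, ncol = 3, byrow = TRUE)  # rows (1,2,3) and (4,5,6)
build_values(chunk)
# "(1,2,3),(4,5,6)"
```

The resulting string can be pasted after the INSERT INTO ... VALUES prefix exactly as in the benchmark above; behavior is the same, only the string-building is vectorized.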
Benchmarking the LOAD DATA INFILE statement:
before = Sys.time()
write.table(dump, 'the_dump.csv',
row.names = F, col.names=F, sep='\t')
query = "LOAD DATA INFILE 'the_dump.csv' INTO TABLE test"
dbSendStatement(con, query)
time_infile = Sys.time() - before
=> this takes about 4 seconds on my computer
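One detail worth noting (my observation, not from the original answer): write.table with sep = '\t' matches LOAD DATA INFILE's default field terminator, which is why no FIELDS TERMINATED BY clause is needed above. A quick sanity check of the dump format, using a throwaway temp file:

```r
# Write a tiny matrix the same way the benchmark writes its dump,
# then inspect the first line of the file. Tab-separated, no headers.
m <- matrix(c(1.5, 2.5, 3.5, 4.5), ncol = 2)  # col1 = 1.5, 2.5; col2 = 3.5, 4.5
f <- tempfile(fileext = ".csv")
write.table(m, f, row.names = FALSE, col.names = FALSE, sep = "\t")
readLines(f)[1]
# "1.5\t3.5"
```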
Crafting your SQL query to handle many insert values is the simplest way to improve performance. Transitioning to LOAD DATA INFILE will lead to optimal results. Good performance tips can be found in this page of the MySQL documentation.