Efficient Bulk INSERT/COPY into table from HashMap
Question
Task:
Given this HashMap structure: Map<String, Map<String, String>> mainMap = new HashMap<>()
I want to INSERT or COPY each value of the inner Map into its own cell in my database.
- The size() of mainMap is 50,000.
- The size() of the inner Map is 50.
- The table to be inserted into has 50 columns.
- Each column's header is the key for the inner Map.
Edit: Originally the user uploads a large spreadsheet with 35 of the 50 columns. I then "cleanse" that data with various formatting, and I add my own 15 new pairs into the inner Map for each mainMap entry. I can't directly COPY from the user's source file to my database without cleansing/formatting/adding.
Once I'm done iterating the spreadsheet and building mainMap, that's when I need to insert into my database table efficiently.
Research:
I've read that COPY is the best approach to initially bulk populate a table, but I'm stuck on whether my requirements warrant that command.
This post states that Postgres has a Prepared Statement parameter limit of 34464 per query.
I'm assuming I'd need 50 x 50,000 = 2,500,000 parameters in total, which works out to ~73 individual queries!
Questions:
- Is COPY the proper approach here instead of all these parameters?
- If so, do I convert the HashMap values into a .sql file, save it on disk on my web app server, reference that in my COPY command, and then delete the temp file? Or can I directly pass a concatenated String into it, without risking SQL injection?
This command will be happening often, hence the need to optimize it.
I can't find any examples of converting Java objects into a compatible Postgres text file format, so any feedback helps.
How would you approach this problem?
Additional info:
My table is pre-existing and can't be dropped, since it's the back-end for my webapp and multiple users are connected at any given time.
I understand that temporarily removing indexes prior to using COPY can increase performance, but I only need a maximum of 50,000 rows inserted or copied at a time, not millions.
StackExchange told me to ask here.
Answer
While Java is certainly not the best option for this kind of ETL, it is certainly possible, with rather little overhead, using standard INSERT statements and prepared queries:
conn.setAutoCommit(false);
PreparedStatement stmt = conn.prepareStatement(
        "INSERT INTO my_table (col_a, col_b, ...)"
        + " VALUES (?, ?, ...)");
int batchSize = 1000;
int rows = 0;
for (Map<String, String> values : mainMap.values()) {
    int i = 0;
    stmt.setString(++i, values.get("col_a"));
    stmt.setString(++i, values.get("col_b"));
    // ...
    stmt.addBatch(); // add the row to the batch
    if (++rows % batchSize == 0) {
        // batch-sizing: execute...
        stmt.executeBatch();
    }
}
if (rows % batchSize != 0) {
    // a last execution if necessary...
    stmt.executeBatch();
}
conn.commit(); // atomic action - if any record fails, the whole import will fail
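As a side note beyond the original answer: the PostgreSQL JDBC driver can rewrite such batches into multi-row INSERT statements when the reWriteBatchedInserts connection property is enabled, which typically makes executeBatch() noticeably faster. A minimal sketch of obtaining such a connection; the host, database name and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class PgConnections {
    // Opens a connection with batched-insert rewriting enabled.
    // URL, user and password below are placeholders for illustration only.
    static Connection open() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb?reWriteBatchedInserts=true",
                "app_user", "secret");
    }
}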
Alternatively, you could write the Map out to a file and use the CopyManager, but I seriously doubt this would be any faster than the batched inserts (it would be a different story for millions of rows, though).
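For completeness, here is a minimal sketch of that CopyManager route, assuming table and column names like those above (abbreviated to two columns) and values that never contain tabs, newlines or backslashes; real code would have to escape those, and all names here are placeholders:

import java.io.Reader;
import java.io.StringReader;
import java.sql.Connection;
import java.util.Map;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyExample {
    // Streams mainMap to the server with COPY ... FROM STDIN (tab-delimited
    // text format). Escaping of delimiters/newlines in values is omitted.
    static long copyAll(Connection conn, Map<String, Map<String, String>> mainMap)
            throws Exception {
        StringBuilder sb = new StringBuilder();
        for (Map<String, String> values : mainMap.values()) {
            sb.append(values.get("col_a")).append('\t')
              .append(values.get("col_b")) // ... remaining columns
              .append('\n');
        }
        CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
        Reader reader = new StringReader(sb.toString());
        return copyManager.copyIn(
                "COPY my_table (col_a, col_b) FROM STDIN WITH (FORMAT text)", reader);
    }
}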