Efficient Bulk INSERT/COPY into table from HashMap


Question

Task:

Given this HashMap structure: Map<String, Map<String, String>> mainMap = new HashMap<>()

I want to INSERT or COPY each value of the inner Map into its own cell in my database.


  • The size() of mainMap is 50,000.
  • The size() of the inner Map is 50.
  • The table to be inserted into has 50 columns.
  • Each column's header is the key for the inner Map.
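
For illustration, here is a minimal sketch of what that structure could look like (the outer key and column names below are hypothetical placeholders, not from the original post):

import java.util.HashMap;
import java.util.Map;

Map<String, Map<String, String>> mainMap = new HashMap<>(); // ~50,000 entries

Map<String, String> innerMap = new HashMap<>(); // 50 key/value pairs, one per table column
innerMap.put("col_a", "cleansed value from the spreadsheet");
innerMap.put("col_b", "value added during formatting");
// ... 48 more keys, each matching a column header

mainMap.put("row-1", innerMap); // hypothetical outer key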

Edit: Originally the user uploads a large spreadsheet with 35 of the 50 columns. I then "cleanse" that data with various formatting, and I add my own 15 new pairs into the innerMap for each mainMap entry. I can't directly COPY from the user's source file to my database without cleansing/formatting/adding.

Once I'm done iterating the spreadsheet and building mainMap, that's when I need to insert into my database table efficiently.

Research:

I've read that COPY is the best approach to initially bulk populate a table, however I'm stuck on whether my requirements warrant that command.

This post states that Postgres has a Prepared Statement parameter limit of 34464 for a query.

I'm assuming I need 50 x 50,000 = 2,500,000 parameters in total. At roughly 34,464 parameters per query, that works out to ~73 individual queries!

Questions:


  • Is COPY the proper approach here instead of all these parameters?
  • If so, do I convert the HashMap values into a .sql file, save it on disk on my web app server, and then reference that in my COPY command, and then delete the temp file? Or can I directly pass a concatenated String into it, without risking SQL injection?

This command will run often, hence the need for it to be optimized.

I can't find any examples of converting Java objects into compatible Postgres text file formats, so any feedback helps.

How would you approach this problem?

Additional info:

My table is pre-existing and can't be deleted since it's the back-end for my webapp and multiple users are connected at any given time.

I understand that temporarily removing indexes prior to using COPY can increase performance, but I only need a maximum of 50,000 rows to be inserted or copied at a time, not millions.

StackExchange told me to ask here.

Answer

While Java is certainly not the best option for this kind of ETL, it is possible, and with rather little overhead, using standard INSERT statements and prepared queries:

conn.setAutoCommit(false);
PreparedStatement stmt = conn.prepareStatement(
        "INSERT INTO my_table (col_a, col_b, ...)"
        + " VALUES (?, ?, ...)");
int batchSize = 1000;
int rows = 0;
for (Map<String, String> values : mainMap.values()) {
    int i = 0;
    stmt.setString(++i, values.get("col_a"));
    stmt.setString(++i, values.get("col_b"));
    // ...
    stmt.addBatch(); // add the row to the batch
    if (++rows % batchSize == 0) {
        // batch-sizing: execute...
        stmt.executeBatch();
    }
}

if (rows % batchSize != 0) {
    // a last execution if necessary...
    stmt.executeBatch();
}
conn.commit(); // atomic action - if any record fails, the whole import will fail
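
Since the real table has 50 columns, hard-coding fifty setString calls gets tedious. Purely as an illustration (not part of the original answer), the column list and placeholders could also be built from the inner map's keys, reusing the same conn and mainMap as above, and assuming every inner map shares the same key set and that those keys are trusted, known column names rather than user input:

// Illustrative variant: derive the column list from the inner map's keys.
// Assumes all inner maps have identical keys and those keys are safe column names.
// Needs java.util.List, java.util.ArrayList, java.util.Collections, java.sql.PreparedStatement.
List<String> columns = new ArrayList<>(mainMap.values().iterator().next().keySet());
String sql = "INSERT INTO my_table (" + String.join(", ", columns) + ") VALUES ("
        + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";

conn.setAutoCommit(false);
try (PreparedStatement stmt = conn.prepareStatement(sql)) {
    int batchSize = 1000;
    int rows = 0;
    for (Map<String, String> values : mainMap.values()) {
        for (int i = 0; i < columns.size(); i++) {
            stmt.setString(i + 1, values.get(columns.get(i))); // JDBC parameters are 1-based
        }
        stmt.addBatch();
        if (++rows % batchSize == 0) {
            stmt.executeBatch(); // flush a full batch
        }
    }
    if (rows % batchSize != 0) {
        stmt.executeBatch(); // flush the remainder
    }
}
conn.commit();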

Alternatively, you could write out the Map into a file and use the CopyManager, but I seriously doubt this would be any faster than with the batched inserts (would be different for millions of rows, though).
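
If you do want to try the COPY route, the pgjdbc driver's CopyManager can stream rows straight from memory, so no temporary file is needed. The following is only a rough sketch under simplifying assumptions: it uses the default tab-delimited COPY text format, assumes the values contain no tabs, newlines, backslashes, or NULLs (those would need escaping per the COPY text-format rules), and elides the remaining columns the same way the snippet above does:

// Illustrative sketch only; needs org.postgresql.PGConnection,
// org.postgresql.copy.CopyManager and java.io.StringReader.
CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();

StringBuilder sb = new StringBuilder();
for (Map<String, String> values : mainMap.values()) {
    sb.append(values.get("col_a")).append('\t')  // tab-delimited text format
      .append(values.get("col_b"))               // ... remaining columns
      .append('\n');
}

copyManager.copyIn(
        "COPY my_table (col_a, col_b, ...) FROM STDIN",
        new StringReader(sb.toString()));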
