JDBC批处理操作理解 [英] JDBC batch operations understanding
问题描述
我在我的应用程序中使用 Hibernate ORM 和 PostgreSQL,有时我使用批处理操作.起初我不明白为什么在batch size = 25的日志中,会生成25个查询,一开始以为它不能正常工作.但是之后我查看了pg驱动的源代码,在PgStatement类中发现了以下几行:
I use Hibernate ORM and PostgreSQL in my application, and sometimes i use batch operations. And at first I didn't understand why in the logs with size of the batch = 25, 25 queries are generated, and at first thought that it does not work correctly. But after that I looked at the source code of the pg driver and found the following lines in the PgStatement class:
public int[] executeBatch() throws SQLException {
this.checkClosed();
this.closeForNextExecution();
if (this.batchStatements != null && !this.batchStatements.isEmpty()) {
this.transformQueriesAndParameters();
//confuses next line, because we have array of identical queries
Query[] queries = (Query[])this.batchStatements.toArray(new Query[0]);
ParameterList[] parameterLists =
(ParameterList[])this.batchParameters.toArray(new ParameterList[0]);
this.batchStatements.clear();
this.batchParameters.clear();
在 PgPreparedStatement 类中
and in PgPreparedStatement class
public void addBatch() throws SQLException {
checkClosed();
if (batchStatements == null) {
batchStatements = new ArrayList<Query>();
batchParameters = new ArrayList<ParameterList>();
}
batchParameters.add(preparedParameters.copy());
Query query = preparedQuery.query;
//confuses next line
if (!(query instanceof BatchedQuery) || batchStatements.isEmpty()) {
batchStatements.add(query);
}
}
我注意到事实证明,如果批次的大小变为 25,发送了 25 个带有附加参数的查询.
I noticed that it turns out that if the size of the batch goes 25, 25 queries are sent with the parameters attached to them.
数据库的日志证实了这一点,例如:
Logs of the database confirm this, for example:
2017-12-06 01:22:08.023 MSK [18402] postgres@buzzfactory СООБЩЕНИЕ: выполнение S_3: BEGIN
2017-12-06 01:22:08.024 MSK [18402] postgres@buzzfactory СООБЩЕНИЕ: выполнение S_4: select nextval ('tests_id_seq')
2017-12-06 01:22:08.041 MSK [18402] postgres@buzzfactory СООБЩЕНИЕ: выполнение S_2: insert into tests (name, id) values ($1, $2)
2017-12-06 01:22:08.041 MSK [18402] postgres@buzzfactory ПОДРОБНОСТИ: параметры: $1 = 'test', $2 = '1'
2017-12-06 01:22:08.041 MSK [18402] postgres@buzzfactory СООБЩЕНИЕ: выполнение S_2: insert into tests (name, id) values ($1, $2)
2017-12-06 01:22:08.041 MSK [18402] postgres@buzzfactory ПОДРОБНОСТИ: параметры: $1 = 'test', $2 = '2'
...
x23 queries with parameters
...
2017-12-06 01:22:08.063 MSK [18402] postgres@buzzfactory СООБЩЕНИЕ: выполнение S_5: COMMIT
但我认为必须使用包含 25 个参数的数组来执行一个查询.或者我不明白批量插入如何与准备好的语句一起工作?为什么重复一个查询 n 次?
But i thought one query must be executed with an array of 25 parameters. Or I don't understand how batch inserts work with a prepared statement? Why duplicate one query n times?
毕竟,我试图在这个地方调试我的查询
After all, i tried to debug my queries on this place
if (!(query instanceof BatchedQuery) || batchStatements.isEmpty()) {
并注意到我的查询始终是 SimpleQuery 的实例,而不是 BatchedQuery.也许这是解决问题的方法?我找不到有关 BatchedQuery 的信息
and noticed that my queries are always instance of SimpleQuery instead of BatchedQuery. Maybe this is the solution to the problem? Information about BatchedQuery i couldn't find
推荐答案
可能涉及到各种批处理,我将介绍其中的 PostgreSQL JDBC 驱动程序 (pgjdbc) 部分.
There might be various kinds of batching involved, and I would cover PostgreSQL JDBC driver (pgjdbc) part of it.
TL;DR:在使用批处理 API 的情况下,pgjdbc 确实使用较少的网络往返.BatchedQuery
仅在 reWriteBatchedInserts=true
传递给 pgjdbc 连接设置时使用.
TL;DR: pgjdbc does use less network roundrips in case batch API is used. BatchedQuery
is used only if reWriteBatchedInserts=true
is passed to the pgjdbc connection settings.
您可能会发现 https://www.slideshare.net/VladimirSitnikv/postgresql-and-jdbc-striving-for-high-performance 相关(幻灯片 44,...)
You might find https://www.slideshare.net/VladimirSitnikv/postgresql-and-jdbc-striving-for-high-performance relevant (slide 44,...)
在查询执行方面,网络延迟通常是所用时间的重要组成部分.
When it comes to query execution, network latency is often a significant part of the elapsed time.
假设情况是插入 10 行.
Suppose the case is to insert 10 rows.
没有批处理(例如,在循环中只是
PreparedStatement#execute
).驱动程序将执行以下操作
No batching (e.g. just
PreparedStatement#execute
in a loop). The driver would perform the following
execute query
sync <-- wait for the response from the DB
execute query
sync <-- wait for the response from the DB
execute query
sync <-- wait for the response from the DB
...
在等待数据库"中花费了大量时间
Notable time would be spent in the "waiting for the DB"
JDBC 批处理 API.即 PreparedStatement#addBatch()
使驱动程序能够在单个网络往返中发送多个查询执行".然而,当前的实现仍然会将大批量拆分为小批量以避免 TCP 死锁.
JDBC batch API. That is PreparedStatement#addBatch()
enables driver to send multiple "query executions" in a single network roundtrip. Current implementation, however would still split large batches into smaller ones to avoid TCP deadlock.
行动会更好:
execute query
...
execute query
execute query
execute query
sync <-- wait for the response from the DB
请注意,即使使用 #addBatch
,也存在执行查询"命令的开销.单独处理每条消息确实需要服务器花费大量时间.
Note, that even with #addBatch
, there's overhead of "execute query" commands. It does take server notable time to process each message individually.
减少查询次数的方法之一是使用多值插入.例如:
One of the ways to reduce the number of queries is to use multi-values insert. For instance:
insert into tab(a,b,c) values (?,?,?), (?,?,?), ..., (?,?,?)
这个 PostgreSQL 可以一次插入多行.缺点是您没有详细的(每行)错误消息.目前 Hibernate 没有实现多值插入.
This PostgreSQL enables to insert multiple rows at once. The drawback is you don't have detailed (per-row) error message. Currently Hibernate does not implement multi-values insert.
然而,从 9.4.1209 (2016-07-15) 开始,pgjdbc 可以将常规批量插入重写为多值.
However pgjdbc can rewrite regular batch inserts into multi-values on the fly since 9.4.1209 (2016-07-15).
为了激活多值重写,您需要添加reWriteBatchedInserts=true
连接属性.该功能最初是在 https://github.com/pgjdbc/pgjdbc/pull/491一个>
In order to activate multi-values rewrite, you need to add reWriteBatchedInserts=true
connection property. The feature was initially developed in https://github.com/pgjdbc/pgjdbc/pull/491
使用 2 条语句来插入 10 行已经足够聪明了.第一个是8值语句,第二个是2值语句.使用 2 的幂使 pgjdbc 能够保持不同语句的数量正常,这提高了性能,因为常用语句是服务器准备好的(参见 PostgreSQL 服务器端准备好的语句的生命周期是多少 )
It is smart enough to use 2 statements in order to insert 10 rows. The first one is 8-valued statement, and the second one is 2-valued statement. Usage of powers of two enables pgjdbc to keep the number of distinct statements sane, and that improves performance as often-used statements are server-prepared (see What's the life span of a PostgreSQL server-side prepared statement )
BatchedQuery
表示这种多值语句,因此您将看到该类仅在 reWriteBatchedInserts=true
情况下使用.
BatchedQuery
is representing that kind of multi-valued statements, so you will see that class used in reWriteBatchedInserts=true
case only.
该功能的缺点可能包括:作为批处理结果"的细节较低.例如,常规批处理为您提供每条语句行数",但是在多值情况下,您只会获得语句完成"状态.最重要的是,即时重写器可能无法解析某些 SQL 语句(例如 https://github.com/pgjdbc/pgjdbc/issues/1045).
The drawbacks of the feature might include: lower details as the "batch result". For instance, regular batch gives you "per statement rowcount", however in multi-values case you just get "statement completed" status. On top of that, on-the-fly rewritter might fail to parse certain SQL statements (e.g. https://github.com/pgjdbc/pgjdbc/issues/1045 ).
这篇关于JDBC批处理操作理解的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!