How to speed up table-retrieval with MATLAB and JDBC?


Question

I am accessing a PostgreSQL 8.4 database through JDBC, called from MATLAB. The tables I am interested in basically consist of various columns of different datatypes; rows are selected by their time-stamps.

Since I want to retrieve large amounts of data, I am looking for a way to make the request faster than it is right now.

What I am doing at the moment is the following: first I establish a connection to the database and call it DBConn. The next step is to prepare a SELECT statement and execute it:

QUERYSTRING = ['SELECT * FROM ' TABLENAME ...
  ' WHERE ts BETWEEN ''' TIMESTART ''' AND ''' TIMEEND ''''];

QUERY = DBConn.prepareStatement(QUERYSTRING);
RESULTSET = QUERY.executeQuery();

Then I store the column types in the variable COLTYPE (1 for FLOAT, -1 for BOOLEAN, and 0 for the rest - nearly all columns contain FLOAT). The next step is to process every row, column by column, and retrieve the data with the corresponding getter methods. FNAMES contains the field names of the table.

m = 0; % variable containing the current row number

while RESULTSET.next()
  m = m+1;

  for n = 1:length(FNAMES)

    if COLTYPE(n)==1 % Columntype is a FLOAT
      DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getDouble(n);
    elseif COLTYPE(n)==-1 % Columntype is a BOOLEAN
      DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getBoolean(n);
    else
      DATA{1}.(FNAMES{n}){m,1} = char(RESULTSET.getString(n));
    end

  end

end

When I am done with my request I close the statement and the connection.

I don't have the MATLAB Database Toolbox, so I am looking for solutions without it.

I understand that it is very inefficient to request the data of every single field. Still, I have failed to find a way to get more data at once - for example, multiple rows of the same column. Is there any way to do so? Do you have other suggestions for speeding up the request?

Answer

Summary

To speed this up, push the loops, and then your column datatype conversion, down into the Java layer, using the Database Toolbox or custom Java code. The Matlab-to-Java method call overhead is probably what's killing you, and there's no way of doing block fetches (multiple rows in one call) with plain JDBC. Make sure the knobs on the JDBC driver you're using are set appropriately. And then optimize the transfer of expensive column datatypes like strings and dates.

(NB: I haven't done this with Postgres, but have with other DBMSes, and this will apply to Postgres too because most of it is about the JDBC and Matlab layers above it.)

The most straightforward way to get this faster is to push the loops over the rows and columns down into the Java layer, and have it return blocks of data (e.g. 100 or 1000 rows at a time) to the Matlab layer. There is substantial per-call overhead in invoking a Java method from Matlab, and looping over JDBC calls in M-code incurs it over and over (see Is MATLAB OOP slow or am I doing something wrong? - full disclosure: that's my answer). If you're calling JDBC from M-code like that, you're paying that overhead on every single column of every row, and that's probably the majority of your execution time right now.

The JDBC API itself does not support "block cursors" like ODBC does, so you need to get that loop down into the Java layer. Using the Database Toolbox like Oleg suggests is one way to do it, since it implements its lower-level cursor machinery in Java. (Probably for precisely this reason.) But if you can't take a Database Toolbox dependency, you can write your own thin Java layer to do so and call that from your M-code (probably through a Matlab class that is coupled to your custom Java code and knows how to interact with it). Make the Java code and Matlab code share a block size, buffer the whole block on the Java side (using primitive arrays instead of object arrays for the column buffers wherever possible), and have your M-code fetch the result set in batches, buffering those blocks in cell arrays of primitive column arrays, and then concatenate them together.

Pseudocode for the Matlab layer:

colBufs = repmat( {{}}, [1 nCols] );
while (cursor.hasMore())
    cursor.fetchBlock();
    for iCol = 1:nCols
        colBufs{iCol}{end+1} = cursor.getBlock(iCol); % should come back as primitive
    end
end
for iCol = 1:nCols
    colResults{iCol} = cat(2, colBufs{iCol}{:});
end
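A minimal Java-side sketch of the block cursor that pseudocode talks to might look like the following. All the names (BlockCursor, fetchBlock, getBlock) are invented for illustration, and a real version would pull rows from a java.sql.ResultSet and handle mixed column types; here the row source is a plain Iterator of double[] so the shape of the idea stays self-contained:

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative Java-side block cursor: buffers rows into primitive
// per-column arrays so Matlab makes one call per column per block,
// not one call per cell.
public class BlockCursor {
    private final Iterator<double[]> rows; // stand-in for ResultSet iteration
    private final int nCols;
    private final int blockSize;
    private double[][] block;              // block[col] is a primitive column buffer
    private int filled;                    // rows actually buffered in this block

    public BlockCursor(Iterator<double[]> rows, int nCols, int blockSize) {
        this.rows = rows;
        this.nCols = nCols;
        this.blockSize = blockSize;
    }

    public boolean hasMore() {
        return rows.hasNext();
    }

    // Buffer up to blockSize rows into per-column primitive arrays.
    public void fetchBlock() {
        block = new double[nCols][blockSize];
        filled = 0;
        while (filled < blockSize && rows.hasNext()) {
            double[] row = rows.next();
            for (int c = 0; c < nCols; c++) {
                block[c][filled] = row[c];
            }
            filled++;
        }
    }

    // Returns one whole column of the current block in a single call.
    public double[] getBlock(int col) {
        return Arrays.copyOf(block[col], filled);
    }
}
```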

Twiddle JDBC DBMS driver knobs

Make sure your code exposes the DBMS-specific JDBC connection parameters to your M-code layer, and use them. Read the doco for your specific DBMS and fiddle with them appropriately. For example, Oracle's JDBC driver defaults its fetch buffer size (the one inside the JDBC driver, not the one you're building) to about 10 rows, which is way too small for typical data analysis set sizes. (It incurs a network round trip to the db every time the buffer fills.) Simply setting it to 1,000 or 10,000 rows is like turning on a "Go Fast" switch that shipped set to "off". Benchmark your speed with sample data sets and graph the results to pick appropriate settings.
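As a concrete example for Postgres: as far as I recall, the pgjdbc driver only honors the fetch-size hint (via a server-side cursor) when autocommit is off, so a helper along these lines is worth having; check the pgjdbc docs for your version. The class name and the 10,000-row value are illustrative, not prescriptive:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: configure a statement for bulk retrieval before executing it.
public class FetchTuning {
    public static PreparedStatement prepareBulkSelect(Connection conn, String sql,
                                                      int fetchRows) throws SQLException {
        conn.setAutoCommit(false);   // pgjdbc requirement for cursor-based fetching
        PreparedStatement ps = conn.prepareStatement(sql);
        ps.setFetchSize(fetchRows);  // rows per network round trip, not a total limit
        return ps;
    }
}
```

Called as e.g. `prepareBulkSelect(DBConn, QUERYSTRING, 10000)`; remember to commit or roll back afterwards, since autocommit is now off.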

In addition to giving you block fetch functionality, writing custom Java code opens up the possibility of doing optimized type conversion for particular column types. Once you've got the per-row and per-cell Java call overhead handled, your bottlenecks are probably going to be in date parsing and in passing strings back from Java to Matlab.

Push the date parsing down into Java by having it convert SQL date types to Matlab datenums (as Java doubles, with a column type indicator) as they're being buffered, maybe using a cache to avoid recalculating repeated dates in the same set. (Watch out for TimeZone issues. Consider Joda-Time.) Convert any BigDecimals to double on the Java side.

And cellstrs are a big bottleneck - a single char column can swamp the cost of several float columns. Return narrow CHAR columns as 2-d chars instead of cellstrs if you can (by returning a big Java char[] and then using reshape()), converting to cellstr on the Matlab side if necessary. (Returning a Java String[] converts to cellstr less efficiently.) And you can optimize the retrieval of low-cardinality character columns by passing them back as "symbols": on the Java side, build up a list of the unique string values and map them to numeric codes, and return the strings as a primitive array of numeric codes along with that number -> string map; convert the distinct strings to cellstr on the Matlab side and then use indexing to expand them to the full array. This will be faster and save you a lot of memory, too, since the copy-on-write optimization will reuse the same primitive char data for repeated string values. Or convert them to categorical or ordinal objects instead of cellstrs, if appropriate. This symbol optimization can be a big win if you use a lot of character data and have large result sets, because then your string columns transfer at roughly primitive numeric speed, which is substantially faster, and it reduces cellstr's typical memory fragmentation. (The Database Toolbox may support some of this now, too; I haven't actually used it in a couple of years.)
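The symbol encoding for low-cardinality string columns can be sketched on the Java side like this (names are hypothetical; a real version would run inside the block-fetch loop):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// "Symbol" trick: each row of a low-cardinality string column is returned
// to Matlab as an int code, plus one small code -> string table for the
// whole column, instead of one Java String per row.
public class SymbolEncoder {
    private final Map<String, Integer> codes = new HashMap<>();
    private final List<String> table = new ArrayList<>();

    // Map a string to its numeric code, registering it on first sight.
    public int encode(String s) {
        Integer c = codes.get(s);
        if (c == null) {
            c = table.size();
            codes.put(s, c);
            table.add(s);
        }
        return c;
    }

    // The unique-string table; Matlab converts this to a small cellstr
    // once, then expands the code array by indexing into it.
    public List<String> table() {
        return table;
    }
}
```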

After that, depending on your DBMS, you could squeeze out a bit more speed by including mappings for all the numeric column type variants your DBMS supports to appropriate numeric types in Matlab, and experimenting with using them in your schema or doing conversions inside your SQL query. For example, Oracle's BINARY_DOUBLE can be a bit faster than their normal NUMERIC on a full trip through a db/Matlab stack like this. YMMV.

You could consider optimizing your schema for this use case by replacing string and date columns with cheaper numeric identifiers, possibly as foreign keys to separate lookup tables to resolve them to the original strings and dates. Lookups could be cached client-side with enough schema knowledge.

If you want to go crazy, you can use multithreading at the Java level to have it asynchronously prefetch and parse the next block of results on separate Java worker thread(s) while you're doing the M-code level processing for the previous block, possibly parallelizing per-column date and string processing if you have a large cursor block size. This really bumps up the difficulty, though, and ideally it is only a small performance win, because you've already pushed the expensive data processing down into the Java layer. Save this for last. And check the JDBC driver doco; it may already effectively be doing this for you.
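If you do go that route, the core of the double-buffering idea is small. This is a sketch using a single worker thread; the class shape is hypothetical, and "fetch" would wrap the block-fetch call against the driver:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Double-buffered prefetch: while the caller consumes block N, a worker
// thread is already fetching block N+1.
public class Prefetcher<T> {
    private final ExecutorService exec = Executors.newSingleThreadExecutor();
    private final Supplier<T> fetch;
    private Future<T> pending;

    public Prefetcher(Supplier<T> fetch) {
        this.fetch = fetch;
        this.pending = exec.submit((Callable<T>) fetch::get); // start on block 1
    }

    // Hand back the block that is (hopefully) already fetched, and kick
    // off the fetch of the next one before returning.
    public T next() throws Exception {
        T block = pending.get();
        pending = exec.submit((Callable<T>) fetch::get);
        return block;
    }

    public void shutdown() {
        exec.shutdownNow();
    }
}
```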

If you're not willing to write custom Java code, you can still get some speedup by changing the syntax of the Java method calls from obj.method(...) to method(obj, ...). E.g. getDouble(RESULTSET, n). It's just a weird Matlab OOP quirk. But this won't be much of a win because you're still paying for the Java/Matlab data conversion on each call.

Also, consider changing your code so you can use ? placeholders and bound parameters in your SQL queries, instead of interpolating strings as SQL literals. If you're doing a custom Java layer, defining your own @connection and @preparedstatement M-code classes is a decent way to do this. So it looks like this:

QUERYSTRING = ['SELECT * FROM ' TABLENAME ' WHERE ts BETWEEN ? AND ?'];
query = conn.prepare(QUERYSTRING);
rslt = query.exec(startTime, endTime);

This will give you better type safety and more readable code, and may also cut down on the server-side overhead of query parsing. This won't give you much speed-up in a scenario with just a few clients, but it'll make coding easier.
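Underneath those M-code classes, the Java side boils down to a standard PreparedStatement. Note that the table name still has to be spliced into the SQL text, because JDBC can only bind values, not identifiers; the timestamps, though, travel as typed parameters rather than string literals. Class and method names here are illustrative:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

// Sketch of the bound-parameter range query behind the M-code wrapper.
public class BoundQuery {
    public static PreparedStatement prepareRange(Connection conn, String table,
                                                 Timestamp start, Timestamp end)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM " + table + " WHERE ts BETWEEN ? AND ?");
        ps.setTimestamp(1, start);  // bound as a typed value, not SQL text
        ps.setTimestamp(2, end);
        return ps;
    }
}
```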

Profile and test your code regularly (at both the M-code and Java level) to make sure your bottlenecks are where you think they are, and to see if there are parameters that need to be adjusted based on your data set size, both in terms of row counts and column counts and types. I also like to build in some instrumentation and logging at both the Matlab and Java layer so you can easily get performance measurements (e.g. have it summarize how much time it spent parsing different column types, how much in the Java layer and how much in the Matlab layer, and how much waiting on the server's responses (probably not much due to pipelining, but you never know)). If your DBMS exposes its internal instrumentation, maybe pull that in too, so you can see where you're spending your server-side time.
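For the instrumentation, even something this small is enough to see where the column-parsing time goes; the type indices are whatever your COLTYPE-style scheme uses, and the class name is hypothetical:

```java
// Tiny per-column-type timing accumulator for the Java layer.
public class ColumnTimer {
    private final long[] nanos;
    private final long[] hits;

    public ColumnTimer(int nTypes) {
        nanos = new long[nTypes];
        hits = new long[nTypes];
    }

    public long start() {
        return System.nanoTime();
    }

    // Charge the elapsed time since t0 to the given column-type bucket.
    public void stop(int type, long t0) {
        nanos[type] += System.nanoTime() - t0;
        hits[type]++;
    }

    public String summary(int type) {
        return String.format("type %d: %d calls, %.3f ms total",
                type, hits[type], nanos[type] / 1e6);
    }
}
```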
