以极高的速度获取行 [英] Fetching rows at extremely high speed

查看:116
本文介绍了以极高的速度获取行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Oracle中有一个非常大的表(亿万行,包含数字和字符串),我需要读取该表的所有内容,对其进行格式化并写入文件或任何其他资源. 通常,我的解决方案如下所示:

I have very large table (hundreds of millions rows, contains numbers and strings) in Oracle and I need to read all content of this table, format it and write to file or any other resource. Generally my solution looks like this:

package my.odp;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.lang.Throwable;
import java.sql.*;


public class Main {
public static volatile boolean finished = false;

public static void main(final String[] args) throws InterruptedException {
    final ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);
    final Thread writeWorker = new Thread("ODP Writer") {
        public void run() {
            try {
                File targetFile = new File(args[0]);
                FileWriter fileWriter = new FileWriter(targetFile);
                BufferedWriter writer = new BufferedWriter(fileWriter);
                String str;
                try {
                    while (!finished) {
                        str = queue.poll(200, TimeUnit.MILLISECONDS);
                        if (str == null) {
                            Thread.sleep(50);
                            continue;
                        }
                        writer.write(str);
                        writer.write('\n');
                    }
                } catch (InterruptedException e) {
                    writer.close();
                    return;
                }
            }
            catch (Throwable e) {
                e.printStackTrace();
                return;
            }
        }
    };

    final Thread readerThread = new Thread("ODP Reader") {
        public void run() {
            try {
                Class.forName("oracle.jdbc.OracleDriver");
                Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", "user", "pass");

                Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
                stmt.setFetchSize(500000);
                ResultSet rs = stmt.executeQuery("select * from src_schema.big_table_view");
                System.out.println("Fetching result");
                while (rs.next()) {
                    StringBuilder sb = new StringBuilder();
                    sb.append(rs.getString(1)).append('\t');//OWNER
                    sb.append(rs.getString(2)).append('\t');//OBJECT_NAME
                    sb.append(rs.getString(3)).append('\t');//SUBOBJECT_NAME
                    sb.append(rs.getLong(4)).append('\t');//OBJECT_ID
                    sb.append(rs.getLong(5)).append('\t');//DATA_OBJECT_ID
                    sb.append(rs.getString(6)).append('\t');//OBJECT_TYPE
                    sb.append(rs.getString(7)).append('\t');//CREATED
                    sb.append(rs.getString(8)).append('\t');//LAST_DDL_TIME
                    sb.append(rs.getString(9)).append('\t');//TIMESTAMP
                    sb.append(rs.getString(10)).append('\t');//STATUS
                    sb.append(rs.getString(11)).append('\t');//TEMPORARY
                    sb.append(rs.getString(12)).append('\t');//GENERATED
                    sb.append(rs.getString(13)).append('\t');//SECONDARY
                    sb.append(rs.getString(14)).append('\t');//NAMESPACE
                    sb.append(rs.getString(15));//EDITION_NAME
                    queue.put(sb.toString());
                }

                rs.close();
                stmt.close();
                conn.close();
                finished = true;
            } catch (Throwable e) {
                e.printStackTrace();
                return;
            }
        }
    };
    long startTime = System.currentTimeMillis();
    writeWorker.start();
    readerThread.start();
    System.out.println("Waiting for join..");
    writeWorker.join();
    System.out.println("Exit:"+ (System.currentTimeMillis() - startTime));
}

}

有两个线程:一个用于从结果集中获取行,另一个用于写入字符串值.测得的加载速度约为10Mb/s,就我而言,我需要使其速度提高10倍. Profiler显示最耗时的方法是

There're two threads: one for fetching rows from result set and one for writing string values. Measured loading speed was about 10Mb/s and in my case I need to make it 10 times faster. Profiler shows that the most time consuming methods are

oracle.jdbc.driver.OracleResultSetImpl.getString()

oracle.net.ns.Packet.receive()

您对如何使jdbc更快地加载数据有任何想法吗? 关于查询优化,字符串加载优化,调整JDBC驱动程序或使用其他驱动程序(直接使用oracle JDBC实现,调整Oracle)的任何构想都将受到赞赏.

Do you have any ideas how to make jdbc to load data much faster? Any ideas about query optimisation, string loading optimisation, tweaking JDBC driver or using another one, direct using oracle JDBC implementations, tweaking Oracle is appreciated.

更新: 我在下面汇总并列出了讨论结果:

UPDATE: I compiled and listed discussion results below:

  1. 除了与Oracle db的连接之外,我无权访问DBMS服务器,并且服务器无法连接至任何外部资源.任何使用服务器或远程文件系统的转储和提取工具均无法应用,也无法在服务器上安装和使用任何外部Java或PL/SQL例程.只有连接才能执行查询-仅此而已.

  1. I've no access to DBMS server except connection to Oracle db and server can't connect to any external resource. Any dump and extraction utils which use server's or remote file system can't be applied, also it's impossible to install and use any external java or PL/SQL routines on server. Only connetion to execute queries - thats all.

我使用了探查器并在Oracle JDBC驱动程序中进行了挖掘.我发现最昂贵的操作是读取数据,即Socket.read().所有字符串字段均表示为一个char数组,并且几乎不影响性能.通常,我使用事件探查器检查了整个应用程序,而Socket.read()绝对是最昂贵的操作.提取字段,构建字符串,写入数据几乎不消耗任何东西.问题仅在于读取数据.

I used profiler and digged in Oracle JDBC driver. I found out that the most expencive operation is reading data, i.e. Socket.read(). All string fields are represented as one char array and have almost no influence on perfomance. Generally, I checked with profiler the whole app and Socket.read() is definitely the most expensive operation. Extracting fields, building strings, writing data consume almost nothing. The problem is only in reading data.

服务器端数据表示上的任何优化都没有真正的效果.串联字符串和转换时间戳不会提高性能.

Any optimisations in data representation on server side don't have real effect. Concatenating strings and converting timestamps have no result for performance.

App被重写为具有几个读取器线程,这些线程将准备就绪的数据放入写入器队列中.每个线程都有自己的连接,不使用任何池,因为它们减慢了提取速度(我使用了oracle建议的UCP池,它消耗了大约10%的执行时间,因此我放弃了).结果集fetchSize也增加了,因为从默认值(10)切换到50000会使性能增长高达50%.

App was rewritten to have several reader threads which put ready data in writer queue. Each thread has its own connection, no pools are used because they slow down the extraction (I used UCP pool which is recommended by oracle and it consumes about 10% of execution time, so i gave up from it). Also result set fetchSize was increased because switching from default value (10) to 50000 gives up to 50% perfomance growth.

我测试了多线程版本在4个读取线程中的工作方式,发现增加的读者数只会减慢提取速度. 我尝试启动两个实例,每个实例都有两个读取器,并且两个实例都与单个实例同时工作,即双倍数据提取所需的时间与单个实例相同.不知道为什么会这样,但是看起来oracle驱动程序有一些性能限制.具有4个独立连接的应用程序要比具有2个连接的2个应用程序实例慢. (使用Profiler来确保驱动程序的Socket.read()仍然是主要问题,所有其他部分在多线程模式下均能正常工作).

I tested how multithreaded version works with 4 reading threads and found out that increasing readers count only slows the extraction. I tried to launch 2 instances where each of them has two readers and both worked the same time as single instance, i.e. double data extraction requires same time as single. Don't know why this happens, but it looks like oracle driver have some performance limitations. Application with 4 independent connections works slower then 2 App instances with 2 connections. (Profiler was used to ensure that driver's Socket.read() is still the main issue, all other parts works fine in multithreaded mode).

我尝试使用SAS提取所有数据,并且它执行相同提取的速度是JDBC的2倍,两者都使用到Oracle的单一连接并且不能使用任何转储操作. Oracle确保JDBC瘦驱动程序与本机驱动程序一样快.

I tried to fetch all data with SAS and it can perform same extraction 2 times faster then JDBC, both used single connection to Oracle and can't use any dump operations. Oracle ensures that JDBC thin driver is as fast as native one..

也许Oracle还有另一种方法可以通过ODBC或其他方式对远程主机进行快速提取?

Maybe Oracle have another ways to perform fast extraction to remote host through ODBC or smth else?

推荐答案

假定您已经检查了基本的网络内容,例如接口,防火墙,代理以及数据库服务器的硬件元素.

Assuming you have already checked the basic network stuff like interfaces, firewalls, proxies, as also the DB server's hardware elements.

选项1:

而不是:

Class.forName("oracle.jdbc.OracleDriver");
Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", "user", "pass");

尝试使用:

OracleDataSource ods = new OracleDataSource();
java.util.Properties prop = new java.util.Properties();
prop.setProperty("MinLimit", "2");
prop.setProperty("MaxLimit", "10");
String url = "jdbc:oracle:oci8:@//xxx.xxx.xxx.xxx:1521/orcl";
ods.setURL(url);
ods.setUser("USER");
ods.setPassword("PWD");
ods.setConnectionCachingEnabled(true);
ods.setConnectionCacheProperties (prop);
ods.setConnectionCacheName("ImplicitCache01");

更多详细信息此处

选项2:提取大小

正如斯蒂芬强烈指出的那样,fetchsize似乎太大了.

As strongly pointed by Stephen, the fetchsize seems too large.

并且,对于500,000的获取大小,您的-Xms和-Xmx是什么.另外,在事件探查器中,最大的堆大小是多少?

And, for fetch size of 500,000 what is your -Xms and -Xmx. Also, in profiler, whats the highest heap size?

选项3:数据库

  • 检查src_schema.big_table_view

这是工具还是应用程序系统.如果只是一种工具,您可以 基于数据库系统添加并行度,索引提示,分区等 功能

Is this a tool or an application system. If just a tool, you could add parallel degrees, index hints, partitioning etc based on DB systems capabilities

选项4:线程

n<应用程序服务器上的内核数

Say n < Number of cores on application server

您可以启动writer的n线程,每个线程都配置为处理特定的存储桶,例如线程1处理0到10000,写入n个不同的文件,并且所有的任务完成后,加入后,最好使用低级OS命令将文件合并在一起.

You can start n Threads of writer, each configured to process a certain bucket e.g. thread1 processes 0 to 10000, writing to n different files, and once all theads done, post join, merge the files together preferably using a low level OS command.

也就是说,所有这些都不应该像现在这样是预定义的代码. 'n',并且应该在运行时计算存储桶.并且创建的线程数比系统仅支持的更多.

That said, all this should never be pre-defined code like its now. 'n' and the buckets should be calculated at runtime. And creating number of threads more than what your system supports only screws up.

选项5:

代替

select * from src_schema.big_table_view

您可以使用

SELECT column1||CHR(9)||column2||CHR(9).....||columnN FROM src_schema.big_table_view

这避免了创建500000 StringBuildersStrings. (假设不涉及其他复杂格式). CHR(9)是制表符.

This avoids creating 500000 StringBuilders and Strings. (Assuming no other complex formatting involved). CHR(9) is the tab character.

选项6:

同时,您还可以向DBA查询任何数据库系统问题,并通过 Oracle支持提出SR.

Meantime, you could also check with your DBA for any DB system issues and raise an SR with Oracle support.

这篇关于以极高的速度获取行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆