Multi-threading database reading


Problem description

In our Java application I need to read around 80 million records from an Oracle database, and I am trying to redesign the multi-threaded program that does this. Currently we use a Java 5 thread pool with 10 threads reading the database in parallel, partitioned on a primary-key pattern: each thread reads a different pattern, such as 001* or 002*.
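
For illustration, a minimal sketch of the setup described above, with placeholder names (record_table, payload, processRow, connection settings) that are not from the original post: a fixed pool of 10 threads, each scanning one key prefix over its own JDBC connection.

import java.sql.*;
import java.util.concurrent.*;

public class PatternReader {
    // Placeholder JDBC settings; adjust to the real environment.
    private static final String URL = "jdbc:oracle:thin:@//dbhost:1521/ORCL";
    private static final String USER = "app", PASS = "secret";

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 1; i <= 10; i++) {
            final String prefix = String.format("%03d", i);   // hypothetical prefixes "001".."010"
            pool.submit(() -> readPattern(prefix));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Each task opens its own connection and scans one key prefix.
    static void readPattern(String prefix) {
        String sql = "SELECT id, payload FROM record_table WHERE id LIKE ?";
        try (Connection con = DriverManager.getConnection(URL, USER, PASS);
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setFetchSize(1000);            // fetch rows in batches rather than one round trip per row
            ps.setString(1, prefix + "%");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    processRow(rs.getString("id"), rs.getString("payload"));
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    static void processRow(String id, String payload) { /* application-specific work */ }
}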

How can I improve the performance of this program? I am considering a design in which a lead thread reads the database and delegates the processing to child threads. In our existing design, the different threads access the table through 10 JDBC connections; with the new approach only one thread would read the table.

We have a different select statement for each thread, built as "select count(*) from " + table + " where id like '" + format + "%'".
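
As a side note, the pattern in such a statement can be passed as a bind variable instead of being concatenated (only the table name has to stay inlined, since identifiers cannot be bound). A small sketch, with record_table-style placeholders:

import java.sql.*;

class PatternCount {
    // Count how many rows a given thread's prefix would cover.
    static long countForPattern(Connection con, String table, String format) throws SQLException {
        String sql = "SELECT COUNT(*) FROM " + table + " WHERE id LIKE ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, format + "%");
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}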

OK, reading by a rowid pattern sounds good, but is it better to read by rowid or by rownum?

Can anybody please give the pros and cons of the new approach, and is there any other way we can implement it?

Recommended answer

Network

First of all, since using rowid and rownum is vendor lock-in anyway, you should consider using database stored routines. That could significantly reduce the overhead of transmitting data from the database to the application server (especially if they are on different machines connected over a network).
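
For example, if the per-row work can be pushed into a PL/SQL routine, the Java side might reduce to a call like the sketch below. The procedure process_records and its parameters are hypothetical; the point is that only a small summary crosses the network instead of 80 million rows:

import java.sql.*;

class StoredRoutineCall {
    // Ask the database to process one slice of keys server-side and return only a summary.
    // process_records is a hypothetical PL/SQL procedure, not part of the original post.
    static int processSlice(String url, String user, String pass, String prefix) throws SQLException {
        try (Connection con = DriverManager.getConnection(url, user, pass);
             CallableStatement cs = con.prepareCall("{call process_records(?, ?)}")) {
            cs.setString(1, prefix);                     // which slice of keys to process
            cs.registerOutParameter(2, Types.INTEGER);   // e.g. number of rows handled
            cs.execute();
            return cs.getInt(2);                         // only this summary travels back
        }
    }
}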

Considering that you have 80 million records to transmit, that could be the biggest performance boost for you, though it depends on the kind of work your threads do.

Obviously, increasing bandwidth would also help with any networking issues.

Before making changes to the code, check the hard-drive load while the tasks are running; perhaps it simply cannot handle that much I/O (10 threads reading simultaneously).

Migrating to SSD/RAID or a clustered database might solve the issue, whereas changing the way you access the database would not, in that case.

Multithreading can solve CPU problems, but databases mostly depend on the disk subsystem.

There are a couple of problems you might face if you implement this using rowid and rownum.

1) rownum is generated on the fly for each query's result set. So if the query has no explicit ordering, it is possible that some records get a different rownum every time you run the query.

For example, you run it the first time and get results like this:

some_column | rownum
____________|________
     A      |    1
     B      |    2
     C      |    3

Then you run it a second time and, since you don't have explicit sorting, the DBMS (for reasons known only to itself) decides to return results like this:

some_column | rownum
____________|________
     C      |    1
     A      |    2
     B      |    3

2) Point 1 also implies that if you filter results on rownum, the database will generate a temporary result set with ALL rows and only then filter it.
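
For reference, a hedged sketch of what deterministic rownum slicing usually looks like in Oracle: an inner query with an explicit ORDER BY, wrapped so the slice can be filtered, which is exactly the materialize-then-filter cost described above (table and column names are placeholders):

import java.sql.*;

class RownumSlice {
    // The inner ORDER BY forces a full ordered pass over the data
    // before the outer levels can cut out rows lo+1 .. hi.
    static void readSlice(Connection con, long lo, long hi) throws SQLException {
        String sql =
            "SELECT id, payload FROM ("
          + "  SELECT t.id, t.payload, ROWNUM rn FROM ("
          + "    SELECT id, payload FROM record_table ORDER BY id"
          + "  ) t WHERE ROWNUM <= ?"
          + ") WHERE rn > ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, hi);
            ps.setLong(2, lo);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // application-specific processing
                }
            }
        }
    }
}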

So rownum is not a good choice for splitting the results. rowid seems better, but it has some issues too.

If you look at the ROWID description in the Oracle documentation, you may notice that a "rowid value uniquely identifies a row in the database".

Because of that, and because deleting a row leaves a "hole" in the rowid sequence, rowids may not be distributed equally among the table's records.

So, for example, if you have three threads and each takes a range of 1,000,000 rowids, it is possible that one range contains 1,000,000 actual records while the other two contain 1 record each. One thread is then overwhelmed while the other two starve.

In your case this may not be a big problem, though it could very well be the issue you are currently facing with the primary-key patterns.

Alternatively, if you first fetch all rowids in a dispatcher and then divide them equally (as peter.petrov suggested), that could work, though fetching 80 million ids still sounds like a lot; I think it would be better to do the splitting with a single SQL query that returns the borders of the chunks.
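
A sketch of that splitting idea, under the assumption that bucketing rowids with an analytic function such as NTILE and taking the per-bucket min/max is acceptable for your table (record_table and the chunk count are placeholders):

import java.sql.*;
import java.util.*;

class ChunkBorders {
    // One query returns [min rowid, max rowid] per chunk, so the dispatcher never
    // has to pull all 80 million ids into the application.
    static List<String[]> fetchBorders(Connection con, int chunks) throws SQLException {
        String sql =
            "SELECT MIN(rid), MAX(rid) FROM ("
          + "  SELECT ROWID rid, NTILE(" + chunks + ") OVER (ORDER BY ROWID) bucket"
          + "  FROM record_table"
          + ") GROUP BY bucket";
        List<String[]> borders = new ArrayList<>();
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                borders.add(new String[] { rs.getString(1), rs.getString(2) });
            }
        }
        // Each worker could then read its slice with WHERE ROWID BETWEEN border[0] AND border[1].
        return borders;
    }
}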

Or you might solve the problem by giving each task a small number of rowids and using the Fork-Join framework introduced in Java 7, although it should be used carefully.
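
A hedged sketch of that Fork-Join idea: a RecursiveAction keeps splitting a list of rowids until each task holds a small batch and then processes it. The threshold, the rowid source, and processBatch are placeholders, not from the original post:

import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class RowidBatchTask extends RecursiveAction {
    private static final int THRESHOLD = 10_000;   // "low amount of rowids per task"
    private final List<String> rowids;

    RowidBatchTask(List<String> rowids) { this.rowids = rowids; }

    @Override
    protected void compute() {
        if (rowids.size() <= THRESHOLD) {
            processBatch(rowids);                  // small enough: do the work directly
        } else {
            int mid = rowids.size() / 2;           // otherwise split in half and fork
            invokeAll(new RowidBatchTask(rowids.subList(0, mid)),
                      new RowidBatchTask(rowids.subList(mid, rowids.size())));
        }
    }

    private void processBatch(List<String> batch) {
        // application-specific: fetch and process these rows over a pooled JDBC connection
    }

    public static void main(String[] args) {
        List<String> allRowids = List.of();        // would be supplied by the dispatcher in practice
        new ForkJoinPool().invoke(new RowidBatchTask(allRowids));
    }
}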

Another obvious point: neither rownum nor rowid is portable across databases.

So it is much better to have your own "sharding" column, but then you have to make sure yourself that it splits the records into more or less equal chunks.
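
For illustration, assuming a numeric sharding column shard_key populated on insert (e.g. with MOD of a sequence value by the number of workers, which keeps the slices roughly equal), each worker's query reduces to a simple equality:

import java.sql.*;

class ShardedReader {
    // shard_key is a hypothetical column holding values 0 .. N-1, one value per worker thread.
    static void readShard(Connection con, int shard) throws SQLException {
        String sql = "SELECT id, payload FROM record_table WHERE shard_key = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setFetchSize(1000);
            ps.setInt(1, shard);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // application-specific processing
                }
            }
        }
    }
}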

Also keep in mind that if you are going to do this from several threads, it is important to check what locking mode the database uses; if it simply locks the table on every access, multithreading is pointless.

As others suggested, you would be better off first finding the main reason for the low performance: network, disk, database locking, thread starvation, or maybe just suboptimal queries (check the query plans).
