Java ETL process


Problem description


I have this new challenge to load ~100M rows from an Oracle database and insert them into a remote MySQL database server.

I've divided the problem in two:

1. a server-side REST server responsible for loading data into the MySQL server;
2. a client-side Java app responsible for loading from the Oracle data source.

On the Java side I've used plain JDBC to load paginated content and transfer it over the wire to the server. This approach works, but it makes the code cumbersome and not very scalable, since I'm doing the pagination myself using Oracle's ROWNUM ... WHERE ROWNUM > x AND ROWNUM < y.
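That pagination pattern has a well-known Oracle wrinkle: a bare `WHERE ROWNUM > x` never matches, because ROWNUM is assigned as rows are returned, so the usual workaround nests the query and filters on an aliased ROWNUM. A minimal sketch of building such a page query (table and column names are illustrative, not from the original code):

```java
public class RownumPager {
    /** Builds a query returning rows (low, high] of the inner SELECT,
     *  using the classic nested-ROWNUM Oracle pagination pattern. */
    public static String pageQuery(String innerSelect, long low, long high) {
        return "SELECT * FROM ("
             + "SELECT q.*, ROWNUM rn FROM (" + innerSelect + ") q "
             + "WHERE ROWNUM <= " + high
             + ") WHERE rn > " + low;
    }

    public static void main(String[] args) {
        // Hypothetical source query; the inner SELECT needs a stable ORDER BY
        // for the pages to be consistent across calls.
        System.out.println(pageQuery("SELECT id, name FROM customers ORDER BY id", 0, 1000));
    }
}
```

The inner ORDER BY is essential: without it, Oracle gives no guarantee that successive pages cover disjoint rows.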

I've now tried Hibernate's StatelessSession with my entities mapped through annotations. The code is much more readable and clean, but the performance is worse.

I've heard of ETL tools and Spring Batch but I don't know them very well. Are there other approaches to this problem?

Thanks in advance.

UPDATE

Thank you for the invaluable suggestions. I've opted for Spring Batch to load data from the Oracle database, because the environment is pretty tight and I don't have access to Oracle's toolset; Spring Batch is tried and true. For the data-writing step I opted for writing chunks of records using MySQL's LOAD DATA INFILE, as you all suggested. REST services sit in the middle, since the two databases are hidden from each other for security reasons.
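The chunked LOAD DATA INFILE write described above can be sketched as follows. The file path, table name, and delimiters are assumptions for illustration; executing the statement over JDBC additionally requires the Connector/J URL property `allowLoadLocalInfile=true` and `local_infile` enabled on the server.

```java
public class ChunkLoader {
    /** Builds the LOAD DATA LOCAL INFILE statement for one chunk file
     *  (comma-separated, double-quoted fields, one row per line). */
    public static String loadStatement(String csvPath, String table) {
        return "LOAD DATA LOCAL INFILE '" + csvPath + "'"
             + " INTO TABLE " + table
             + " FIELDS TERMINATED BY ',' ENCLOSED BY '\"'"
             + " LINES TERMINATED BY '\\n'";
    }

    // With a live connection (hypothetical names), the writer step would do:
    //   try (java.sql.Statement st = conn.createStatement()) {
    //       st.execute(loadStatement("/tmp/chunk-0001.csv", "target_table"));
    //   }
}
```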

Solution

100M rows is quite a lot. You can design this in plenty of ways: REST servers, JDBC reads, Spring Batch, Spring Integration, Hibernate, ETL tools. But the bottom line is: time.

No matter what architecture you choose, you eventually have to perform these INSERTs into MySQL. Your mileage may vary, but just to give you an order of magnitude: at 2K inserts per second it will take half a day to populate MySQL with 100M rows (source).
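Both of the order-of-magnitude figures quoted here are plain division, which a few lines make explicit:

```java
public class LoadEstimate {
    /** Whole seconds needed to insert the given row count at a fixed rate. */
    public static long seconds(long rows, long rowsPerSecond) {
        return rows / rowsPerSecond;
    }

    public static void main(String[] args) {
        // 100M rows at 2K inserts/s: 50,000 s, roughly 14 hours ("half a day").
        System.out.println(seconds(100_000_000L, 2_000L));
        // 100M rows at 25K inserts/s: 4,000 s, i.e. about an hour of work.
        System.out.println(seconds(100_000_000L, 25_000L));
    }
}
```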

According to the same source, LOAD DATA INFILE can handle around 25K inserts/second (roughly 10x more, i.e. about an hour of work).

That being said, with such an amount of data I would suggest:

• dump the Oracle table using native Oracle tools that produce human-readable output (or machine-readable, as long as you can parse it)

• parse the dump file using the fastest tools you have; maybe grep/sed/gawk/cut will be enough?

• generate a target file compatible with MySQL's LOAD DATA INFILE (it is very configurable)

• import the file into MySQL using that command
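If the generate step lands in Java after all, the only delicate part is per-row formatting. A minimal sketch, assuming LOAD DATA INFILE's default format (tab-separated fields, backslash escapes, \N for SQL NULL):

```java
import java.util.List;
import java.util.stream.Collectors;

public class DumpToInfile {
    /** Formats one parsed row in LOAD DATA INFILE's default format:
     *  tab-separated, backslash-escaped, \N for NULL. */
    public static String toRow(List<String> fields) {
        return fields.stream()
                .map(f -> f == null ? "\\N"
                        : f.replace("\\", "\\\\")
                           .replace("\t", "\\t")
                           .replace("\n", "\\n"))
                .collect(Collectors.joining("\t"));
    }
}
```

Escaping the backslash first matters; otherwise the escapes introduced for tabs and newlines would themselves be doubled.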

Of course you can do all of this in Java, with nice, readable, unit-tested, versioned code. But with this amount of data you need to be pragmatic.

That covers the initial load. After that, Spring Batch will probably be a good choice. If you can, try to connect your application directly to both databases; again, this will be faster. On the other hand, this might not be possible for security reasons.

If you want to stay flexible and not tie yourself to the databases directly, expose both the input (Oracle) and the output (MySQL) behind web services (REST is fine as well). Spring Integration will help you a lot.
