Move data from Oracle to HDFS, process and move to Teradata from HDFS


Problem description

My requirement is to

  1. Move data from Oracle to HDFS
  2. Process the data on HDFS
  3. Move processed data to Teradata.

This entire process also needs to run every 15 minutes. The volume of source data may be close to 50 GB, and the processed data may be of a similar size.

After searching a lot on the internet, I found the following:

  1. ORAOOP to move data from Oracle to HDFS (have the code within a shell script and schedule it to run at the required interval).
  2. Do large-scale processing with custom MapReduce, Hive, or Pig.
  3. Sqoop with the Teradata connector to move data from HDFS to Teradata (again, a shell script with the code, scheduled the same way). A rough sketch of such a wrapper script follows this list.
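Roughly, the kind of wrapper script I have in mind looks like the sketch below; all connection strings, credentials, paths and table names are placeholders, and OraOop would plug into the same sqoop import call:

#!/bin/bash
# pipeline.sh -- sketch of one 15-minute batch (placeholder names throughout)
set -e

OUT_DIR=/data/processed/$(date +%Y%m%d%H%M)

# 1. Pull the latest rows from Oracle into a Hive staging table
sqoop import \
  --connect jdbc:oracle:thin:@myhost:1521/db \
  --username xxx --password yyy \
  --table SRC_TABLE \
  --hive-import --hive-overwrite --hive-table staging_tbl

# 2. Process on HDFS (Hive shown here; could be Pig or custom MapReduce instead)
hive -e "INSERT OVERWRITE DIRECTORY '$OUT_DIR' SELECT * FROM staging_tbl"

# 3. Push the processed output to Teradata
sqoop export \
  --connect jdbc:teradata://tdhost/DATABASE=MY_BASE \
  --username sqooptest --password xxxxx \
  --table TGT_TABLE --export-dir "$OUT_DIR"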

Is this the right approach in the first place, and is it feasible within the required time period (note that this is not a daily batch or similar)?

Other options that I found are the following:

  1. Storm (for real-time data processing). But I am not able to find an Oracle spout or a Teradata bolt out of the box.
  2. Any open-source ETL tool, such as Talend or Pentaho.

Please share your thoughts on these options, as well as on any other possibilities.

Solution

Looks like you have several questions, so let's try to break it down.

Importing into HDFS

It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data in and out of HDFS, and it can connect natively to various databases, including Oracle. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:

sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir

For more information, see here and here. Note that you can also import directly into a Hive table with Sqoop, which can be convenient for your analysis.
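For example, a direct-to-Hive import could look like the following (the Hive table name is just a placeholder):

sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --hive-import --hive-table tbl_staging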

Processing

As you noted, since your data is initially relational, it is a good idea to use Hive for your analysis, since you might be more familiar with SQL-like syntax. Pig is closer to pure relational algebra and its syntax is NOT SQL-like; it is mostly a matter of preference, but both approaches should work fine.

Since you can import data into Hive directly with Sqoop, your data should be ready to be processed as soon as it is imported.

In Hive you can run your query and tell it to write the results to HDFS:

hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
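The same pattern works for any HiveQL transformation, not just SELECT *; for instance, an aggregation (the customer_id and amount columns are hypothetical) could be written out like this:

hive -e "INSERT OVERWRITE DIRECTORY '/path/to/output' SELECT customer_id, SUM(amount) AS total_amount FROM mytable GROUP BY customer_id"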

Exporting to Teradata

Last year, Cloudera released a Teradata connector for Sqoop, as described here, so you should take a look, as it seems to be exactly what you want. Here is how you would do it:

sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output


The whole thing is definitely doable in whatever time period you want; in the end, what will matter is the size of your cluster. If you want it to be quick, scale your cluster up as needed. The good thing with Hive and Sqoop is that the processing will be distributed across your cluster, so you have total control over the schedule.
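For the 15-minute cadence specifically, one simple option is to wrap the Sqoop and Hive steps in a single shell script and schedule it with cron (the script and log paths below are hypothetical):

*/15 * * * * /path/to/pipeline.sh >> /var/log/hdfs_pipeline.log 2>&1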
