Import from MySQL to Hive using Sqoop
Problem Description
I have to import more than 400 million rows from a MySQL table (with a composite primary key) into a partitioned Hive table via Sqoop. The table holds two years of data, with a departure-date column ranging from 20120605 to 20140605 and thousands of records per day. I need to partition the data by departure date.
The versions:
Apache Hadoop - 1.0.4
Apache Hive - 0.9.0
Apache Sqoop - sqoop-1.4.2.bin__hadoop-1.0.0
As per my knowledge, there are 3 approaches:
1. MySQL -> Non-partitioned Hive table -> INSERT from the non-partitioned Hive table into the partitioned Hive table. This is the painful one that I'm currently following.
2. MySQL -> Partitioned Hive table. I read that support for this was added in later(?) versions of Hive and Sqoop, but I was unable to find an example.
3. MySQL -> Non-partitioned Hive table -> ALTER the non-partitioned Hive table to add a PARTITION. The syntax dictates specifying partitions as key-value pairs, which is not feasible with millions of records where one cannot think of all the partition key-value pairs.
Can anyone provide inputs for approaches 2 and 3?
Solution
If this is still something people want to understand, they can use:
sqoop import --driver <driver name> --connect <connection url> --username <user name> -P --table employee --num-mappers <numeral> --warehouse-dir <hdfs dir> --hive-import --hive-table table_name --hive-partition-key departure_date --hive-partition-value $departure_date
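Because `--hive-partition-value` takes a single literal value, loading the whole date range this way means one Sqoop invocation per departure date. A hedged sketch of such a driver loop follows; the command is echoed rather than executed, the connection URL and names are placeholders, and GNU `date` is assumed:

```shell
# Sketch: one Sqoop import per departure date, each landing in its own
# Hive partition. Connection URL, table and column names are placeholders.
start=20120605
end=20120608          # use 20140605 for the full two-year range
dates=""
d="$start"
while [ "$d" -le "$end" ]; do
  dates="$dates $d"
  # Echoed for illustration only; remove `echo` to actually run Sqoop.
  echo sqoop import \
    --connect "jdbc:mysql://dbhost/dbname" --username user -P \
    --table employee \
    --where "departure_date = $d" \
    --hive-import --hive-table table_name \
    --hive-partition-key departure_date \
    --hive-partition-value "$d"
  d=$(date -d "$d + 1 day" +%Y%m%d)   # advance one calendar day (GNU date)
done
```

With 730 days of data, this launches 730 imports, so in practice you would throttle or batch the loop.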
Notes from the patch:
sqoop import [all other normal command line options] --hive-partition-key ds --hive-partition-value "value"
Some limitations:
- It only allows for one partition key/value pair
- The type of the partition key is hardcoded to be a string
- With auto-partitioning in Hive 0.7, we may want to adjust this to just have one command-line option for the key name, and use that column from the DB table to partition.
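Given those limitations, approach 1 from the question remains workable, and its INSERT step can use Hive's dynamic partitioning (available in the Hive 0.9.0 listed above) so the partitions need not be enumerated by hand. A minimal sketch, assuming a staging table already loaded by a plain Sqoop import; all table and column names here are illustrative, not from the original schema:

```shell
# Sketch of the dynamic-partition INSERT for approach 1.
# Assumptions: a staging table `flights_staging` was loaded by a plain
# Sqoop import, and `flights` is partitioned by departure_date. Names
# are hypothetical. The HiveQL is written to a file for review here;
# run it on the cluster with `hive -f load_partitions.hql`.
cat > load_partitions.hql <<'EOF'
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- The dynamic partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE flights PARTITION (departure_date)
SELECT id, carrier, departure_date
FROM flights_staging;
EOF
```

This creates one partition per distinct departure_date in a single pass, instead of one Sqoop run per day.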