Import from MySQL to Hive using Sqoop


Problem description

I have to import more than 400 million rows from a MySQL table (having a composite primary key) into a partitioned Hive table via Sqoop. The table holds two years of data, with a departure date column ranging from 20120605 to 20140605, and thousands of records per day. I need to partition the data based on the departure date.

The versions:

Apache Hadoop - 1.0.4
Apache Hive - 0.9.0
Apache Sqoop - sqoop-1.4.2.bin__hadoop-1.0.0

As per my knowledge, there are 3 approaches:

1. MySQL -> Non-partitioned Hive table -> INSERT from the non-partitioned Hive table into the partitioned Hive table
2. MySQL -> Partitioned Hive table
3. MySQL -> Non-partitioned Hive table -> ALTER the non-partitioned Hive table to add a PARTITION

1. is the current painful one that I'm following (sketched below for reference)

2. I read that support for this was added in later(?) versions of Hive and Sqoop, but I was unable to find an example

3. The syntax dictates specifying partitions as key-value pairs, which is not feasible in the case of millions of records where one cannot think of all the partition key-value pairs

Can anyone provide inputs for approaches 2 and 3?
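
For context, approach 1 boils down to a plain import followed by a dynamic-partition INSERT, roughly as in the following sketch (the connection URL and the flights/flights_staging/flights_partitioned names are hypothetical; the SET lines are Hive's standard dynamic-partitioning options):

    # Step 1: plain import into a non-partitioned staging table
    sqoop import --connect jdbc:mysql://dbhost:3306/flightdb --username etl -P \
        --table flights --num-mappers 8 \
        --hive-import --hive-table flights_staging

    # Step 2: let Hive create one partition per distinct departure_date;
    # the partition column must come last in the SELECT list
    hive -e "
        SET hive.exec.dynamic.partition=true;
        SET hive.exec.dynamic.partition.mode=nonstrict;
        SET hive.exec.max.dynamic.partitions=1000;
        SET hive.exec.max.dynamic.partitions.pernode=1000;
        INSERT OVERWRITE TABLE flights_partitioned PARTITION (departure_date)
        SELECT col1, col2, departure_date FROM flights_staging;
    "

(col1, col2 stand in for the real column list.)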

Solution

If this is still something people want to understand, they can use

    sqoop import --driver <driver name> --connect <connection url> --username <user name> -P --table employee --num-mappers <number> --warehouse-dir <hdfs dir> --hive-import --hive-table table_name --hive-partition-key departure_date --hive-partition-value $departure_date
    
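For a single day, a filled-in invocation might look like the following (the connection string, credentials, and table names are illustrative only):

    sqoop import --driver com.mysql.jdbc.Driver \
        --connect jdbc:mysql://dbhost:3306/flightdb --username etl -P \
        --table flights --num-mappers 8 --warehouse-dir /user/hive/warehouse \
        --hive-import --hive-table flights \
        --hive-partition-key departure_date --hive-partition-value 20120605
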

Notes from the patch:

    sqoop import [all other normal command line options] --hive-partition-key ds --hive-partition-value "value"
    

Some limitations:

• It only allows for one partition key/value (see the loop sketch below)
• The type of the partition key is hardcoded to be a string
• With auto-partitioning in Hive 0.7, we may want to adjust this to just have one command line option for the key name and use that column from the db table to partition.
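
Given the one-key/one-value limitation, covering a whole date range with this option means one import per day, for example with a driver loop like this sketch (names are hypothetical; you may also need --columns to keep departure_date out of the row data, since it becomes the Hive partition column):

    # One sqoop run per departure_date partition (~730 runs for two years)
    d=20120605
    while [ "$d" -le 20140605 ]; do
        sqoop import --connect jdbc:mysql://dbhost:3306/flightdb --username etl -P \
            --table flights --where "departure_date = $d" --num-mappers 4 \
            --hive-import --hive-table flights_partitioned \
            --hive-partition-key departure_date --hive-partition-value "$d"
        d=$(date -d "$d + 1 day" +%Y%m%d)   # GNU date
    done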

