Sqoop Import Split by Column Data Type


Question

Must the --split-by column in a sqoop import always be of a numeric data type (integer, bigint, numeric)? Can't it be a string?

Solution

Yes, you can split on a non-numeric data type.

But this is not recommended.

WHY?

To split the data, Sqoop first fires a query of the form

SELECT MIN(col1), MAX(col1) FROM TABLE

and then divides the resulting range according to your number of mappers.
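As a rough illustration of that second step, here is a minimal Python sketch of how a list of split points could be turned into one WHERE predicate per mapper. The function name `split_conditions` and the exact predicate format are assumptions made for illustration, and Sqoop's real query generation may differ in detail.

```python
def split_conditions(column, points):
    """Turn ordered split points into one WHERE predicate per mapper.

    Hypothetical helper, not part of Sqoop: `points` holds the MIN value,
    the intermediate split points, and the MAX value, in ascending order.
    """
    conditions = []
    last = len(points) - 2
    for i in range(len(points) - 1):
        lo, hi = points[i], points[i + 1]
        # Every range is half-open except the last one, which must include MAX.
        upper_op = "<=" if i == last else "<"
        conditions.append(f"{column} >= {lo} AND {column} {upper_op} {hi}")
    return conditions


for condition in split_conditions("col1", [1, 26, 51, 76, 100]):
    print(condition)
```

Each mapper would then import only the rows matching its own predicate, which is why evenly sized ranges matter.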

Now take an integer --split-by column as an example.

The table has an id column with values from 1 to 100, and you are using 4 mappers (-m 4 in your sqoop command).

Sqoop gets the MIN and MAX values using:

SELECT MIN(id), MAX(id) FROM TABLE

OUTPUT:

1,100

Splitting an integer range is easy. You get 4 parts:

  • 1-25
  • 26-50
  • 51-75
  • 76-100
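The arithmetic behind such an even, non-overlapping split can be sketched in a few lines of Python. This is a simplified illustration, not Sqoop's actual integer splitter; the function name `integer_splits` is hypothetical.

```python
def integer_splits(lo, hi, num_mappers):
    """Split the inclusive range [lo, hi] into contiguous inclusive sub-ranges.

    Simplified sketch (an assumption, not Sqoop's code): any remainder is
    spread over the first few ranges so that sizes differ by at most one.
    """
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    splits = []
    start = lo
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than values: stop creating empty ranges
        end = start + size - 1
        splits.append((start, end))
        start = end + 1
    return splits


print(integer_splits(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```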

Now take a string --split-by column as an example.

The table has a name column with values from "dev" to "sam", and you are using 4 mappers (-m 4 in your sqoop command).

Sqoop gets the MIN and MAX values using:

SELECT MIN(name), MAX(name) FROM TABLE

OUTPUT:

dev,sam

Now how will this range be divided into 4 parts? According to the comment in the Sqoop source code:

/**
   * This method needs to determine the splits between two user-provided
   * strings.  In the case where the user's strings are 'A' and 'Z', this is
   * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
   * splits for strings beginning with each letter, etc.
   *
   * If a user has provided us with the strings "Ham" and "Haze", however, we
   * need to create splits that differ in the third letter.
   *
   * The algorithm used is as follows:
   * Since there are 2**16 unicode characters, we interpret characters as
   * digits in base 65536. Given a string 's' containing characters s_0, s_1
   * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
   * base 65536. Having mapped the low and high strings into floating-point
   * values, we then use the BigDecimalSplitter to establish the even split
   * points, then map the resulting floating point values back into strings.
   */
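The idea in that comment can be sketched in Python as follows. This is only an illustration under stated assumptions, not Sqoop's implementation: the function names are hypothetical, the precision limit `MAX_CHARS = 8` is an assumed value, all characters are assumed to fit in 16 bits, and exact fractions stand in for the BigDecimal arithmetic.

```python
from fractions import Fraction

BASE = 65536   # one "digit" per 16-bit character, as the comment describes
MAX_CHARS = 8  # assumed precision limit for this sketch


def string_to_fraction(s):
    """Interpret s as the number 0.s_0 s_1 s_2 ... in base 65536."""
    result = Fraction(0)
    place = Fraction(1, BASE)
    for ch in s[:MAX_CHARS]:
        result += ord(ch) * place
        place /= BASE
    return result


def fraction_to_string(value):
    """Map a fraction in [0, 1) back to a string, one character per digit."""
    chars = []
    for _ in range(MAX_CHARS):
        value *= BASE
        digit = int(value)
        if digit == 0:
            break  # no further non-zero digits worth emitting
        value -= digit
        chars.append(chr(digit))
    return "".join(chars)


def text_split_points(lo, hi, num_splits):
    """Return num_splits + 1 evenly spaced split points between lo and hi."""
    lo_f, hi_f = string_to_fraction(lo), string_to_fraction(hi)
    step = (hi_f - lo_f) / num_splits
    return [fraction_to_string(lo_f + i * step) for i in range(num_splits + 1)]


points = text_split_points("dev", "sam", 4)
print([p[0] for p in points])  # ['d', 'g', 'k', 'o', 's']
```

Only the first character of each split point is printed because the remaining characters of the intermediate points are arbitrary code points, often outside the alphabet the data actually uses. That is part of why text splits line up poorly with real data.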

You will also see these warnings in the code:

LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
    + "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");

In the integer example, all the mappers get a balanced load (each fetches 25 records from the RDBMS, assuming every id from 1 to 100 is present).

In the string case, the values are much less likely to be spread evenly across the computed ranges, so it is difficult to give all the mappers similar loads.
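To make the difference concrete, here is a small Python sketch that counts how many rows land in each split. The sample names and the single-letter string boundaries are invented for illustration, and `rows_per_split` is a hypothetical helper rather than anything Sqoop provides.

```python
from bisect import bisect_right


def rows_per_split(values, boundaries):
    """Count values per split [b_i, b_{i+1}); the last split also includes MAX.

    Assumes every value lies between boundaries[0] and boundaries[-1].
    """
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        idx = min(bisect_right(boundaries, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


# Integer ids 1..100 with evenly spaced boundaries: a balanced load.
ids = list(range(1, 101))
print(rows_per_split(ids, [1, 26, 51, 76, 100]))  # [25, 25, 25, 25]

# Hypothetical names clustered near the start of the range: a skewed load.
names = ["dev", "dhruv", "diana", "dinesh", "divya", "eli", "emma",
         "frank", "henry", "kate", "sam"]
print(rows_per_split(names, ["dev", "g", "k", "o", "sam"]))  # [8, 1, 1, 1]
```

With the skewed data, one mapper does most of the work while the other three sit nearly idle.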


In a nutshell, choose an integer column as the --split-by column.
