Sqoop导入按列数据类型拆分 [英] Sqoop Import Split by Column Data type

查看:29
本文介绍了Sqoop导入按列数据类型拆分的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

sqoop 导入中按列拆分的数据类型是否应该始终是数字数据类型(整数、bignint、数字)?不能是字符串吗?

Should the datatype of Split by column in sqoop import always be a number datatype (integer, bignint, numeric)? Can't it be a string?

推荐答案

是的,您可以拆分任何非数字数据类型.

Yes you can split on any non numeric datatype.

但不推荐这样做.

用于拆分数据 Sqoop 触发

For splitting data Sqoop fires

SELECT MIN(col1), MAX(col2) FROM TABLE

然后根据您的映射器数量对其进行划分.

then divide it as per you number of mappers.

现在以--split-by列的整数为例

Now take an example of integer as --split-by column

表有一些 id 列的值从 1 到 100,并且您使用了 4 个映射器(-m 4 在您的 sqoop 命令中)

Table has some id column having value 1 to 100 and you using 4 mappers (-m 4 in your sqoop command)

Sqoop 使用以下方法获取 MIN 和 MAX 值:

Sqoop get MIN and MAX value using:

SELECT MIN(id), MAX(id) FROM TABLE

输出:

1,100

对整数进行拆分很容易.您将制作 4 个部分:

Splitting on integer is easy. You will make 4 parts:

  • 1-25
  • 25-50
  • 51-75
  • 76-100

现在字符串为--split-by

表有一些 name 列的值dev"到sam",并且您使用了 4 个映射器(-m 4 在您的 sqoop 命令中)

Table has some name column having value "dev" to "sam" and you using 4 mappers (-m 4 in your sqoop command)

Sqoop 使用以下方法获取 MIN 和 MAX 值:

Sqoop get MIN and MAX value using:

SELECT MIN(id), MAX(id) FROM TABLE

输出:

开发,山姆

现在如何将其分为 4 个部分.根据 sqoop #L47" rel="noreferrer"文档

Now how will it be divided in 4 parts. As per sqoop docs,

/**
   * This method needs to determine the splits between two user-provided
   * strings.  In the case where the user's strings are 'A' and 'Z', this is
   * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
   * splits for strings beginning with each letter, etc.
   *
   * If a user has provided us with the strings "Ham" and "Haze", however, we
   * need to create splits that differ in the third letter.
   *
   * The algorithm used is as follows:
   * Since there are 2**16 unicode characters, we interpret characters as
   * digits in base 65536. Given a string 's' containing characters s_0, s_1
   * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
   * base 65536. Having mapped the low and high strings into floating-point
   * values, we then use the BigDecimalSplitter to establish the even split
   * points, then map the resulting floating point values back into strings.
   */

你会在代码中看到警告:

And you will see the warning in the code:

LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
    + "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");

在整数示例的情况下,所有映射器将获得平衡负载(所有映射器将从 RDBMS 获取 25 条记录).

In case of Integer example, all the mappers will get balanced load (all will fetch 25 records from RDBMS).

在字符串的情况下,数据被排序的可能性较小.因此,很难为所有映射器提供相似的负载.

In case of string, there is less probability that data is sorted. So, it's difficult to give similar loads to all the mappers.

简而言之,将整数列作为 --split-by 列.

In a nutshell, Go for integer column as --split-by column.

这篇关于Sqoop导入按列数据类型拆分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆