Sqoop import - Source table schema change


Problem description



    Let's say there is a table called T1 with 100+ columns in a relational database. I import this table into HDFS as CSV using Sqoop.

    Now 10 more columns are added to T1. If I import the table into HDFS again, the new data will have 10 more columns than before.

    Questions:

    1. How does Sqoop order the columns it imports, so that the old and the new data line up (at least for the columns that existed before T1 changed)?

    2. Do new columns always get imported at the end?

    3. What if a column gets deleted? How should this be handled, i.e. how do the old data and the new data keep the same positions?

    Solution

    How does Sqoop order the columns it imports, so that the old and the new data line up (at least for the columns that existed before T1 changed)?

    Hadoop-based tools generally do not enforce a schema when writing data to HDFS, and Sqoop knows nothing about the columns of the files already in HDFS, so by default it will not update the old data with the new fields. For new data, it all depends on how you write the sqoop import command. If you use --table without --columns, the columns arrive in the order defined on the source table. If you use --query to supply a custom query, the order follows the column order of the SELECT clause in that query. If you do not want to list column names explicitly in the sqoop import, consider creating a view on the source database.
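One way to keep positions stable is to pin the column list explicitly instead of relying on the source table's order. The following is a minimal sketch, not a definitive setup: it only builds the argument list for `sqoop import`, and the JDBC URL, user name, password file path, column names, and target directory are all hypothetical placeholders. The flags themselves (`--table`, `--columns`, `--target-dir`, `--fields-terminated-by`, `--as-textfile`, `--password-file`) are standard Sqoop 1 import options.

```python
import shlex

# Hypothetical values for illustration; none of these come from the original post.
JDBC_URL = "jdbc:mysql://db.example.com/sales"
PINNED_COLUMNS = ["id", "name", "created_at"]  # the order the HDFS files should keep


def build_import_args(table, columns, target_dir):
    """Return a `sqoop import` argument list with the column order fixed."""
    return [
        "sqoop", "import",
        "--connect", JDBC_URL,
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",
        "--table", table,
        # Explicit list: only these columns are requested, in this order, so
        # columns added to the source table later should not shift positions.
        "--columns", ",".join(columns),
        "--target-dir", target_dir,
        "--fields-terminated-by", ",",
        "--as-textfile",
    ]


args = build_import_args("T1", PINNED_COLUMNS, "/data/t1/2024-01-01")
# Print the command for review; subprocess.run(args, check=True) would execute it.
print(shlex.join(args))
```

When new source columns should be picked up, appending them to the end of the pinned list keeps the earlier positions unchanged. A `--query` import with an explicit SELECT list would serve the same purpose.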

    Do new columns always get imported at the end?

    Not necessarily. As explained above, the position depends on the column order of the source table or of your query.

    What if a column gets deleted? How should this be handled, i.e. how do the old data and the new data keep the same positions?

    If columns are deleted, you will most likely have to reload the data or handle the difference at processing time according to rules you define. The better approach is to reload the data or to create a view on the source database.
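Handling the difference at processing time could look like the sketch below. It assumes you record, outside the headerless CSV files, which column layout each import was written with; the function name `realign_rows` and the example layouts are hypothetical. Rows are mapped by column name onto a target layout, filling columns the old file never had with a default and dropping columns that no longer exist.

```python
import csv
import io


def realign_rows(lines, file_columns, target_columns, default=""):
    """Yield rows reordered to target_columns.

    file_columns: column names, in order, that the file was written with
                  (the CSV has no header, so this must be recorded separately).
    Columns missing from the file are filled with `default`; columns that are
    not in target_columns are dropped.
    """
    for row in csv.reader(lines):
        if len(row) != len(file_columns):
            raise ValueError(f"expected {len(file_columns)} fields, got {len(row)}")
        by_name = dict(zip(file_columns, row))
        yield [by_name.get(col, default) for col in target_columns]


# Hypothetical layouts: the old import had (id, name, legacy_code); the source
# table later dropped legacy_code and added email.
old_layout = ["id", "name", "legacy_code"]
new_layout = ["id", "name", "email"]

old_file = io.StringIO("1,alice,X9\n2,bob,Y7\n")
aligned = list(realign_rows(old_file, old_layout, new_layout))
print(aligned)
```

This only works if every batch's layout is known, which is one more reason to pin the column list at import time or to import through a view.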

    These are not limitations of Sqoop itself; they are standard problems that require a custom solution regardless of the technology you use. The problem is too generic, so a ready-made API for it is probably not feasible.
