Avoiding Data Duplication when Loading Data from Multiple Servers

Problem description


I have a dozen web servers, each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded into Hive by a cron script that runs the command:

    hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
    

In some cases, the command fails and exits with a code other than 0, in which case our script waits and tries again. The problem is that in some of these failures the data is actually loaded, even though a failure message is shown. How can I know for sure whether or not the data has been loaded?

Example of such a "failure" where the data is loaded:

    Loading data to table default.my_table partition (dt=2015-08-17-05)
    Failed with exception org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter partition.
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

Edit: Alternatively, is there a way to query Hive for the names of the files loaded into it? I can use DESCRIBE to see the number of files. Can I get their names?

Solution

About "which files have been loaded in a partition":

• If you had used an EXTERNAL TABLE and just uploaded your raw data files to the HDFS directory mapped to LOCATION, then you could
  (a) just run hdfs dfs -ls on that directory from the command line (or use the equivalent Java API call), or
  (b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
• But in your case, you copy the data into a "managed" table, so there is no way to retrieve the data lineage (i.e. which log file was used to create each managed data file)
• ...unless you explicitly add the original file name inside the log file, of course (either in a "special" header record, or at the beginning of each record, which can be done with good old sed)
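The last option, writing the original file name at the beginning of each record, can also be done in the loading script itself rather than with sed. Below is a minimal Python sketch of that idea. The function name `tag_records`, the tab separator, and the example file names are all assumptions made for illustration; the target table would also need an extra leading column to receive the file name.

```python
import os
import tempfile


def tag_records(src_path, dst_path, sep="\t"):
    """Copy src_path to dst_path, prefixing every record with the
    base name of the source file. Returns the number of records written.

    Hypothetical helper: the separator must match the field delimiter
    of the Hive table that will receive the tagged file.
    """
    name = os.path.basename(src_path)
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(f"{name}{sep}{line}")
            count += 1
    return count


if __name__ == "__main__":
    # Small self-contained demo on a throwaway file.
    workdir = tempfile.mkdtemp()
    raw = os.path.join(workdir, "web01-2015-08-17-05.log")
    tagged = os.path.join(workdir, "web01-2015-08-17-05.tagged.log")
    with open(raw, "w", encoding="utf-8") as f:
        f.write("GET /index.html 200\nGET /missing 404\n")
    print(tag_records(raw, tagged), "records tagged")
```

Once every record carries its source file name, a plain `select distinct` on that column shows which files have reached the partition, even for a managed table.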

About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and it would cost you in processing time (an extra Map step plus a MapJoin)...

1. Map your log file to an EXTERNAL TABLE so that you can run an INSERT-SELECT query
2. Load the original file name into your managed table, using the INPUT__FILE__NAME pseudo-column as the source
3. Add a WHERE NOT EXISTS clause with a correlated sub-query, so that if the source file name is already present in the target, nothing more is loaded

    INSERT INTO TABLE Target
    SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
    FROM Source src
    WHERE NOT EXISTS
      (SELECT DISTINCT 1
       FROM Target trg
       WHERE trg.SrcFileName = src.INPUT__FILE__NAME
      )

Note the silly DISTINCT, which is actually required to avoid exhausting the RAM in your Mappers; it would be unnecessary with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
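Because the INSERT-SELECT above loads nothing when the source file name is already in the target, it can be retried safely even after a "failure" that actually loaded the data. The following Python sketch shows how the cron script's retry loop might wrap it. It is an illustration under assumptions, not a tested production script: `build_insert`, `run_hive`, and `load_with_retry` are hypothetical names, the retry count and wait time are arbitrary, and the only external call is the `hive -e` invocation already used in the question.

```python
import subprocess
import time


def build_insert(target, source, columns):
    """Build the idempotent INSERT-SELECT for the given tables.
    Table and column names are trusted input here (no escaping)."""
    cols = ", ".join(columns)
    return (
        f"INSERT INTO TABLE {target} "
        f"SELECT {cols}, INPUT__FILE__NAME AS SrcFileName "
        f"FROM {source} src "
        "WHERE NOT EXISTS ("
        f"SELECT DISTINCT 1 FROM {target} trg "
        "WHERE trg.SrcFileName = src.INPUT__FILE__NAME)"
    )


def run_hive(query):
    """Run a query through the hive CLI and return its exit code."""
    return subprocess.run(["hive", "-e", query]).returncode


def load_with_retry(query, run=run_hive, attempts=3, wait_seconds=60,
                    sleep=time.sleep):
    """Run the query until it exits with 0. Returns the number of the
    attempt that succeeded, or raises RuntimeError after the last one.
    Retrying is only safe because the query itself skips loaded files."""
    for attempt in range(1, attempts + 1):
        if run(query) == 0:
            return attempt
        if attempt < attempts:
            sleep(wait_seconds)
    raise RuntimeError(f"hive load failed after {attempts} attempts")


if __name__ == "__main__":
    # Only prints the statement; call load_with_retry(query) to execute it.
    print(build_insert("Target", "Source", ["ColA", "ColB", "ColC"]))
```

The `run` and `sleep` parameters exist so the loop can be exercised without a Hive installation, for example by passing a stub that returns canned exit codes.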
