How to load XML files into Hive

Question

I'm working with Hive tables and have run into the following problem. I have more than 1 billion XML files in my HDFS. Each XML file has 4 different sections, and for every file I want to split out each section and load it into its own table.

Example:

            <?xml version='1.0' encoding='iso-8859-1'?>

            <section1>
                <id> 1233222 </id>
               // many more XML tags here
            </section1>

            <section2>
               // many more XML tags here
            </section2>

            <section3>
               // many more XML tags here
            </section3>

            <section4>
               // many more XML tags here
            </section4>

            </xml>

I have four tables:

        section1Table

        id       section1    // fields 

        section2Table

        id       section2

        section3Table 

        id       section3

        section4Table

        id       section4

Now I want to split the data and load it into each table.

How can I achieve this? Can anyone help me?

Thanks

Update

I have tried the following:

CREATE EXTERNAL TABLE test(name STRING) LOCATION '/user/sornalingam/zipped/output/Tagged/t1';


SELECT xpath (name, '//section1') FROM test LIMIT 1 ;

But I get the following error:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"name":"<?xml version='1.0' encoding='iso-8859-1'?>"}

Recommended answer

You have several options:

  • Load the XML into a Hive table with a string column, one document per row (e.g. CREATE TABLE xmlfiles (id int, xmlfile string)). Then use an XPath UDF to work on the XML.
  • Since you know the XPaths of what you want (e.g. //section1), follow the instructions in the second half of this tutorial to ingest directly into Hive via XPath.
  • Map your XML to Avro as described here, because a SerDe exists for seamless Avro-to-Hive mapping.
  • Use XPath to store your data in a regular text file in HDFS and then ingest that into Hive.
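A note on the first option: the error in the question shows a row that contains only the `<?xml ...?>` declaration, which suggests the table is reading each file line by line, so `xpath` never sees a complete document. One possible workaround is to flatten every file onto a single line before loading it. The following is a minimal Python sketch of that idea, not part of the original answer. `flatten_xml` is a hypothetical helper name, and the sketch assumes each file is otherwise well-formed XML with a single root element (the sample in the question has several top-level sections and would probably need a wrapping element first).

```python
import re


def flatten_xml(text: str) -> str:
    """Return an XML document as one line, without the XML declaration.

    Hypothetical helper for illustration: it only rewrites whitespace and
    does not validate the XML.
    """
    # Drop the <?xml ...?> declaration so the row starts at the root element.
    body = re.sub(r"<\?xml[^>]*\?>", "", text)
    # Remove whitespace that sits purely between two tags (indentation, newlines).
    body = re.sub(r">\s+<", "><", body.strip())
    # Replace line breaks and tabs left inside text nodes with a single space,
    # so the document cannot be split across rows or delimited columns.
    return re.sub(r"[\r\n\t]+", " ", body)
```

Each flattened document can then be written as one line of a text file and loaded into a table with a single string column, after which a query such as the `xpath(name, '//section1')` call from the question operates on a whole document per row.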

Which one you choose depends on your level of experience and comfort with these approaches.
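For the last option (extract with XPath outside Hive, then ingest plain text), a rough Python sketch might look like the following. It is an illustration under stated assumptions rather than the answerer's method: `split_sections` and `SECTIONS` are hypothetical names, the input is assumed to be well-formed XML with the four sections nested under a single root element, and the `id` is assumed to live in `section1` as in the question's sample. The output format, `id<TAB>section XML`, is likewise just one possible choice that matches the two-column tables described in the question.

```python
import xml.etree.ElementTree as ET

# Section names taken from the sample in the question.
SECTIONS = ("section1", "section2", "section3", "section4")


def split_sections(xml_bytes: bytes) -> dict:
    """Map each section name found in the document to an 'id<TAB>xml' row.

    Hypothetical helper: assumes the sections sit below a single root
    element and that the document id is stored in section1/id.
    """
    root = ET.fromstring(xml_bytes)
    doc_id = root.findtext(".//section1/id", default="").strip()
    rows = {}
    for name in SECTIONS:
        node = root.find(f".//{name}")
        if node is None:
            continue  # This document does not contain the section.
        payload = ET.tostring(node, encoding="unicode")
        # Collapse all whitespace runs (including newlines and tabs) to one
        # space so every row stays on a single tab-delimited line.
        rows[name] = f"{doc_id}\t{' '.join(payload.split())}"
    return rows
```

The rows for each section could be appended to one text file per section in HDFS and exposed through tables declared with a tab field delimiter. With a billion small files, running this kind of extraction inside a distributed job rather than file by file is likely to matter more than the parsing code itself.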
