Exception while using lateral view in Hive


I am using the code below to parse XML data in Hive. In my XML data, a few tags repeat, so I am using the brickhouse jar and a lateral view to parse the tags and place them in a Hive table. But when I execute my code, I get an error. Please help, as I am not able to understand what I am doing wrong.

Code:

add jar /home/cloudera/brickhouse-0.5.5.jar;
CREATE TEMPORARY FUNCTION numeric_range AS 'brickhouse.udf.collect.NumericRange';
CREATE TEMPORARY FUNCTION array_index AS 'brickhouse.udf.collect.ArrayIndexUDF';
add jar /home/cloudera/hivexmlserde-1.0.5.3.jar;
set hive.exec.mode.local.auto=false;
DROP TABLE IF EXISTS medinfo2;
create table medinfo2 as
select array_index(statusCode,n) AS statusCode,
    array_index(startTime,n) AS startTime,
    array_index(endTime,n) AS endTime,
    array_index(strengthValue,n) AS strengthValue,
    array_index(strengthUnits,n) AS strengthUnits
from medications_info7 lateral view numeric_range(size(statusCode)) n1 as n;

Error:

Caused by: java.lang.IndexOutOfBoundsException: Index: 7, Size: 7
    at java.util.ArrayList.rangeCheck(ArrayList.java:635)
    at java.util.ArrayList.get(ArrayList.java:411)
    at com.ibm.spss.hive.serde2.xml.objectinspector.XmlListObjectInspector.getListElement(XmlListObjectInspector.java:79)
    at brickhouse.udf.collect.ArrayIndexUDF.evaluate(ArrayIndexUDF.java:59)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluatorHead._evaluate(ExprNodeEvaluatorHead.java:44)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)
    ... 25 more

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   HDFS Read: 0   HDFS Write: 0   FAIL
Total MapReduce CPU Time Spent: 0 msec

Sample:

<document>
 <code>10160-0</code>
 <entryInfo> 
    <statusCode>completed</statusCode>
    <startTime>20110729</startTime>
    <endTime>20110822</endTime>
    <strengthValue>24</strengthValue>
    <strengthUnits>h</strengthUnits>
 </entryInfo> 
 <entryInfo>
    <statusCode>completed</statusCode>
    <startTime>20120130</startTime>
    <endTime>20120326</endTime>
    <strengthValue>12</strengthValue>
    <strengthUnits>h</strengthUnits>
 </entryInfo>
 <entryInfo>
    <statusCode>completed</statusCode>
    <startTime>20100412</startTime>
    <endTime>20110822</endTime>
    <strengthValue>8</strengthValue>
    <strengthUnits>d</strengthUnits>
 </entryInfo>  
</document>

My actual sample is huge and contains many of these repeated tags.

Solution

I don't know what your data looks like in Hive because you didn't provide that information, so here is how I loaded your XML into Hive.

Loader:

ADD JAR /path/to/jar/hivexmlserde-1.0.5.3.jar;

DROP TABLE IF EXISTS db.tbl;
CREATE TABLE IF NOT EXISTS db.tbl (
  code STRING,
  entryInfo ARRAY<MAP<STRING,STRING>>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerde'
WITH SERDEPROPERTIES (
  "column.xpath.code"="/document/code/text()",
  "column.xpath.entryInfo"="/document/entryInfo/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  "xmlinput.start"="<document>",
  "xmlinput.end"="</document>"
);

LOAD DATA LOCAL INPATH 'someFile.xml' INTO TABLE db.tbl;

In the Hive-XML-SerDe documentation, under section 3 - Arrays, you can see that they use an array structure to handle repeated tags, and under 4 - Maps, you can see that they use maps to handle the entries under a sub-tag. So entryInfo will be of type ARRAY<MAP<STRING,STRING>>.
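As a quick sanity check (not part of the original answer; this assumes the Loader above was run against the sample XML, and the map rendering is only approximate), the XPath /document/entryInfo/* appears to flatten every child tag into its own single-entry map, which is what the grouping logic in the query below relies on. The sample document would then load as one row whose entryInfo array has 15 elements (3 entryInfo blocks x 5 tags):

SELECT code
  , SIZE(entryInfo) AS n_elements   -- expected: 15 for the sample document
  , entryInfo[0]    AS first_elem   -- expected: a single-entry map
  , entryInfo[1]    AS second_elem
FROM db.tbl;

-- Roughly:
-- 10160-0    15    {"statusCode":"completed"}    {"startTime":"20110729"}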

You can then explode this array, collect like keys/values, and re-combine them.
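To see what the innermost LATERAL VIEW step produces (again just a sketch under the same single-entry-map assumption, reusing the aliases from the query below), POSEXPLODE emits one row per array element together with its position tmp, and FLOOR(tmp / 5) maps every block of five consecutive positions back to one original entryInfo:

SELECT code
  , tmp                             -- array position, 0..14 for the sample document
  , entry_info_map                  -- one single-entry map per child tag
  , FLOOR(tmp / 5) AS idx           -- group id: which entryInfo block the row came from
FROM db.tbl
LATERAL VIEW POSEXPLODE(entryInfo) exptbl AS tmp, entry_info_map;

-- Roughly:
-- 10160-0     0    {"statusCode":"completed"}    0
-- 10160-0     1    {"startTime":"20110729"}      0
-- ...
-- 10160-0     5    {"statusCode":"completed"}    1
-- ...
-- 10160-0    14    {"strengthUnits":"d"}         2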

Query:

ADD JAR /path/to/jar/hivexmlserde-1.0.5.3.jar;
ADD JAR /path/to/jar/brickhouse-0.7.1.jar;

CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';

SELECT code
  , m_map['statusCode']    AS status_code
  , m_map['startTime']     AS start_time
  , m_map['endTime']       AS end_time
  , m_map['strengthValue'] AS strength_value
  , m_map['strengthUnits'] AS strength_units
FROM (
  SELECT code
    , COLLECT(m_keys, m_vals) AS m_map
  FROM (
    SELECT code
      , idx
      , MAP_KEYS(entry_info_map)[0]   AS m_keys
      , MAP_VALUES(entry_info_map)[0] AS m_vals
    FROM (
      SELECT code
        , entry_info_map
        , CASE
           WHEN FLOOR(tmp / 5) = 0 THEN 0
           WHEN FLOOR(tmp / 5) = 1 THEN 1
           WHEN FLOOR(tmp / 5) = 2 THEN 2
           ELSE -1
         END AS idx
      FROM db.tbl
      LATERAL VIEW POSEXPLODE(entryInfo) exptbl AS tmp, entry_info_map ) x ) y
  GROUP BY code, idx ) z;

Output:

code    status_code     start_time      end_time    strength_value  strength_units
10160-0 completed       20110729        20110822    24              h
10160-0 completed       20120130        20120326    12              h
10160-0 completed       20100412        20110822    8               d

Also, you've basically asked this question 4 times (one, two, three, four). This is not a good idea. Just ask once, edit to add more information, and be patient.
