Hdp,Hive,横向视图和null:消失的行 [英] Hdp, Hive, Lateral view and null: disappearing rows

查看:131
本文介绍了Hdp,Hive,横向视图和null:消失的行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

由于从hdp 3.1.0升级到3.1.4,我在Hive中遇到了一些我不理解的问题.请注意,我仅使用ORC事务表.

Since the upgrade from hdp 3.1.0 to 3.1.4, I have some issue in Hive I do not understand. Note that I am only using ORC transactional tables.

例如以下查询:

with cte as (
  select
      e.id
    , '{}' as json
  from event e
)
-- select count(*) from cte
select 
    id
  , lv.customfield
from cte
lateral view outer
  json_tuple(cte.json, 'customfield') cv AS `customfield`

在升级之前,它运行良好.

It worked perfectly before the upgrade.

现在,即使CTE返回一定数量的行,使用横向视图也只会从结果集中删除行,而不会出现任何错误,而CTE之外没有多余的where子句(在我的实际示例中,查询返回66行(没有侧面视图,但只有19行).

Now, even if the CTE returns a certain number of rows, using the lateral view will just drop rows from the resultset, without any error, whereas there is no extra where clause outside the CTE (in my real example, the query returns 66 rows without the lateral view, but only 19 with).

就我而言,我有:

  • select count(*)给我66行
  • 添加静态字符串的侧面视图 时,我只有19行.
  • select count(*) give me 66 rows
  • when the lateral view on a static string is added, I only get 19 rows.

我尝试了很多变化:

  • 如果我用静态CTE(select stack(1, ...))替换事件表,我会得到预期的结果
  • 如果我删除侧面视图,则可以得到预期的行数(只要不使用,则与之不同)
  • 如果我创建并使用临时表而不是CTE,结果不会改变.
  • 如果我将json_tuple(cte.json, 'customfield')放在CTE之外的选择部分中(并且没有其他内容,因为那将是无效的),没有lateral view,我就有了预期的行数,
  • 如果我在CTE外部的选定零件中使用get_json_object(并且没有侧面视图),那么我会得到预期的结果.
  • 当然,配置单元(服务器或元存储)日志中没有任何内容.
  • 作为补充,自升级以来,merge语句[保持生成重复项] [1],而之前效果很好.
  • if I replace the event table by a static CTE (select stack(1, ...)) I have the result I expect
  • if I remove the lateral view, I have the number of rows I expect (as long as I do not use is distinct from)
  • if instead of a CTE I create and use a temporary table, the outcome does not change.
  • if I put json_tuple(cte.json, 'customfield') in the select part outside the CTE (and nothing else as it would not be valid), without the lateral view, I have the number of expected rows,
  • If I use get_json_object in the select part outside the CTE (and no lateral view) I have the expected results.
  • of course, there is nothing in the hive (server or metastore) logs.
  • as a side note, since the upgrade a merge statement [keeps generating duplicates][1], whereas it worked perfectly before.

另一个非常令人惊讶的事情是CTE内部有一个if语句,例如:if(is_deleted is null, 'true', 'false'). 如果我将is null替换为is not distinct from null,这应该是完全有效的,那么CTE不会返回任何行.

Another extremely surprising thing is that inside the CTE there is an if statement, for instance: if(is_deleted is null, 'true', 'false'). If I replace the is null with is not distinct from null, which should be perfectly valid, no rows are returned by the CTE.

我完全不知所措,我不知道为什么会这样以及如何信任蜂巢.

I am completely at loss and I have no idea why this happens and how I can trust hive. 

我无法通过生成手动数据来复制错误,因此我无法给出一个(不可行的)示例.

I cannot replicate the error by generating manual data so I cannot give a (not) working example.

推荐答案

我尚不了解的实际原因,但是我可以找出问题所在,并可以提交错误报告:

The actual reason I do not understand yet, but I could isolate the problem and could actually submit a bug report: https://issues.apache.org/jira/browse/HIVE-22500

简而言之,如果涉及排序依据(隐式或显式),则小于或等于隐式字符串转换为时间戳将失败.

In short, a lesser than or equals with implicit string conversion to timestamp fails if a sort by (implicit or explicit) is involved.

-- valid result
select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00') t;
-- invalid result
select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00' sort by id) t;

您可以查看有关完整设置或其他示例的错误报告.解决方法是将字符串显式转换为时间戳.

You can see the bug report for full set up or other examples. The workaround is to explicitly cast the string to a timestamp.

这篇关于Hdp,Hive,横向视图和null:消失的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆