https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_611_incompatible_changes.html#hive_turnion >>
此问题不仅会影响UNION ALL查询,而且还有一个函数可以对不同数据类型(即,STRING到BIGINT)的列进行比较。
该小组决定我们不想更改基础表元数据。因此解决方案是通过使用CAST()函数强制转换数据来允许潜在的数据丢失。对于UNION ALL,我们将转换为目标表的元数据。但是,在进行比较时,我试图确定最简单,最简单的方法来进行比较而不会得到错误的结果。
问题:
我可以简单地将所有内容强制转换为STRING或VARCHAR( )进行比较时?是否存在任何可能导致错误结果的潜在问题?
更新:
如果此方法存在问题,是否有正确的解决方案?
注意:这是我第一次与Hadoop / HIVE合作,并且我了解到我在RDBMS领域中所了解的一切并不总是适用。
解决方案
您可能会遇到问题。例如,如果将字符串与整数进行比较,则:
-
'1.00'= 1
-> true,因为将值比较为数字
但是作为字符串:
-
'1.00'='1'
->否,因为将值作为字符串进行比较
我认为您会遇到类似的日期问题。
Background:
Our group is going through a Cloudera upgrade to 6.1.1 and I have been tasked with determining how to handle the loss of the implicit data type conversion across data types. See link below for the relevant Release Note details.
https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_611_incompatible_changes.html#hive_union_all_returns_incorrect_data
Not only does this issue affect UNION ALL queries, but there is a function that performs comparisons on columns of different data types (i.e, STRING to BIGINT).
The group has decided that we do not want to change the underlying table meta data. So the solution is to allow for potential data loss by using the CAST() function to cast the data. In the case of UNION ALL, we cast to the destination table's meta data. But, when performing comparisons, I am trying to determine the simplest and easiest way to perform comparisons without getting erroneous results.
Question:
Can I simply cast everything to either STRING or VARCHAR() when performing the comparison? Are there any potential problems that might create incorrect results?
Update:
If there are problems with this approach, is there a correct solution to handle this?
Note: this is my first engagement working with Hadoop/HIVE and I have learned that everything I know in RDBMS land does not always apply.
解决方案
It is possible that you will have problems. For instance, if comparing a string to an int, then:
'1.00' = 1
--> true, because the values are compared as numbers
But as strings:
'1.00' = '1'
--> false, because the values are compared as strings
You can get similar issues with dates, I think.
这篇关于在不同组中的数据类型之间进行比较时,如何处理潜在的数据丢失的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!