在不同组中的数据类型之间进行比较时,如何处理潜在的数据丢失 [英] How to handle potential data loss when performing comparisons across data types in different groups

查看:326
本文介绍了在不同组中的数据类型之间进行比较时,如何处理潜在的数据丢失的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

背景:
我们小组正在将Cloudera升级到6.1.1,我的任务是确定如何处理跨数据类型的隐式数据类型转换的损失。有关相关的发行说明的详细信息,请参见下面的链接。



https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_611_incompatible_changes.html#hive_turnion >>

此问题不仅会影响UNION ALL查询,而且还有一个函数可以对不同数据类型(即,STRING到BIGINT)的列进行比较。



该小组决定我们不想更改基础表元数据。因此解决方案是通过使用CAST()函数强制转换数据来允许潜在的数据丢失。对于UNION ALL,我们将转换为目标表的元数据。但是,在进行比较时,我试图确定最简单,最简单的方法来进行比较而不会得到错误的结果。



问题:



我可以简单地将所有内容强制转换为STRING或VARCHAR( )进行比较时?是否存在任何可能导致错误结果的潜在问题?



更新:
如果此方法存在问题,是否有正确的解决方案?



注意:这是我第一次与Hadoop / HIVE合作,并且我了解到我在RDBMS领域中所了解的一切并不总是适用。

解决方案

您可能会遇到问题。例如,如果将字符串与整数进行比较,则:




  • '1.00'= 1 -> true,因为将值比较为数字



但是作为字符串:




  • '1.00'='1'->否,因为将值作为字符串进行比较



我认为您会遇到类似的日期问题。


Background: Our group is going through a Cloudera upgrade to 6.1.1 and I have been tasked with determining how to handle the loss of the implicit data type conversion across data types. See link below for the relevant Release Note details.

https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_611_incompatible_changes.html#hive_union_all_returns_incorrect_data

Not only does this issue affect UNION ALL queries, but there is a function that performs comparisons on columns of different data types (i.e, STRING to BIGINT).

The group has decided that we do not want to change the underlying table meta data. So the solution is to allow for potential data loss by using the CAST() function to cast the data. In the case of UNION ALL, we cast to the destination table's meta data. But, when performing comparisons, I am trying to determine the simplest and easiest way to perform comparisons without getting erroneous results.

Question:

Can I simply cast everything to either STRING or VARCHAR() when performing the comparison? Are there any potential problems that might create incorrect results?

Update: If there are problems with this approach, is there a correct solution to handle this?

Note: this is my first engagement working with Hadoop/HIVE and I have learned that everything I know in RDBMS land does not always apply.

解决方案

It is possible that you will have problems. For instance, if comparing a string to an int, then:

  • '1.00' = 1 --> true, because the values are compared as numbers

But as strings:

  • '1.00' = '1' --> false, because the values are compared as strings

You can get similar issues with dates, I think.

这篇关于在不同组中的数据类型之间进行比较时,如何处理潜在的数据丢失的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆