Hive: How to do a SELECT query to output a unique primary key using HiveQL?
I have a dataset with the following schema, which I want to transform into a table that can be exported to SQL. I am using Hive. The input is as follows:
call_id,stat1,stat2,stat3
1,a,b,c,
2,x,y,z,
3,d,e,f,
1,j,k,l,
The output table needs to have call_id as its primary key, so call_id must be unique. The output schema should be:
call_id,stat2,stat3,
1,b,c, or (1,k,l)
2,y,z,
3,e,f,
The problem is that when I use the keyword DISTINCT in the Hive query, DISTINCT applies to all the columns combined. I want to apply the DISTINCT operation only to call_id. Something along the lines of:
SELECT DISTINCT(call_id), stat2,stat3 from intable;
However, this is not valid in Hive (I am not well-versed in SQL either).
The only legal query seems to be
SELECT DISTINCT call_id, stat2,stat3 from intable;
But this returns multiple rows with the same call_id, because the other columns differ and each row as a whole is distinct.
NOTE: There is no arithmetic relation between a,b,c,x,y,z, etc. So any trick of averaging or summing is not viable.
Any ideas on how I can do this?
One quick idea, not the best one, but it will do the job:
hive> create table temp1(a int, b string);
hive> insert overwrite table temp1
select call_id, max(concat(stat1,'|',stat2,'|',stat3)) from intable group by call_id;
-- split() takes a regular expression, so the pipe is escaped as '\\|'
hive> insert overwrite table intable
select a, split(b,'\\|')[0], split(b,'\\|')[1], split(b,'\\|')[2] from temp1;
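As an alternative to concatenating and splitting the columns, a window function can keep exactly one row per call_id without a temporary table. The following is a minimal sketch, assuming a Hive version with windowing support (0.11 or later). The target table name outtable and the aliases ranked and rn are hypothetical and not part of the original post, and ordering by stat1 is an arbitrary choice of tie-breaker.

```sql
-- Sketch only: outtable, ranked and rn are illustrative names, not from the post.
-- Number the rows within each call_id, then keep only the first one.
CREATE TABLE outtable AS
SELECT call_id, stat2, stat3
FROM (
  SELECT call_id, stat2, stat3,
         ROW_NUMBER() OVER (PARTITION BY call_id ORDER BY stat1) AS rn
  FROM intable
) ranked
WHERE rn = 1;
```

With the sample input, ordering by stat1 keeps the row 1,b,c for call_id 1, since 'a' sorts before 'j'. Any other column or direction in the ORDER BY works equally well, because the question accepts either row; if two rows tie on the ordering column, which one survives is not guaranteed.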