蜂巢:列名称中Concat的用法 [英] Hive: Usage of Concat in Column Name

查看:49
本文介绍了蜂巢:列名称中Concat的用法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从具有列名称为year_2016,year_2017,year_2018等的表中获取数据.我不确定如何从该表中获取数据.数据如下:

I am trying to get data from a table that has column name as: year_2016, year_2017, year_2018 etc. I am not sure how to get the data from this table. The data looks like:

| count_of_accidents | year_2016 | year_2017 |year_2018 | 
|--------------------|-----------|-----------|----------|
| 15                 | 12        | 5         | 1        |
| 5                  | 10        | 6         | 18       |

我已经尝试过'concat'功能,但这并不能真正起作用.

I have tried 'concat' function but this doesn't really work.

我已经尝试过:

select SUM( count_of_accidents * concat('year_',year(regexp_replace('2018_1_1','_','-')))) 
from table_name;

列名(year_2017或year_2018等)将作为参数传递.因此,我真的无法像这样对列名称进行硬编码-

The column name (year_2017 or year_2018 etc) will be passed as a parameter. So, I am not really able to hardcode the column name like this-

select SUM( count_of_accidents * year_2018) from table_name;

有什么办法可以做到这一点?

Is there any way I can do this?

推荐答案

您可以使用应该计算参数并将其传递给hive脚本,列名中不支持诸如concat(),regexp_replace之类的函数.

Parameter should be calculated and passed to the hive script, no functions like concat(), regexp_replace are supported in the column names.

列别名也不适用于使用正则表达式提取的列:

Also column aliasing does not work for columns extracted using regular expressions:

 select t.number_of_incidents, `^${year_param}$` as year1 from test_t t;

引发异常:

失败:SemanticException [错误10004]:行1:30无效的表别名或列引用" ^ year_2018 $ " :(可能的列名称为:事件数,年份2016,年份2017,年份2018)

FAILED: SemanticException [Error 10004]: Line 1:30 Invalid table alias or column reference '^year_2018$': (possible column names are: number_of_incidents, year_2016, year_2017, year_2018)

我找到了一种解决方法,可以使用带有空白数据集的并集对列进行别名,请参见此测试:

I found a workaround to alias a column using union all with empty dataset, see this test:

create table test_t(number_of_incidents int, year_2016 int, year_2017 int, year_2018 int);
insert into table test_t values(15, 12, 5, 1); --insert test data
insert into table test_t values(5,10,6,18);

--parameter, can be passed from outside the script from command line  
set hivevar:year_param=year_2018;

--enable regex columns and print column names
set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;

--Alias column using UNION ALL with empty dataset
select sum(number_of_incidents*year1) incidents_year1 
  from
    (--UNION ALL with empty dataset to alias columns extracted
     select 0 number_of_incidents, 0 year1 where false --returns no rows because of false condition
     union all
     select t.number_of_incidents, `^${year_param}$` from test_t t
    )s;

结果:

OK
incidents_year1
105
Time taken: 38.003 seconds, Fetched: 1 row(s)

UNION ALL 中的第一个查询不会影响数据,因为它不返回任何行.但是它的列名称成为整个UNION ALL数据集的名称,并且可以在上层查询中使用.这个技巧有效.如果您发现使用regexp提取的别名列的别名更好的解决方法,请同时添加解决方案.

First query in the UNION ALL does not affect data because it returns no rows. But it's column names become the names of the whole UNION ALL dataset and can be used in the upper query. This trick works. If you will find a better workaround to alias columns extracted using regexp, please add your solution as well.

更新:

如果可以传递完整的column_name作为参数,则不需要正则表达式.Hive在执行查询之前按原样替换变量(不计算变量).仅当由于某些原因而无法传递完整的列名,并且像在原始查询中那样需要某些模式串联时,才使用regexp.查看此测试:

No need in regular expressions if you can pass full column_name as a parameter. Hive substitutes variables as is (does not calculate them) before query execution. Use regexp only if you can not pass full column name for some reason and like in the original query some pattern concatenation is needed. See this test:

--parameter, can be passed from outside the script from command line  
set hivevar:year_param=year_2018;

select sum(number_of_incidents*${year_param}) incidents_year1 from test_t t;

结果:

OK
incidents_year1
105
Time taken: 63.339 seconds, Fetched: 1 row(s)

这篇关于蜂巢:列名称中Concat的用法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆