Hive:在列名中使用 Concat [英] Hive: Usage of Concat in Column Name

查看:43
本文介绍了Hive:在列名中使用 Concat的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从列名为:year_2016、year_2017、year_2018 等的表中获取数据.我不确定如何从该表中获取数据.数据如下:

I am trying to get data from a table that has column name as: year_2016, year_2017, year_2018 etc. I am not sure how to get the data from this table. The data looks like:

| count_of_accidents | year_2016 | year_2017 |year_2018 | 
|--------------------|-----------|-----------|----------|
| 15                 | 12        | 5         | 1        |
| 5                  | 10        | 6         | 18       |

我已经尝试过 'concat' 函数,但这并没有真正起作用.

I have tried 'concat' function but this doesn't really work.

我已经尝试过:

select SUM( count_of_accidents * concat('year_',year(regexp_replace('2018_1_1','_','-')))) 
from table_name;

列名(year_2017 或 year_2018 等)将作为参数传递.所以,我真的无法像这样对列名进行硬编码 -

The column name (year_2017 or year_2018 etc) will be passed as a parameter. So, I am not really able to hardcode the column name like this-

select SUM( count_of_accidents * year_2018) from table_name;

有什么办法可以做到这一点吗?

Is there any way I can do this?

推荐答案

你可以使用 正则表达式.像这样:

You can do it using regular expressions. Like this:

--create test table
create table test_col(year_2018 string, year_2019 string);

set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;

--test select using hard-coded pattern
select year_2018, `(year_)2019` from test_col;
OK
year_2018       year_2019
Time taken: 0.862 seconds

--test pattern parameter
set hivevar:year_param=2019;

select year_2018, `(year_)${year_param}` from test_col;
OK
year_2018       year_2019
Time taken: 0.945 seconds

--two parameters
set hivevar:year_param1=2018;
set hivevar:year_param2=2019;

select `(year_)${year_param1}`, `(year_)${year_param2}`  from test_col t;
OK
year_2018       year_2019
Time taken: 0.159 seconds

--parameter contains full column_name and using more strict regexp pattern
set hivevar:year_param2=year_2019;

select `^${year_param2}$` from test_col t;
OK
year_2019
Time taken: 0.053 seconds

--select all columns using single pattern year_ and four digits
select `^year_[0-9]{4}$`  from test_col t;
OK
year_2018       year_2019

应计算参数并将其传递给 hive 脚本,列名中不支持 concat()、regexp_replace 等函数.

Parameter should be calculated and passed to the hive script, no functions like concat(), regexp_replace are supported in the column names.

列别名也不适用于使用正则表达式提取的列:

Also column aliasing does not work for columns extracted using regular expressions:

 select t.number_of_incidents, `^${year_param}$` as year1 from test_t t;

抛出异常:

FAILED: SemanticException [Error 10004]: Line 1:30 Invalid table alias或列引用 '^year_2018$':(可能的列名是:number_of_incidents, year_2016, year_2017, year_2018)

FAILED: SemanticException [Error 10004]: Line 1:30 Invalid table alias or column reference '^year_2018$': (possible column names are: number_of_incidents, year_2016, year_2017, year_2018)

我找到了一种使用 union all 和空数据集对列进行别名的解决方法,请参阅此测试:

I found a workaround to alias a column using union all with empty dataset, see this test:

create table test_t(number_of_incidents int, year_2016 int, year_2017 int, year_2018 int);
insert into table test_t values(15, 12, 5, 1); --insert test data
insert into table test_t values(5,10,6,18);

--parameter, can be passed from outside the script from command line  
set hivevar:year_param=year_2018;

--enable regex columns and print column names
set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;

--Alias column using UNION ALL with empty dataset
select sum(number_of_incidents*year1) incidents_year1 
  from
    (--UNION ALL with empty dataset to alias columns extracted
     select 0 number_of_incidents, 0 year1 where false --returns no rows because of false condition
     union all
     select t.number_of_incidents, `^${year_param}$` from test_t t
    )s;

结果:

OK
incidents_year1
105
Time taken: 38.003 seconds, Fetched: 1 row(s)

UNION ALL 中的第一个查询不会影响数据,因为它不返回任何行.但是它的列名变成了整个 UNION ALL 数据集的名称,可以在上层查询中使用.这个技巧有效.如果您会找到更好的解决方法来使用正则表达式提取列的别名,请同时添加您的解决方案.

First query in the UNION ALL does not affect data because it returns no rows. But it's column names become the names of the whole UNION ALL dataset and can be used in the upper query. This trick works. If you will find a better workaround to alias columns extracted using regexp, please add your solution as well.

更新:

如果您可以将完整的 column_name 作为参数传递,则不需要正则表达式.Hive 在查询执行之前按原样替换变量(不计算它们).仅当由于某种原因无法传递完整的列名时才使用正则表达式,并且在原始查询中需要一些模式连接.看到这个测试:

No need in regular expressions if you can pass full column_name as a parameter. Hive substitutes variables as is (does not calculate them) before query execution. Use regexp only if you can not pass full column name for some reason and like in the original query some pattern concatenation is needed. See this test:

--parameter, can be passed from outside the script from command line  
set hivevar:year_param=year_2018;

select sum(number_of_incidents*${year_param}) incidents_year1 from test_t t;

结果:

OK
incidents_year1
105
Time taken: 63.339 seconds, Fetched: 1 row(s)

这篇关于Hive:在列名中使用 Concat的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆