Hive UDF for selecting all except some columns
The common query-building pattern in HiveQL (and SQL in general) is to select either all columns (SELECT *) or an explicitly specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns.
There are various mechanisms for excluding some columns, as outlined in this SO question, but none applies naturally to HiveQL. (For example, the idea of creating a temporary table with SELECT * and then using ALTER TABLE DROP on some of its columns would wreak havoc in a big data environment.)
Ignoring the ideological discussion about whether it is a good idea to select all but some columns, this question is about the possible ways to extend Hive with this capability.
Prior to Hive 0.13.0, SELECT could take regular-expression-based column specifications, e.g., property_.* inside a backtick-quoted string. @invoketheshell's answer below refers to this capability, but it comes at a cost: when the capability is on, Hive cannot accept columns with non-standard characters in them, e.g., $foo or x/y. That's why the Hive developers turned this behavior off by default in 0.13.0. I am looking for a generic solution that works for any column name.
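For illustration, the regex-based column selection described above might look like the following sketch. The table name events and the column other_col are hypothetical; the property hive.support.quoted.identifiers and its values none and column are Hive configuration settings, with column being the default from 0.13.0 onward.

```sql
-- Hypothetical table `events` with columns property_a, property_b, other_col.
-- On Hive 0.13.0 and later, regex column specifications must be enabled first.
set hive.support.quoted.identifiers=none;

-- The backtick-quoted text is treated as a Java regex over column names,
-- so this returns property_a and property_b but not other_col.
select `property_.*`
from events;

-- Restore the default so backticks quote literal identifiers again,
-- which is what column names such as $foo or x/y require.
set hive.support.quoted.identifiers=column;
```

The two modes are mutually exclusive within a session, which is the trade-off mentioned above: while the setting is none, a backtick-quoted name is read as a pattern rather than as a literal identifier.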
A generic table-generating UDF (UDTF) could certainly do this because it can manipulate the schema. Since we are not going to generate new rows, is there a way to solve this problem using a simple row-based UDF?
This seems like a common problem, with many posts around the Web showing how to solve it for various databases, yet I haven't been able to find a solution for Hive. Is there code somewhere that does this?
You can select every column except those listed in a regex-based specification, that is, query columns by exclusion. See below:
A SELECT statement can take regex-based column specification in Hive releases prior to 0.13.0, or in 0.13.0 and later releases if the configuration property hive.support.quoted.identifiers is set to none.
That being said, you could create a new table or view using the following, and all columns except the specified ones will be returned:
set hive.support.quoted.identifiers=none;
drop table if exists database.table_name;
create table if not exists database.table_name as
select `(column_to_remove_1|...|column_to_remove_N)?+.+`
from database.some_table
where
--...
;
This will create a table that has all the columns from some_table except those named column_to_remove_1 through column_to_remove_N. You can also choose to create a view instead.
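As a rough sketch of the view variant, assuming a hypothetical database my_db, a hypothetical view name some_table_trimmed, and two hypothetical columns to drop (none of these names come from the original answer):

```sql
-- Enable regex column specifications (required on Hive 0.13.0 and later).
set hive.support.quoted.identifiers=none;

-- Hypothetical names throughout: my_db, some_table_trimmed, col_a, col_b.
-- The pattern is a Java regex matched against each whole column name:
--   (col_a|col_b)?+  possessively consumes the name if it is col_a or col_b,
--   .+               then needs at least one more character, so an exact
--                    match on an excluded name fails and that column is dropped.
-- Any other column name, including one like col_a_total, still matches.
create view if not exists my_db.some_table_trimmed as
select `(col_a|col_b)?+.+`
from my_db.some_table;
```

A view avoids copying the data, but it inherits the limitation raised in the question: with the setting at none, column names containing non-standard characters cannot be referenced with backticks in the same session.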