Hive UDF for selecting all except some columns


Problem description

The common query building pattern in HiveQL (and SQL in general) is to either select all columns (SELECT *) or an explicitly-specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns.

There are various mechanisms for excluding some columns, as outlined in this SO question, but none apply naturally to HiveQL. (For example, the idea of creating a temporary table with SELECT * and then using ALTER TABLE DROP on some of its columns would wreak havoc in a big data environment.)

Ignoring the ideological discussion about whether it is a good idea to select all but some columns, this question is about the possible ways to extend Hive with this capability.

Prior to Hive 0.13.0, SELECT could take regular-expression-based column specifications, e.g., property_.* inside a backtick-quoted string. @invoketheshell's answer below refers to this capability, but it comes at a cost: when this capability is on, Hive cannot accept columns with non-standard characters in them, e.g., $foo or x/y. That's why the Hive developers turned this behavior off by default in 0.13.0. I am looking for a generic solution that works for any column name.

A generic table-generating UDF (UDTF) could certainly do this because it can manipulate the schema. Since we are not going to generate new rows, is there a way to solve this problem using a simple row-based UDF?

This seems like a common problem, with many posts around the Web showing how to solve it for various databases, yet I haven't been able to find a solution for Hive. Is there code somewhere that does this?

Solution

You can select every column except those listed in a regex-based specification; in other words, you query columns by exclusion. See below:

A SELECT statement can take regex-based column specification in Hive releases prior to 0.13.0, or in 0.13.0 and later releases if the configuration property hive.support.quoted.identifiers is set to none.

That being said, you could create a new table or view using the following, and all the columns except the ones specified will be returned:

-- Treat backtick-quoted strings in SELECT as column regexes
set hive.support.quoted.identifiers=none;

drop table if exists database.table_name;
create table if not exists database.table_name as
    select `(column_to_remove_1|...|column_to_remove_N)?+.+`
    from database.some_table
    where
    --...
;

This will create a table that has all the columns from some_table except those named column_to_remove_1, ..., column_to_remove_N. You can also choose to create a view instead.
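The regex approach above does not cover the column names with non-standard characters mentioned in the question (e.g., $foo or x/y). One possible workaround, which is not part of the answer above, is to build the explicit column list outside Hive and submit the generated query. The sketch below is a minimal illustration in Python: the helper names quote_identifier and select_all_except are hypothetical, the column list is assumed to have been fetched beforehand (for example from DESCRIBE output), and the generated query assumes the default hive.support.quoted.identifiers=column mode, in which backticks quote a literal column name and a doubled backtick stands for one backtick.

```python
from typing import Iterable, List


def quote_identifier(name: str) -> str:
    """Backtick-quote a Hive column name, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def select_all_except(table: str, columns: List[str], excluded: Iterable[str]) -> str:
    """Build a SELECT listing every entry of `columns` that is not in `excluded`.

    `table` is inserted as given (hypothetical helper, no validation of it).
    Hive column names are case-insensitive, so names are compared in lower case.
    """
    excluded_lower = {name.lower() for name in excluded}
    unknown = excluded_lower - {name.lower() for name in columns}
    if unknown:
        # Fail loudly on a misspelled exclusion instead of silently keeping the column.
        raise ValueError("columns not in table: %s" % sorted(unknown))
    kept = [c for c in columns if c.lower() not in excluded_lower]
    if not kept:
        raise ValueError("every column was excluded")
    return "SELECT " + ", ".join(quote_identifier(c) for c in kept) + " FROM " + table


if __name__ == "__main__":
    # Example column list; in practice it would come from the table's metadata.
    print(select_all_except("some_table", ["id", "$foo", "x/y", "secret"], ["secret"]))
    # prints: SELECT `id`, `$foo`, `x/y` FROM some_table
```

Unlike the regex form, this needs the column list up front, so the generated query has to be rebuilt whenever the table schema changes.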
