Hive UDF for selecting all except some columns

Question

The common query building pattern in HiveQL (and SQL in general) is to either select all columns (SELECT *) or an explicitly-specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns.

There are various mechanisms for excluding some columns, as outlined in this SO question, but none applies naturally to HiveQL. (For example, the idea of creating a temporary table with SELECT * and then using ALTER TABLE DROP on some of its columns would wreak havoc in a big data environment.)

Ignoring the ideological discussion about whether it is a good idea to select all but some columns, this question is about the possible ways to extend Hive with this capability.

Prior to Hive 0.13.0 SELECT could take regular-expression-based columns, e.g., property_.* inside a backtick-quoted string. @invoketheshell's answer below refers to this capability but it comes at a cost, which is that, when this capability is on, Hive cannot accept columns with non-standard characters in them, e.g., $foo or x/y. That's why the Hive developers turned this behavior off by default in 0.13.0. I am looking for a generic solution that works for any column name.

A generic table-generating UDF (UDTF) could certainly do this because it can manipulate the schema. Since we are not going to generate new rows, is there a way to solve this problem using a simple row-based UDF?

This seems like a common problem, with many posts around the Web showing how to solve it for various databases, yet I haven't been able to find a solution for Hive. Is there code somewhere that does this?

Recommended answer

You can select every column except those listed in a regex-based specification; in effect, this queries columns by exclusion. See below:

A SELECT statement can take regex-based column specification in Hive releases prior to 0.13.0, or in 0.13.0 and later releases if the configuration property hive.support.quoted.identifiers is set to none.
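As a minimal sketch of that setting, assuming a hypothetical table named sales with columns id, amount, and internal_note (none of these names come from the question), a session-level query might look like the following. The backtick-quoted string is a Java regular expression; as far as I understand, Hive tests it against each whole column name, so the possessive ?+ keeps the engine from giving back the excluded name and the trailing .+ then has nothing left to match.

```sql
-- Hypothetical table and column names, for illustration only.
-- On Hive 0.13.0 and later, this makes backtick-quoted strings in SELECT
-- be read as regular expressions instead of quoted identifiers.
set hive.support.quoted.identifiers=none;

-- Intended to return every column of sales except internal_note.
select `(internal_note)?+.+`
from sales;
```

Note that a column whose name merely starts with an excluded name (for example internal_note_2) should still be returned, because the .+ has something left to match.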

That being said, you could create a new table or view using the following, and all columns except the ones specified will be returned:

set hive.support.quoted.identifiers=none;

drop table if exists database.table_name;
create table if not exists database.table_name as
    select `(column_to_remove_1|...|column_to_remove_N)?+.+`
    from database.some_table
    where
    --...
;

This will create a table that has all the columns from some_table except those named column_to_remove_1 through column_to_remove_N. You can also choose to create a view instead.
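A rough sketch of the view variant, assuming hypothetical names my_db.some_table_trimmed, my_db.some_table, and two excluded columns; I have not verified it against a specific Hive release. My understanding is that the column list is expanded when the view is created, so columns added to the base table later would not appear in the view.

```sql
-- Hypothetical database, view, and column names, for illustration only.
set hive.support.quoted.identifiers=none;

-- Define a view exposing every column of some_table except the two listed.
create view if not exists my_db.some_table_trimmed as
    select `(column_to_remove_1|column_to_remove_2)?+.+`
    from my_db.some_table;
```

A view avoids copying the data, which matters for large tables, at the cost of re-running the underlying select on each query.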
