SparkSQL : Can I explode two different variables in the same query?


Question

I have the following explode query, which works fine:

data1 = sqlContext.sql("select explode(names) as name from data")

I want to explode another field "colors", so the final output could be the cartesian product of names and colors. So I did:

data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")

But I got the error:

 Only one generator allowed per select but Generate and Explode found.;

Does anyone have any idea?

I can actually make it work by doing two steps:

   data1 = sqlContext.sql("select explode(names) as name, colors from data")
   data1.registerTempTable('data1')
   data1 = sqlContext.sql("select name, explode(colors) as color from data1")

But I am wondering if it is possible to do it in one step? Thanks a lot!

Answer

The correct syntax is:

select name, color 
from data 
lateral view explode(names) exploded_names as name 
lateral view explode(colors) exploded_colors as color
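Applied to the question's data, the two LATERAL VIEW clauses behave like nested loops over the two array columns. A minimal pure-Python sketch of the rows this query produces, using hypothetical sample data:

```python
from itertools import product

# Hypothetical sample row of the `data` table with two array columns.
data = [{"names": ["alice", "bob"], "colors": ["red", "blue"]}]

# Each LATERAL VIEW explode(...) acts like a nested loop, so two of them
# yield the per-row cartesian product of the two arrays.
result = [
    {"name": n, "color": c}
    for row in data
    for n, c in product(row["names"], row["colors"])
]
```

For the sample row above this produces 2 × 2 = 4 output rows, which is exactly the cartesian product the question asks for.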

The reason why Rashid's answer did not work is that it did not "name" the table generated by LATERAL VIEW.

Think of it this way: LATERAL VIEW works like an implicit JOIN with an ephemeral table created for every row from the structs in the collection being "viewed". So, the way to parse the syntax is:

LATERAL VIEW table_generation_function(collection_column) table_name AS col1, ...

Multiple output columns

If you use a table generating function such as posexplode() then you still have one output table but with multiple output columns:

LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
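The two-column behavior of posexplode() can be sketched in plain Python, where it corresponds to enumerate() over a hypothetical array column:

```python
# posexplode() behaves like Python's enumerate(): one generated table,
# two output columns (the element's position and the element itself).
orders = ["o-100", "o-200", "o-300"]  # hypothetical array column
rows = [{"order_number": pos, "order": o} for pos, o in enumerate(orders)]
```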

Nesting

You can also "nest" LATERAL VIEW by repeatedly exploding nested collections, e.g.,

LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
LATERAL VIEW posexplode(order.items) exploded_items AS item_number, item
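The nesting above corresponds to nested loops over a nested collection; a pure-Python sketch with hypothetical order data:

```python
# Nested LATERAL VIEWs correspond to nested enumerate() loops:
# the outer loop walks the orders, the inner loop walks each order's items.
orders = [  # hypothetical nested data: each order holds an items array
    {"items": ["apple", "pear"]},
    {"items": ["milk"]},
]
rows = [
    {"order_number": i, "item_number": j, "item": item}
    for i, order in enumerate(orders)
    for j, item in enumerate(order["items"])
]
```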

Performance considerations

While we are on the topic of LATERAL VIEW it is important to note that using it via SparkSQL is more efficient than using it via the DataFrame DSL, e.g., myDF.explode(). The reason is that SQL can reason accurately about the schema while the DSL API has to perform type conversion between a language type and the dataframe row. What the DSL API loses in terms of performance, however, it gains in flexibility as you can return any supported type from explode, which means that you can perform a more complicated transformation in one step.

In recent versions of Spark, row-level explode via df.explode() has been deprecated in favor of column-level explode via df.select(..., explode(...).as(...)). There is also an explode_outer(), which will produce output rows even if the input to be exploded is null. Column-level exploding does not suffer from the performance issues of row-level exploding mentioned above as Spark can perform the transformation entirely using internal row data representations.
