SparkSQL: Can I explode two different variables in the same query?
Question
I have the following explode query, which works fine:
data1 = sqlContext.sql("select explode(names) as name from data")
I want to explode another field "colors", so the final output would be the cartesian product of names and colors. So I did:
data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")
But I got this error:
Only one generator allowed per select but Generate and and Explode found.;
Does anyone have any idea?
I can actually make it work by doing it in two steps:
data1 = sqlContext.sql("select explode(names) as name from data")
data1.registerTempTable('data1')
data1 = sqlContext.sql("select explode(colors) as color from data1")
But I am wondering if it is possible to do it in one step? Thanks a lot!
Answer
The correct syntax is:
select name, color
from data
lateral view explode(names) exploded_names as name
lateral view explode(colors) exploded_colors as color
The reason why Rashid's answer did not work is that it did not "name" the table generated by LATERAL VIEW.
Think of it this way: LATERAL VIEW works like an implicit JOIN with an ephemeral table created for every row from the structs in the collection being "viewed". So, the way to parse the syntax is:
LATERAL VIEW table_generation_function(collection_column) table_name AS col1, ...
Multiple output columns
If you use a table generating function such as posexplode() then you still have one output table but with multiple output columns:
LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
Nesting
You can also "nest" LATERAL VIEW by repeatedly exploding nested collections, e.g.,
LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
LATERAL VIEW posexplode(order.items) exploded_items AS item_number, item
Performance considerations
While we are on the topic of LATERAL VIEW, it is important to note that using it via SparkSQL is more efficient than using it via the DataFrame DSL, e.g., myDF.explode(). The reason is that SQL can reason accurately about the schema, while the DSL API has to perform type conversion between a language type and the dataframe row. What the DSL API loses in terms of performance, however, it gains in flexibility, as you can return any supported type from explode, which means that you can perform a more complicated transformation in one step.
In recent versions of Spark, row-level explode via df.explode() has been deprecated in favor of column-level explode via df.select(..., explode(...).as(...)). There is also an explode_outer(), which will produce output rows even if the input to be exploded is null. Column-level exploding does not suffer from the performance issues of row-level exploding mentioned above, as Spark can perform the transformation entirely using internal row data representations.