Transforming Spark SQL AST with extraOptimizations


Question

I want to take a SQL string as user input, then transform it before execution. In particular, I want to modify the top-level projection (the select clause), injecting additional columns to be retrieved by the query.

I was hoping to achieve this by hooking into Catalyst using sparkSession.experimental.extraOptimizations. I know that what I'm attempting isn't strictly speaking an optimisation (the transformation changes the semantics of the SQL statement), but the API still seems suitable. However, my transformation seems to be ignored by the query executor.

Here is a minimal example to illustrate the issue I'm having. First define a row case class:

case class TestRow(a: Int, b: Int, c: Int)

Then define an optimisation rule which simply discards any projection:

import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

object RemoveProjectOptimisationRule extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
        // Replace every Project node with its child, discarding the projection.
        case x: Project => x.child
    }
}

Now create a dataset, register the optimisation, and run a SQL query:

// Needed for the toDS() conversion below.
import sparkSession.implicits._

// Create a dataset and register a temp view.
val dataset = List(TestRow(1, 2, 3)).toDS()
val tableName: String = "testtable"
dataset.createOrReplaceTempView(tableName)

// Register "optimisation".
sparkSession.experimental.extraOptimizations =  
    Seq(RemoveProjectOptimisationRule)

// Run query.
val projected = sparkSession.sql("SELECT a FROM " + tableName + " WHERE a = 1")

// Print query result and the queryExecution object.
println("Query result:")
projected.collect.foreach(println)
println(projected.queryExecution)

Here is the output:

Query result: 
[1]

== Parsed Logical Plan ==
'Project ['a]
+- 'Filter ('a = 1)
   +- 'UnresolvedRelation `testtable`

== Analyzed Logical Plan ==
a: int
Project [a#3]
+- Filter (a#3 = 1)
   +- SubqueryAlias testtable
      +- LocalRelation [a#3, b#4, c#5]

== Optimized Logical Plan ==
Filter (a#3 = 1)
+- LocalRelation [a#3, b#4, c#5]

== Physical Plan ==
*Filter (a#3 = 1)
+- LocalTableScan [a#3, b#4, c#5]

We see that the result is identical to that of the original SQL statement, without the transformation applied. Yet, when printing the logical and physical plans, the projection has indeed been removed. I've also confirmed (through debug log output) that the transformation is indeed being invoked.

Any suggestions as to what's going on here? Maybe the optimiser simply ignores "optimisations" that change semantics?

If using the optimisations isn't the way to go, can anybody suggest an alternative? All I really want to do is parse the input SQL statement, transform it, and pass the transformed AST to Spark for execution. But as far as I can see, the APIs for doing this are private to the Spark sql package. It may be possible to use reflection, but I'd like to avoid that.

Any pointers would be greatly appreciated.
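[Editor's note: one workaround that avoids private APIs entirely is to let Spark parse the user's SQL and then inject the extra columns at the DataFrame level. This is a sketch, not part of the original post; `userSql` and the `injected` column are illustrative names.]

```scala
import org.apache.spark.sql.functions.lit

// Let Spark parse the user's SQL as-is, then append columns to the
// resulting DataFrame instead of rewriting the AST.
val userSql = "SELECT a FROM testtable WHERE a = 1"
val withExtras = sparkSession.sql(userSql)
  .withColumn("injected", lit(42)) // extra column added to the projection

withExtras.show()
```

This only covers appending to the top-level projection; it cannot restructure the query the way a true plan rewrite could.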

Answer

As you guessed, this fails because we assume the optimizer will not change the results of the query.

Specifically, we cache the schema that comes out of the analyzer (and assume the optimizer does not change it). When translating rows to the external format, we use this schema, and thus truncate the columns in the result. If your rule did more than truncate (i.e. changed datatypes), this might even crash.
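[Editor's note: this cached-schema behaviour can be observed directly by comparing the output schemas of the analyzed and optimized plans, reusing the `projected` Dataset from the question. A sketch:]

```scala
// The analyzed plan's schema is what row conversion uses: just column `a`.
println(projected.queryExecution.analyzed.schema)

// The optimized plan (with the Project removed by the rule above) exposes
// all three columns, but the extras are truncated away during conversion.
println(projected.queryExecution.optimizedPlan.schema)
```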

As you can see in this notebook, it is in fact producing the result you would expect under the covers. We are planning to open up more hooks at some point in the near future that would let you modify the plan at other phases of query execution. See SPARK-18127 for more details.
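[Editor's note: SPARK-18127 has since shipped, as I understand it in Spark 2.2, as `SparkSessionExtensions`. A sketch of injecting the question's rule at resolution time, before the analyzer's output schema is cached, assuming `RemoveProjectOptimisationRule` from above is in scope:]

```scala
import org.apache.spark.sql.SparkSession

// Rules injected at resolution time run inside the analyzer, so the
// rewritten plan's schema is the one the rest of the pipeline sees.
val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions { extensions =>
    extensions.injectResolutionRule { session => RemoveProjectOptimisationRule }
  }
  .getOrCreate()
```

With this hook the semantics-changing rewrite is applied before the schema is fixed, avoiding the truncation described above.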
