Scala & Spark: Recycling SQL statements
Question
I spent quite some time coding multiple SQL queries that were formerly used to fetch the data for various R scripts. This is how it worked:
sqlContent = readSQLFile("file1.sql")
sqlContent = setSQLVariables(sqlContent, variables)
results = executeSQL(sqlContent)
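The helpers above are only sketched; a minimal plain-Scala version of the first two might look like this (the `${name}` placeholder syntax is my assumption — the original variable format isn't shown):

```scala
import scala.io.Source

// Hypothetical helper: read a SQL file from the classpath into a string.
def readSQLFile(fileName: String): String =
  Source.fromURL(getClass.getResource("/sql/" + fileName)).mkString

// Hypothetical helper: replace ${name} placeholders with their values.
def setSQLVariables(sql: String, variables: Map[String, String]): String =
  variables.foldLeft(sql) { case (acc, (name, value)) =>
    acc.replace("${" + name + "}", value)
  }
```

For example, `setSQLVariables("SELECT * FROM t WHERE id = ${id}", Map("id" -> "42"))` yields `SELECT * FROM t WHERE id = 42`.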
The catch is that some queries require the result of a prior query, which is why creating VIEWs in the database itself does not solve the problem. With Spark 2.0 I have already figured out a way to do just that:
// create a DataFrame using a JDBC connection to the database
val tableDf = spark.read.jdbc(...)

// generate a unique temp table name for this run
val tempTableName = "TEMP_TABLE" + java.util.UUID.randomUUID.toString.replace("-", "").toUpperCase

// load the SQL file from the classpath, substitute variables,
// and point the query at the temp table
var sqlQuery = Source.fromURL(getClass.getResource("/sql/" + sqlFileName)).mkString
sqlQuery = setSQLVariables(sqlQuery, sqlVariables)
sqlQuery = sqlQuery.replace("OLD_TABLE_NAME", tempTableName)

// register the DataFrame as a temp view and run the query against it
tableDf.createOrReplaceTempView(tempTableName)
val data = spark.sql(sqlQuery)
But this is, in my humble opinion, very fiddly. Also, more complex queries, e.g. queries that incorporate subquery factoring, currently don't work. Is there a more robust way, like re-implementing the SQL code as Spark SQL code using filter($""), .select($""), etc.?
The overall goal is to get multiple org.apache.spark.sql.DataFrames, each representing the result of one former SQL query (which always involves a few JOINs, WITHs, etc.). So n queries lead to n DataFrames.
Is there a better option than the two provided?
Setup: Hadoop v2.7.3, Spark 2.0.0, IntelliJ IDEA 2016.2, Scala 2.11.8, test cluster on a Win7 workstation
Answer
It's not especially clear what your requirement is, but I think you're saying you have queries something like:
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM (SELECT * FROM people LEFT OUTER JOIN places ON ...) WHERE age>20
and you would want to declare and execute these efficiently as
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM <cachedresult> WHERE age>20
To achieve that, I would enhance the input file so each SQL statement has an associated table name into which the result will be stored.
e.g.
PEOPLEPLACES\tSELECT * FROM people LEFT OUTER JOIN places ON ...
ADULTS\tSELECT * FROM PEOPLEPLACES WHERE age>18
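A parser for that input format could be sketched like this (assuming one tab-separated `NAME<TAB>QUERY` pair per non-empty line; the helper name and format details are illustrative — the answer's `parseSqlFile` would read the file and feed its lines through this):

```scala
// Hypothetical parser: split each non-empty line of the input file into
// a (tableName, query) pair, keeping the order of declaration.
def parseSqlLines(lines: Seq[String]): Seq[(String, String)] =
  lines
    .map(_.trim)
    .filter(_.nonEmpty)
    .map { line =>
      val Array(name, query) = line.split("\t", 2)
      (name.trim, query.trim)
    }
```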
Then execute in a loop like:
parseSqlFile().foreach { case (name, query) =>
  val data: DataFrame = execute(query)
  data.createOrReplaceTempView(name)
}
Make sure you declare the queries in order so all required tables have been created. Otherwise, do a little more parsing and sort by dependencies.
In an RDBMS I'd call these tables materialised views, i.e. a transform on other data, like a view, but with the result cached for later reuse.