Scala & Spark: Recycling SQL statements
Question
I spent quite some time coding multiple SQL queries that were formerly used to fetch data for various R scripts. This is how it worked:
sqlContent = readSQLFile("file1.sql")
sqlContent = setSQLVariables(sqlContent, variables)
results = executeSQL(sqlContent)
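The helpers in the snippet above (readSQLFile, setSQLVariables, executeSQL) are not shown in the post. A minimal sketch of setSQLVariables, assuming placeholders of the form ${name} in the SQL text (the placeholder syntax is an assumption, not stated in the original), could look like:

```scala
// Hypothetical implementation of setSQLVariables: substitutes ${name}
// placeholders in the SQL text with the supplied values. The ${...}
// placeholder convention is an assumption for illustration.
def setSQLVariables(sql: String, variables: Map[String, String]): String =
  variables.foldLeft(sql) { case (acc, (name, value)) =>
    acc.replace("${" + name + "}", value)
  }
```

For example, substituting Map("id" -> "42") into "SELECT * FROM t WHERE id = ${id}" yields "SELECT * FROM t WHERE id = 42".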
The clue is that, for some queries, a result from a prior query is required, which is why creating VIEWs in the database itself does not solve the problem. With Spark 2.0 I have already figured out a way to do just that:
// create a DataFrame over a JDBC connection to the database
val tableDf = spark.read.jdbc(...)
// collision-free temporary view name
val tempTableName = "TEMP_TABLE" + java.util.UUID.randomUUID.toString.replace("-", "").toUpperCase
// load the SQL text from the classpath (requires: import scala.io.Source)
var sqlQuery = Source.fromURL(getClass.getResource("/sql/" + sqlFileName)).mkString
sqlQuery = setSQLVariables(sqlQuery, sqlVariables)
// point the query at the freshly registered temp view
sqlQuery = sqlQuery.replace("OLD_TABLE_NAME", tempTableName)
tableDf.createOrReplaceTempView(tempTableName)
val data = spark.sql(sqlQuery)
But this is, in my humble opinion, very fiddly. Also, more complex queries, e.g. queries that incorporate subquery factoring, currently don't work. Is there a more robust way, such as re-implementing the SQL code as Spark SQL code using filter($""), .select($""), etc.?
The overall goal is to get multiple org.apache.spark.sql.DataFrames, each representing the result of one former SQL query (which always involves a few JOINs, WITHs, etc.), so that n queries lead to n DataFrames.
Is there a better option than the two provided?
Setup: Hadoop v2.7.3, Spark 2.0.0, IntelliJ IDEA 2016.2, Scala 2.11.8, test cluster on a Win7 workstation
Accepted answer
It's not especially clear what your requirement is, but I think you're saying you have queries something like:
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM (SELECT * FROM people LEFT OUTER JOIN places ON ...) WHERE age>20
and you would want to declare and execute this efficiently as
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM <cachedresult> WHERE age>20
To achieve that I would enhance the input file so each SQL statement has an associated table name into which the result will be stored, e.g. (with \t denoting a tab separator):

PEOPLEPLACES\tSELECT * FROM people LEFT OUTER JOIN places ON ...
ADULTS\tSELECT * FROM PEOPLEPLACES WHERE age>18
Then execute in a loop, like:

parseSqlFile().foreach { case (name, query) =>
  val data: DataFrame = execute(query)
  data.createOrReplaceTempView(name)
}
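The parseSqlFile helper is not defined in the answer; a minimal sketch, assuming one tab-separated name/query pair per line (the file layout, comment convention, and function names are assumptions), might be:

```scala
import scala.io.Source

// Hypothetical parseSqlFile: each non-empty, non-comment line holds
// "TABLE_NAME<tab>SQL query"; pairs are returned in file order.
def parseSqlLines(lines: Iterator[String]): Seq[(String, String)] =
  lines
    .map(_.trim)
    .filter(l => l.nonEmpty && !l.startsWith("--")) // skip blanks and SQL comments
    .map { line =>
      val Array(name, query) = line.split("\t", 2)  // split on the first tab only
      (name.trim, query.trim)
    }
    .toVector

def parseSqlFile(path: String): Seq[(String, String)] =
  parseSqlLines(Source.fromFile(path).getLines())
```

A line without a tab would fail the Array(name, query) pattern match; a production version would validate the format instead.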
Make sure you declare the queries in order, so all required tables have been created before they are referenced. Otherwise do a little more parsing and sort by dependencies.
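The "sort by dependencies" step could be sketched like this, assuming (purely as an illustration) that a query depends on another entry whenever its SQL text mentions that entry's table name:

```scala
// Hypothetical dependency ordering for the (name, query) pairs: emit a
// query only once every table name it mentions has already been emitted.
def sortByDependencies(queries: Seq[(String, String)]): Seq[(String, String)] = {
  val names = queries.map(_._1).toSet

  // crude dependency detection: the query text contains another entry's name
  def deps(query: String): Set[String] = names.filter(n => query.contains(n))

  @annotation.tailrec
  def loop(remaining: Seq[(String, String)], done: Seq[(String, String)]): Seq[(String, String)] =
    if (remaining.isEmpty) done
    else {
      val emitted = done.map(_._1).toSet
      val (ready, blocked) = remaining.partition { case (name, query) =>
        (deps(query) - name).subsetOf(emitted)
      }
      if (ready.isEmpty)
        sys.error("cyclic or unresolved dependencies: " + blocked.map(_._1).mkString(", "))
      loop(blocked, done ++ ready)
    }

  loop(queries, Vector.empty)
}
```

Name containment is a blunt heuristic (a table name that is a substring of another would confuse it); a real implementation would parse the SQL instead.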
In an RDBMS I'd call these tables materialised views, i.e. a transform on other data, like a view, but with the result cached for later reuse.