What's the difference between SparkSession.sql and Dataset.sqlContext.sql?
Question
I have the following snippets of code and I wonder what the difference is between these two and which one I should use. I am using Spark 2.2.
Dataset<Row> df = sparkSession.readStream()
.format("kafka")
.load();
df.createOrReplaceTempView("table");
df.printSchema();
Dataset<Row> resultSet = df.sqlContext().sql("select value from table"); //sparkSession.sql(this.query);
StreamingQuery streamingQuery = resultSet
.writeStream()
.trigger(Trigger.ProcessingTime(1000))
.format("console")
.start();
vs
Dataset<Row> df = sparkSession.readStream()
.format("kafka")
.load();
df.createOrReplaceTempView("table");
Dataset<Row> resultSet = sparkSession.sql("select value from table"); //sparkSession.sql(this.query);
StreamingQuery streamingQuery = resultSet
.writeStream()
.trigger(Trigger.ProcessingTime(1000))
.format("console")
.start();
Answer
There is a very subtle difference between sparkSession.sql("sql query") and df.sqlContext().sql("sql query").
Please note that you can have zero, two or more SparkSessions in a single Spark application (but it's assumed you'll have at least, and often only, one SparkSession in a Spark SQL application).
Please also note that a Dataset is bound to the SparkSession it was created in, and that SparkSession will never change.
You may be wondering why anyone would want this, but it gives you a boundary between queries: you can use the same table names for different datasets, which is actually a very powerful feature of Spark SQL.
The following example shows the difference and hopefully gives you some idea of why it's powerful.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
scala> val df = spark.range(5)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.sqlContext.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
scala> val anotherSession = spark.newSession
anotherSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@195c5803
scala> anotherSession.range(10).createOrReplaceTempView("new_table")
scala> anotherSession.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| |new_table| true|
+--------+---------+-----------+
scala> df.sqlContext.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
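As a follow-up sketch (continuing the same spark-shell session above; the name df2 is illustrative), a Dataset created in anotherSession resolves temp views through the session it was created in, so its sqlContext can see new_table while spark still cannot:

```scala
scala> val df2 = anotherSession.range(3)

scala> df2.sqlContext.sql("show tables").show
// lists new_table, because df2 is bound to anotherSession

scala> spark.sql("show tables").show
// still empty: new_table lives only in anotherSession's catalog
```

This is the same session-scoping shown above, viewed from the Dataset side: df.sqlContext().sql(...) runs the query against the catalog of the session that created df, whereas sparkSession.sql(...) runs it against whichever session you call it on. With a single SparkSession the two are equivalent.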