What's the difference between SparkSession.sql and Dataset.sqlContext.sql?


Question

I have the following two snippets of code and I wonder what the difference between them is and which one I should use. I am using Spark 2.2.

Dataset<Row> df = sparkSession.readStream()
    .format("kafka")
    // the Kafka source requires these options; the values here are placeholders
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "mytopic")
    .load();

df.createOrReplaceTempView("table");
df.printSchema();

Dataset<Row> resultSet =  df.sqlContext().sql("select value from table"); //sparkSession.sql(this.query);
StreamingQuery streamingQuery = resultSet
        .writeStream()
        .trigger(Trigger.ProcessingTime(1000))
        .format("console")
        .start();

vs

Dataset<Row> df = sparkSession.readStream()
    .format("kafka")
    // the Kafka source requires these options; the values here are placeholders
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "mytopic")
    .load();

df.createOrReplaceTempView("table");

Dataset<Row> resultSet =  sparkSession.sql("select value from table"); //sparkSession.sql(this.query);
StreamingQuery streamingQuery = resultSet
        .writeStream()
        .trigger(Trigger.ProcessingTime(1000))
        .format("console")
        .start();

Answer

There is a very subtle difference between sparkSession.sql("sql query") and df.sqlContext().sql("sql query").

Note that you can have zero, two, or more SparkSessions in a single Spark application (though it's assumed you'll have at least one, and often exactly one, SparkSession in a Spark SQL application).
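
As a minimal sketch of how a second session typically comes about (the app name is made up; newSession shares the underlying SparkContext but keeps its own registry of temporary views and its own SQL configuration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("demo").getOrCreate()
val extraSession = spark.newSession()

// one shared SparkContext...
assert(spark.sparkContext eq extraSession.sparkContext)
// ...but separate catalogs: a temp view registered in one session
// is not visible from the other (demonstrated in the shell transcript below)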

Note also that a Dataset is bound to the SparkSession it was created within, and that SparkSession never changes.
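
A quick way to see that binding from the shell (a minimal sketch; the res numbers are illustrative):

scala> val df = spark.range(5)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.sparkSession eq spark    // the Dataset keeps a reference to the session that created it
res1: Boolean = true

scala> df.sqlContext.sparkSession eq spark    // df.sqlContext is a thin wrapper over that same session
res2: Boolean = true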

You may be wondering why anyone would want that, but it gives you a boundary between queries: you can use the same table names for different datasets, which is actually a very powerful feature of Spark SQL (the example below ends with a sketch of exactly that).

The following example shows the difference, and hopefully gives you some idea of why it's powerful after all.

scala> spark.version
res0: String = 2.3.0-SNAPSHOT

scala> :type spark
org.apache.spark.sql.SparkSession

scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

scala> val df = spark.range(5)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.sqlContext.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

scala> val anotherSession = spark.newSession
anotherSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@195c5803

scala> anotherSession.range(10).createOrReplaceTempView("new_table")

scala> anotherSession.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|        |new_table|       true|
+--------+---------+-----------+


scala> df.sqlContext.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
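
In other words, new_table was registered in anotherSession's catalog, so it is invisible to df.sqlContext, which is bound to the original spark session. To see why that boundary is useful, here is a sketch continuing the same shell session: registering the same table name in both sessions binds it to different data in each.

scala> spark.range(3).createOrReplaceTempView("new_table")

scala> spark.sql("select * from new_table").show
+---+
| id|
+---+
|  0|
|  1|
|  2|
+---+

scala> anotherSession.sql("select * from new_table").show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+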
