What's the difference between SparkSession.sql and Dataset.sqlContext.sql?

Question

I have the following two snippets of code and I wonder what the difference between them is, and which one I should use. I am using Spark 2.2.

Dataset<Row> df = sparkSession.readStream()
    .format("kafka")
    // kafka source options (kafka.bootstrap.servers, subscribe, ...) omitted here
    .load();

df.createOrReplaceTempView("table");
df.printSchema();

Dataset<Row> resultSet = df.sqlContext().sql("select value from table"); //sparkSession.sql(this.query);
StreamingQuery streamingQuery = resultSet
        .writeStream()
        .trigger(Trigger.ProcessingTime(1000))
        .format("console")
        .start();

versus

Dataset<Row> df = sparkSession.readStream()
    .format("kafka")
    // kafka source options (kafka.bootstrap.servers, subscribe, ...) omitted here
    .load();

df.createOrReplaceTempView("table");

Dataset<Row> resultSet = sparkSession.sql("select value from table"); //sparkSession.sql(this.query);
StreamingQuery streamingQuery = resultSet
        .writeStream()
        .trigger(Trigger.ProcessingTime(1000))
        .format("console")
        .start();

Answer

sparkSession.sql("sql query") 之间存在一个非常细微的区别>df.sqlContext().sql("sql 查询").

There is a very subtle difference between sparkSession.sql("sql query") vs df.sqlContext().sql("sql query").

Please note that you can have zero, two, or more SparkSessions in a single Spark application (but it's assumed you'll have at least one, and often exactly one, SparkSession in a Spark SQL application).
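As a minimal sketch of that (the master URL and app name are placeholders of mine), newSession gives you a second session that shares the underlying SparkContext but keeps its own session state:

import org.apache.spark.sql.SparkSession

// Sketch: two sessions in one application. newSession shares the
// SparkContext but keeps independent state (temp views, SQL conf, UDFs).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("two-sessions")
  .getOrCreate()

val anotherSession = spark.newSession()

assert(spark ne anotherSession)                           // distinct sessions...
assert(spark.sparkContext eq anotherSession.sparkContext) // ...same context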

Please also note that a Dataset is bound to the SparkSession it was created in, and that SparkSession will never change.
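As a quick check (just a sketch, with a local master as my assumption), you can see that a Dataset carries a reference to the session it was created in, and that df.sqlContext is merely a thin wrapper around that same session, so with a single session the two calls in the question behave identically:

import org.apache.spark.sql.SparkSession

// Sketch: a Dataset remembers the session that created it, so with a
// single session, df.sqlContext().sql and sparkSession.sql go through
// the same session state and are equivalent.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(5)

assert(df.sparkSession eq spark)            // bound to its creating session
assert(df.sqlContext.sparkSession eq spark) // sqlContext wraps that same session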

You may be wondering why anyone would want that, but it gives you a boundary between queries: you could use the same table names for different datasets, and that is actually a very powerful feature of Spark SQL.

The following example shows the difference, and will hopefully give you some idea of why it is so powerful.

scala> spark.version
res0: String = 2.3.0-SNAPSHOT

scala> :type spark
org.apache.spark.sql.SparkSession

scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

scala> val df = spark.range(5)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.sqlContext.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

scala> val anotherSession = spark.newSession
anotherSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@195c5803

scala> anotherSession.range(10).createOrReplaceTempView("new_table")

scala> anotherSession.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|        |new_table|       true|
+--------+---------+-----------+


scala> df.sqlContext.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
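To make the "same table names for different datasets" point concrete, here is a sketch continuing the session above (the view name t is a placeholder of mine):

// Temp views are per-session, so the same name can refer to
// different data in each session.
spark.range(3).createOrReplaceTempView("t")
anotherSession.range(100).createOrReplaceTempView("t")

spark.sql("select count(*) from t").show()          // prints 3
anotherSession.sql("select count(*) from t").show() // prints 100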
