Creating many, short-lived SparkSessions


Question

I've got an application that orchestrates batch job executions, and I want to create a SparkSession per job execution, mainly to get a clean separation of registered temp views, functions, etc.

So this would lead to thousands of SparkSessions per day, each living only for the duration of a job (from a few minutes up to several hours). Is there any argument against doing this?

I am aware that there is only one SparkContext per JVM. I also know that a SparkContext performs some JVM-global caching, but what exactly does that mean for this scenario? What, for example, is cached in a SparkContext, and what would happen if many Spark jobs are executed using those sessions?

Answer

This shows how multiple sessions can be built with different configurations. Use

SparkSession.clearActiveSession();

SparkSession.clearDefaultSession();

to clear a session. (Both are static methods on SparkSession; calling them through an instance such as spark1 compiles in Java but is misleading.)

    SparkSession spark1 = SparkSession.builder()
            .master("local[*]")
            .appName("app1")
            .getOrCreate();
    Dataset<Row> df = spark1.read().format("csv").load("data/file1.csv");
    df.show();

    // Both methods are static: clear the thread-local active session and the
    // JVM-wide default so the next builder call creates a fresh session.
    SparkSession.clearActiveSession();
    SparkSession.clearDefaultSession();

    SparkSession spark2 = SparkSession.builder()
            .master("local[*]")
            .appName("app2")
            .getOrCreate();
    // Note: this must read through spark2 (the original snippet mistakenly
    // used spark1 here, which was cleared above).
    Dataset<Row> df2 = spark2.read().format("csv").load("data/file2.csv");
    df2.show();

As for your questions: the SparkContext keeps RDDs in memory for quicker processing; if there is a lot of data, cached tables or RDDs spill to disk. A session can access a table if it was saved as a view at some point. It is better to do multiple spark-submits for your jobs with unique IDs than to juggle different configs inside one JVM.
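As an aside on the temp-view separation the question asks about: it can also be achieved without clearing sessions at all, because SparkSession.newSession() shares the single per-JVM SparkContext (and its cached data) while keeping its own registry of temporary views and UDFs. A minimal sketch, assuming a local Spark environment on the classpath; the class name and view name are illustrative:

```java
import org.apache.spark.sql.SparkSession;

public class SessionIsolation {
    public static void main(String[] args) {
        // One SparkContext per JVM; all sessions below share it.
        SparkSession base = SparkSession.builder()
                .master("local[*]")
                .appName("orchestrator")
                .getOrCreate();

        // Each job gets its own session with an isolated temp-view catalog.
        SparkSession job1 = base.newSession();
        job1.range(10).createOrReplaceTempView("t"); // visible only in job1

        SparkSession job2 = base.newSession();
        System.out.println(job1.catalog().tableExists("t")); // true
        System.out.println(job2.catalog().tableExists("t")); // false

        base.stop();
    }
}
```

This way a short-lived "session per job" costs only a lightweight catalog, not a new SparkContext.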
