Creating many, short-lived SparkSessions


Question

I've got an application that orchestrates batch job executions, and I want to create a SparkSession per job execution, mainly to get a clean separation of registered temp views, functions, etc.

So this would lead to thousands of SparkSessions per day, each living only for the duration of a job (from a few minutes up to several hours). Is there any argument against doing this?

I am aware that there is only one SparkContext per JVM. I also know that a SparkContext performs some JVM-global caching, but what exactly does that mean for this scenario? What, for example, is cached in a SparkContext, and what would happen if many Spark jobs are executed using those sessions?

Answer

This shows how multiple sessions can be built with different configurations. Use

SparkSession.clearActiveSession();

SparkSession.clearDefaultSession();

to clear a session. (Both are static methods on SparkSession; calling them through an instance such as spark1 compiles in Java but is misleading.)

    SparkSession spark1 = SparkSession.builder()
            .master("local[*]")
            .appName("app1")
            .getOrCreate();
    Dataset<Row> df = spark1.read().format("csv").load("data/file1.csv");
    df.show();

    // Both methods are static: clear the thread-local active session and the
    // JVM-wide default so the next builder call creates a fresh session.
    SparkSession.clearActiveSession();
    SparkSession.clearDefaultSession();

    SparkSession spark2 = SparkSession.builder()
            .master("local[*]")
            .appName("app2")
            .getOrCreate();
    // Note: this must read through spark2 (the original snippet mistakenly
    // used spark1 here, which was cleared above).
    Dataset<Row> df2 = spark2.read().format("csv").load("data/file2.csv");
    df2.show();

As for your questions: the SparkContext keeps RDDs in memory for quicker processing; if there is a lot of data, cached tables or RDDs spill to disk. A session can access a table if it was saved as a view at some point. It is better to do multiple spark-submits for your jobs with unique IDs than to juggle different configs inside one JVM.
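As an aside on the temp-view separation the question asks about: it can also be achieved without clearing sessions at all, because SparkSession.newSession() shares the single per-JVM SparkContext (and its cached data) while keeping its own registry of temporary views and UDFs. A minimal sketch, assuming a local Spark environment on the classpath; the class name and view name are illustrative:

```java
import org.apache.spark.sql.SparkSession;

public class SessionIsolation {
    public static void main(String[] args) {
        // One SparkContext per JVM; all sessions below share it.
        SparkSession base = SparkSession.builder()
                .master("local[*]")
                .appName("orchestrator")
                .getOrCreate();

        // Each job gets its own session with an isolated temp-view catalog.
        SparkSession job1 = base.newSession();
        job1.range(10).createOrReplaceTempView("t"); // visible only in job1

        SparkSession job2 = base.newSession();
        System.out.println(job1.catalog().tableExists("t")); // true
        System.out.println(job2.catalog().tableExists("t")); // false

        base.stop();
    }
}
```

This way a short-lived "session per job" costs only a lightweight catalog, not a new SparkContext.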
