Task is running on only one executor in Spark


Problem description

I am running the code below in Spark using Java.

Code

Test.java

package com.sample;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.storage.StorageLevel;

import com.addition.AddTwoNumbers;

public class Test {

    private static final String APP_NAME = "Test";
    private static final String LOCAL = "local";
    private static final String MASTER_IP = "spark://10.180.181.26:7077";

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName(APP_NAME).setMaster(MASTER_IP);
        String connection = "jdbc:oracle:thin:test/test@//xyz00aie.in.oracle.com:1521/PDX2600N";
        // Create Spark Context
        SparkContext context = new SparkContext(conf);
        // Create Spark Session

        SparkSession sparkSession = new SparkSession(context);
        long startTime = System.currentTimeMillis();
        System.out.println("Start time is : " + startTime);
        Dataset<Row> txnDf = sparkSession.read().format("jdbc").option("url", connection)
                .option("dbtable", "CI_TXN_DETAIL_STG_100M").load();

        System.out.println(txnDf.filter((txnDf.col("TXN_DETAIL_ID").gt(1286001510))
                .and(txnDf.col("TXN_DETAIL_ID").lt(1303001510))).count());


        sparkSession.stop();
    }

}

I am simply trying to find the count of a range of rows. The range is 20 million.

Below is a snapshot of the Spark dashboard:

Here I can see an active task on only one executor. I have a total of 10 executors running.

My question

Why is my application showing an active task on only one executor instead of distributing the work across all 10 executors?

Below is my spark-submit command:

./spark-submit --class com.sample.Test --conf spark.sql.shuffle.partitions=5001 --conf spark.yarn.executor.memoryOverhead=11264 --executor-memory=91GB --conf spark.yarn.driver.memoryOverhead=11264 --driver-memory=91G --executor-cores=17  --driver-cores=17 --conf spark.default.parallelism=306 --jars /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar,/scratch/rmbbuild/spark_ormb/drools-jars/Addition-1.0.jar --driver-class-path /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar --master spark://10.180.181.26:7077 "/scratch/rmbbuild/spark_ormb/POC-jar/Test-0.0.1-SNAPSHOT.jar" > /scratch/rmbbuild/spark_ormb/POC-jar/logs/log18.txt

Answer

It looks like all the data is read into a single partition, which then goes to one executor. To use more executors, more partitions have to be created. The numPartitions parameter can be used together with a partition column, as described here:

https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#jdbc-reads
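As a minimal sketch (reusing the connection string, table, and column from the question; the bound values and partition count are illustrative assumptions, not tuned values), the JDBC read in Test.java could be changed along these lines:

        // Partitioned JDBC read: Spark issues one query per partition, so the read
        // can be spread across executors instead of landing on a single one.
        // partitionColumn, lowerBound, upperBound and numPartitions must be set together.
        Dataset<Row> txnDf = sparkSession.read().format("jdbc")
                .option("url", connection)
                .option("dbtable", "CI_TXN_DETAIL_STG_100M")
                .option("partitionColumn", "TXN_DETAIL_ID")  // numeric column to split on
                .option("lowerBound", "1286001510")          // illustrative; ideally MIN(TXN_DETAIL_ID)
                .option("upperBound", "1303001510")          // illustrative; ideally MAX(TXN_DETAIL_ID)
                .option("numPartitions", "10")               // e.g. one partition per executor
                .load();

With that in place, the subsequent filter and count run as tasks over 10 partitions rather than 1.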

This link may also be useful:

Spark: Difference between numPartitions in read.jdbc(..numPartitions..) and repartition(..numPartitions..)
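The distinction behind that link's title is roughly this: numPartitions in the JDBC options controls how many parallel queries Spark sends to the database, while repartition only reshuffles data that has already been pulled through a single connection. A rough illustration (standard Dataset API, same assumed values as above):

        // Read through one JDBC connection, then redistribute: the database is still
        // queried by a single task, so the read itself is not parallelized.
        Dataset<Row> shuffledAfterRead = txnDf.repartition(10);

So repartition(10) only helps downstream stages; the partitioned JDBC options shown earlier are what make the read itself use all executors.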
