Task is running on only one executor in Spark
Question

I am running the code below in Spark using Java.
Code

Test.java
package com.sample;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Test {
    private static final String APP_NAME = "Test";
    private static final String MASTER_IP = "spark://10.180.181.26:7077";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName(APP_NAME).setMaster(MASTER_IP);
        String connection = "jdbc:oracle:thin:test/test@//xyz00aie.in.oracle.com:1521/PDX2600N";

        // Create Spark context and session
        SparkContext context = new SparkContext(conf);
        SparkSession sparkSession = new SparkSession(context);

        long startTime = System.currentTimeMillis();
        System.out.println("Start time is : " + startTime);

        // Read the table over JDBC (no partitioning options specified)
        Dataset<Row> txnDf = sparkSession.read().format("jdbc")
                .option("url", connection)
                .option("dbtable", "CI_TXN_DETAIL_STG_100M")
                .load();

        // Count rows with TXN_DETAIL_ID in the range (1286001510, 1303001510)
        System.out.println(txnDf.filter(txnDf.col("TXN_DETAIL_ID").gt(1286001510)
                .and(txnDf.col("TXN_DETAIL_ID").lt(1303001510))).count());

        sparkSession.stop();
    }
}
I am simply trying to find the count of a range of rows. The range is 20 million.
Below is a snapshot of the Spark dashboard.
Here I can see active tasks on only one executor. I have a total of 10 executors running.
My question

Why is my application showing active tasks on only one executor instead of distributing them across all 10 executors?
Below is my spark-submit command:
./spark-submit --class com.sample.Test \
  --conf spark.sql.shuffle.partitions=5001 \
  --conf spark.yarn.executor.memoryOverhead=11264 \
  --executor-memory=91GB \
  --conf spark.yarn.driver.memoryOverhead=11264 \
  --driver-memory=91G \
  --executor-cores=17 \
  --driver-cores=17 \
  --conf spark.default.parallelism=306 \
  --jars /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar,/scratch/rmbbuild/spark_ormb/drools-jars/Addition-1.0.jar \
  --driver-class-path /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar \
  --master spark://10.180.181.26:7077 \
  "/scratch/rmbbuild/spark_ormb/POC-jar/Test-0.0.1-SNAPSHOT.jar" > /scratch/rmbbuild/spark_ormb/POC-jar/logs/log18.txt
Answer
It looks like all the data is read into a single partition, which goes to a single executor. To use more executors, more partitions have to be created. The "numPartitions" parameter can be used together with a partition column, as described here:
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#jdbc-reads
This link may also be useful:

Difference between numPartitions in spark.read.jdbc(..numPartitions..) and repartition(..numPartitions..)
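Concretely, the read in the question could pass `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` so that Spark issues one JDBC query per partition instead of one query total. Spark then splits the bound range into `numPartitions` stride-sized WHERE clauses. The sketch below is a simplified illustration of that splitting (the real logic lives in Spark's JDBC source and also handles nulls and uneven strides); the bounds used are assumptions taken from the question's `TXN_DETAIL_ID` filter:

```java
import java.util.ArrayList;
import java.util.List;

public class JdbcPartitionSketch {

    // The fix itself is just extra options on the read (shown here as a comment,
    // since it needs a live Spark cluster and Oracle DB to run):
    //
    //   Dataset<Row> txnDf = sparkSession.read().format("jdbc")
    //           .option("url", connection)
    //           .option("dbtable", "CI_TXN_DETAIL_STG_100M")
    //           .option("partitionColumn", "TXN_DETAIL_ID")   // must be numeric, date, or timestamp
    //           .option("lowerBound", "1286001510")
    //           .option("upperBound", "1303001510")
    //           .option("numPartitions", "10")                // one JDBC query per partition
    //           .load();

    // Simplified sketch of how Spark turns the bounds into per-partition
    // WHERE clauses: split [lowerBound, upperBound) into numPartitions
    // contiguous ranges of equal stride. Assumes numPartitions >= 2 and
    // an evenly divisible range; Spark's real implementation handles more cases.
    static List<String> partitionPredicates(String column, long lowerBound,
                                            long upperBound, int numPartitions) {
        List<String> predicates = new ArrayList<>();
        long stride = (upperBound - lowerBound) / numPartitions;
        long start = lowerBound;
        for (int i = 0; i < numPartitions; i++) {
            long end = (i == numPartitions - 1) ? upperBound : start + stride;
            if (i == 0) {
                // First partition also picks up values below lowerBound
                predicates.add(column + " < " + end);
            } else if (i == numPartitions - 1) {
                // Last partition also picks up values above upperBound
                predicates.add(column + " >= " + start);
            } else {
                predicates.add(column + " >= " + start + " AND " + column + " < " + end);
            }
            start = end;
        }
        return predicates;
    }

    public static void main(String[] args) {
        // Hypothetical bounds matching the question's TXN_DETAIL_ID filter
        for (String p : partitionPredicates("TXN_DETAIL_ID", 1286001510L, 1303001510L, 10)) {
            System.out.println(p);
        }
    }
}
```

Note that `lowerBound` and `upperBound` only control how the range is split; they do not filter rows, which is why the first and last partitions are open-ended. Also, `spark.sql.shuffle.partitions` and `spark.default.parallelism` from the spark-submit command do not affect this initial JDBC read.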