Converting a MySQL table to a Spark dataset is very slow compared to doing the same from a CSV file
Problem description
I have a CSV file in Amazon S3 which is 62 MB in size (114,000 rows). I am converting it into a Spark dataset and taking the first 500 rows from it. The code is as follows:
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set=df.load("s3n://"+this.accessId.replace("\"", "")+":"+this.accessToken.replace("\"", "")+"@"+this.bucketName.replace("\"", "")+"/"+this.filePath.replace("\"", "")+"");
set.take(500);
The whole operation takes 20 to 30 seconds.
Now I am trying the same thing, but instead of the CSV file I am using a MySQL table with 119,000 rows. The MySQL server is on Amazon EC2. The code is as follows:
String url ="jdbc:mysql://"+this.hostName+":3306/"+this.dataBaseName+"?user="+this.userName+"&password="+this.password;
SparkSession spark=StartSpark.getSparkSession();
SQLContext sc = spark.sqlContext();
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set = sc
    .read()
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver","com.mysql.jdbc.Driver")
    .format("jdbc")
    .load();
set.take(500);
This takes 5 to 10 minutes. I am running Spark inside a JVM, using the same configuration in both cases.
I could use partitionColumn, numPartitions, etc., but I do not have any numeric column, and another issue is that the schema of the table is unknown to me.
My question is not how to reduce the time required, since I know that ideally Spark would run on a cluster. What I cannot understand is why there is such a big time difference between the two cases above.
Recommended answer
This problem has been covered multiple times on StackOverflow:
- How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
- spark jdbc df limit... what is it doing?
- How to use JDBC source to write and read data in (Py)Spark?
and in external sources:
- https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads
So just to reiterate: by default DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread and a single executor.
To distribute reads:
- use ranges with lowerBound / upperBound:
// "foo" stands for a numeric column of the table
Dataset<Row> set = sc
    .read()
    .option("partitionColumn", "foo")
    .option("numPartitions", "3")
    .option("lowerBound", 0)
    .option("upperBound", 30)
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver","com.mysql.jdbc.Driver")
    .format("jdbc")
    .load();
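To see what these four options do, here is a minimal plain-Java sketch that approximates how the JDBC source turns them into one WHERE clause per partition. It is not Spark's actual code: the class and method names are made up for illustration, and the stride arithmetic is simplified. The point it shows is that lowerBound and upperBound only decide where the partitions are cut; they do not filter rows, so values below 0 or above 30 still end up in the first or last partition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: approximates the per-partition
// WHERE clauses derived from partitionColumn, lowerBound, upperBound and numPartitions.
class RangePartitionSketch {

    static List<String> whereClauses(String column, long lowerBound, long upperBound, int numPartitions) {
        if (numPartitions < 2 || lowerBound >= upperBound) {
            throw new IllegalArgumentException("need numPartitions >= 2 and lowerBound < upperBound");
        }
        // Simplified stride: width of each partition's value range.
        long stride = (upperBound - lowerBound) / numPartitions;
        List<String> clauses = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            long start = lowerBound + i * stride;
            long end = start + stride;
            if (i == 0) {
                // First partition is open at the bottom and also collects NULLs.
                clauses.add(column + " < " + end + " OR " + column + " IS NULL");
            } else if (i == numPartitions - 1) {
                // Last partition is open at the top.
                clauses.add(column + " >= " + start);
            } else {
                clauses.add(column + " >= " + start + " AND " + column + " < " + end);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Same values as in the snippet above: foo, 0, 30, 3 partitions.
        for (String clause : whereClauses("foo", 0, 30, 3)) {
            System.out.println("SELECT ... WHERE " + clause);
        }
    }
}
```

Each clause becomes a separate query on its own connection, which is what lets several tasks read in parallel.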
- use predicates:
// requires: import java.util.Properties;
Properties properties = new Properties();
Dataset<Row> set = sc
    .read()
    .jdbc(
        url, this.tableName,
        new String[]{"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
        properties
    );
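Since the question mentions that the table has no numeric column, one possible way to still build predicates is to hash a non-numeric column and split on the remainder. The sketch below only builds the predicate strings. It assumes MySQL's CRC32 and MOD functions; the class name, method name, and the column name "name" are hypothetical, and the column would first have to be looked up, because the schema is unknown to the asker. Spark does not check that predicates are disjoint or complete, so the helper puts NULL values into the first partition explicitly.

```java
// Hypothetical helper, not part of Spark: builds one predicate per partition
// by hashing a (possibly non-numeric) column with MySQL's CRC32 function.
class HashPredicateSketch {

    static String[] hashPredicates(String column, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be at least 1");
        }
        String[] predicates = new String[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            String predicate = "MOD(CRC32(" + column + "), " + numPartitions + ") = " + i;
            if (i == 0) {
                // CRC32(NULL) is NULL, so NULL rows would match no partition;
                // assign them to the first one.
                predicate = "(" + predicate + " OR " + column + " IS NULL)";
            }
            predicates[i] = predicate;
        }
        return predicates;
    }

    public static void main(String[] args) {
        // "name" is an assumed column; replace it with a real column of the table.
        for (String predicate : hashPredicates("name", 3)) {
            System.out.println(predicate);
        }
    }
}
```

The resulting array can be passed as the third argument of the jdbc(url, table, predicates, properties) call shown above. The database still has to evaluate the hash for every row in every query, so this spreads the transfer across connections rather than reducing the work done by MySQL.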