Converting a MySQL table to a Spark dataset is very slow compared to the same from a CSV file

Views: 31 Published: 2021/11/12 5:45:36 Tags: java mysql apache-spark jdbc amazon-s3


Question

I have a CSV file in Amazon S3 which is 62 MB in size (114 000 rows). I am converting it into a Spark dataset and taking the first 500 rows from it. The code is as follows:

// Read the CSV from S3; the first line holds the column names.
DataFrameReader df = spark.read().format("csv").option("header", true);
Dataset<Row> set = df.load("s3n://" + this.accessId.replace("\"", "") + ":" + this.accessToken.replace("\"", "") + "@" + this.bucketName.replace("\"", "") + "/" + this.filePath.replace("\"", ""));

set.take(500);

The whole operation takes 20 to 30 seconds.

Now I am trying the same thing, but instead of a CSV file I am using a MySQL table with 119 000 rows. The MySQL server runs on Amazon EC2. The code is as follows:

String url = "jdbc:mysql://" + this.hostName + ":3306/" + this.dataBaseName + "?user=" + this.userName + "&password=" + this.password;

SparkSession spark = StartSpark.getSparkSession();

SQLContext sc = spark.sqlContext();

// Read the whole table over JDBC; no partitioning options are set.
Dataset<Row> set = sc
            .read()
            .format("jdbc")
            .option("url", url)
            .option("dbtable", this.tableName)
            .option("driver", "com.mysql.jdbc.Driver")
            .load();
set.take(500);

This takes 5 to 10 minutes. I am running Spark inside a JVM, with the same configuration in both cases.

I could use partitionColumn, numPartitions, etc., but I do not have any numeric column, and another issue is that I do not know the schema of the table.

My issue is not how to decrease the required time, as I know that in the ideal case Spark will run in a cluster. What I cannot understand is why there is such a big time difference between the two cases above.

Recommended answer

This problem has been covered multiple times on StackOverflow, and in external sources:

https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads

So, just to reiterate: by default DataFrameReader.jdbc does not distribute data or reads. It uses a single thread on a single executor.

To distribute reads:

Use ranges with lowerBound / upperBound:

// Split the read into 3 range queries on the numeric column "foo".
Dataset<Row> set = sc
    .read()
    .format("jdbc")
    .option("partitionColumn", "foo")
    .option("numPartitions", "3")
    .option("lowerBound", 0)
    .option("upperBound", 30)
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver", "com.mysql.jdbc.Driver")
    .load();
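
To make the effect of these options more concrete, here is a minimal sketch in plain Java (no Spark dependency) of how a range split of this kind can be turned into one WHERE clause per partition. It approximates the stride logic Spark applies to partitionColumn, lowerBound, upperBound and numPartitions; it is not Spark's actual code, and the class and method names (JdbcStrideSketch, rangePredicates) are made up for illustration. Note that lowerBound and upperBound only decide the stride; they do not filter rows, so values outside the bounds end up in the first or last partition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: approximates how a numeric range
// is cut into per-partition WHERE clauses.
final class JdbcStrideSketch {

    static List<String> rangePredicates(String column, long lower, long upper, int numPartitions) {
        if (numPartitions < 2 || lower >= upper) {
            throw new IllegalArgumentException("need numPartitions >= 2 and lower < upper");
        }
        // Width of each partition's slice of the column range.
        long stride = upper / numPartitions - lower / numPartitions;
        List<String> predicates = new ArrayList<>();
        long current = lower;
        for (int i = 0; i < numPartitions; i++) {
            // The first partition has no lower limit, the last one no upper limit,
            // so rows outside [lower, upper) are still read.
            String lowerClause = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upperClause = (i == numPartitions - 1) ? null : column + " < " + current;
            if (lowerClause == null) {
                // NULL values are assigned to the first partition.
                predicates.add(upperClause + " OR " + column + " IS NULL");
            } else if (upperClause == null) {
                predicates.add(lowerClause);
            } else {
                predicates.add(lowerClause + " AND " + upperClause);
            }
        }
        return predicates;
    }

    public static void main(String[] args) {
        // Same values as the options above: column foo, bounds 0..30, 3 partitions.
        rangePredicates("foo", 0, 30, 3).forEach(System.out::println);
    }
}
```

Each resulting clause corresponds to one query that an executor can run independently, which is what lets the read proceed in parallel instead of through a single connection.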

Or use predicates:

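// Each predicate becomes the WHERE clause of one partition's query.
Properties properties = new Properties(); // java.util.Properties; extra connection settings, if any
Dataset<Row> set = sc
    .read()
    .jdbc(
        url, this.tableName,
        new String[] {"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
        properties
    );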
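
The question mentions that the table has no numeric column. One possible workaround, sketched below under assumptions, is to build the predicates from a hash of a non-numeric column, so that every row falls into exactly one bucket. The sketch relies on MySQL's CRC32 and MOD functions; the class name HashPredicateSketch, the method hashPredicates and the column name "name" are hypothetical, and at least one column name has to be known for this to apply.

```java
// Hypothetical helper, not part of Spark: builds one predicate per bucket
// by hashing a (possibly non-numeric) column on the MySQL side.
final class HashPredicateSketch {

    static String[] hashPredicates(String column, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        String[] predicates = new String[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            // CRC32 returns an unsigned integer, so MOD yields a bucket in [0, numPartitions).
            predicates[i] = "MOD(CRC32(" + column + "), " + numPartitions + ") = " + i;
        }
        // CRC32(NULL) is NULL and would match no bucket, so NULL rows go to bucket 0.
        predicates[0] = predicates[0] + " OR " + column + " IS NULL";
        return predicates;
    }

    public static void main(String[] args) {
        // "name" is an assumed column; replace it with a real column of the table.
        for (String predicate : hashPredicates("name", 3)) {
            System.out.println(predicate);
        }
    }
}
```

The returned array could be passed as the predicates argument of the jdbc(url, table, predicates, properties) call shown above. A likely trade-off is that hashed predicates cannot use an index, so each partition's query may scan the whole table on the MySQL side.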
