Spark reading data from MySQL in parallel

Views: 48 | Published: 2021/11/14 21:40:16 | Tags: mysql, apache-spark, pyspark, apache-spark-sql


Problem description

I'm trying to read data from MySQL and write it to Parquet files in S3, partitioned by a specific column, as follows:

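from pyspark.sql.functions import to_date

# Read the MySQL table over JDBC; numPartitions is the only partitioning option set here
df = sqlContext.read.format('jdbc')\
   .options(driver='com.mysql.jdbc.Driver',
            url="""jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>""",
            dbtable='tbl',
            numPartitions=4)\
   .load()

# Derive a date column from updated_at and partition the Parquet output by it
df2 = df.withColumn('updated_date', to_date(df.updated_at))
df2.write.parquet(path='s3n://parquet_location', mode='append', partitionBy=['updated_date'])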

My problem is that Spark opens only one connection to MySQL (instead of 4) and doesn't write anything to Parquet until it has fetched all the data from MySQL. Because my MySQL table is huge (100M rows), the process fails with an OutOfMemory error.

Is there a way to configure Spark to open more than one connection to MySQL and to write partial data to Parquet?
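
One likely cause is that the read sets numPartitions without a partition column: Spark's JDBC source only splits a table into parallel range queries when partitionColumn, lowerBound, upperBound and numPartitions are all given, and otherwise reads through a single connection. The sketch below shows one possible way to set these options. It assumes the table has a numeric column named `id` with values roughly between 1 and 100,000,000; that column name, the bounds and the helper names `jdbc_partition_options` and `read_partitioned` are hypothetical and not taken from the question.

```python
def jdbc_partition_options(url, table, column, lower, upper, num_partitions):
    """Build the option dict for a range-partitioned JDBC read.

    All arguments are illustrative; `column` is assumed to be a numeric
    column of `table` (for example an auto-increment primary key).
    """
    return {
        "driver": "com.mysql.jdbc.Driver",
        "url": url,
        "dbtable": table,
        # These four options must be supplied together for a parallel read.
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


def read_partitioned(sqlContext, options):
    """Load the table as a DataFrame with one JDBC query per partition."""
    return sqlContext.read.format("jdbc").options(**options).load()


# Example usage (needs a live SQLContext and a reachable MySQL server):
# opts = jdbc_partition_options(
#     "jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>",
#     "tbl", "id", 1, 100000000, 4)   # "id" and the bounds are assumptions
# df = read_partitioned(sqlContext, opts)
```

lowerBound and upperBound only determine how the range is divided into strides; they do not filter rows, so values outside the bounds still end up in the first or last partition. If a single slice is still too large for an executor, raising numPartitions reduces the number of rows each task has to hold.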
