Spark reading data from MySQL in parallel


Question

I'm trying to read data from MySQL and write it back to a Parquet file in S3 with specific partitions, as follows:

from pyspark.sql.functions import to_date

df = sqlContext.read.format('jdbc')\
    .options(driver='com.mysql.jdbc.Driver',
             url="jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>",
             dbtable='tbl',
             numPartitions=4)\
    .load()

df2 = df.withColumn('updated_date', to_date(df.updated_at))
df2.write.parquet(path='s3n://parquet_location', mode='append', partitionBy=['updated_date'])

My problem is that Spark opens only one connection to MySQL (instead of 4), and it doesn't write any Parquet until it has fetched all the data from MySQL. Because my table in MySQL is huge (100M rows), the process fails with an OutOfMemory error.

Is there a way to configure Spark to open more than one connection to MySQL and to write partial data to Parquet?

Answer

You should set all of the following properties; numPartitions on its own is not enough, because without a partition column and bounds Spark has no way to split the table into ranges and falls back to a single connection:

partitionColumn, 
lowerBound, 
upperBound, 
numPartitions

These are documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
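
For example, here is a minimal sketch of a partitioned JDBC read under stated assumptions: the column name id (an auto-increment primary key) and the bound values are hypothetical; substitute a roughly uniformly distributed numeric or date column from your own table:

# Sketch only: 'id' and its bounds are assumptions, not from the question.
df = sqlContext.read.format('jdbc')\
    .options(driver='com.mysql.jdbc.Driver',
             url="jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>",
             dbtable='tbl',
             partitionColumn='id',    # assumed numeric primary key
             lowerBound='1',
             upperBound='100000000',  # assumed max id (~100M rows)
             numPartitions=4)\
    .load()

Note that lowerBound and upperBound only control how the ranges are split; rows outside the bounds are still read into the first and last partitions. With these options, Spark opens 4 parallel connections (one per partition), and each task can write its own slice to Parquet instead of a single executor holding the whole table in memory.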
