Saving DataFrame to Parquet takes a lot of time


Problem description

I have a Spark data frame with around 458MM rows. It was initially an RDD, which I then converted to a data frame using sqlContext.createDataFrame.

The first few rows of the RDD are as follows:

sorted_rdd.take(5)
Out[25]:
[(353, 21, u'DLR_Where Dreams Come True Town Hall', 0, 0.896152913570404),
 (353, 2, u'DLR_Leading at a Higher Level', 1, 0.7186800241470337),
 (353,
  220,
  u'DLR_The Year of a Million Dreams Leadership Update',
  0,
  0.687175452709198),
 (353, 1, u'DLR_Challenging Conversations', 1, 0.6632049083709717),
 (353,
  0,
  u'DLR_10 Keys to Inspiring, Engaging, and Energizing Your People',
  1,
  0.647541344165802)]

I save it into a data frame as below:

sorted_df=sqlContext.createDataFrame(sorted_rdd,['user','itemId','itemName','Original','prediction'])

Finally, I save it as follows:

sorted_df.write.parquet("predictions_df.parquet") 

I am using Spark on YARN with 50 executors of 10g each and 5 cores. The write command keeps running for an hour and the file is still not saved.

What is making it so slow?

Recommended answer

Two things I can think of to try:

  1. You might want to check the number of partitions you have. If you have too few partitions, you don't get the required parallelism.

  2. Spark does its work lazily. This means the write itself may be fast while the computation needed to produce the data is slow. What you can try is caching the dataframe (and performing an action such as count() on it to make sure it materializes), then writing again. If the save is fast now, the problem is with the computation, not the Parquet write.
