How do you control the size of the output file?


Problem description

In Spark, what is the best way to control the size of the output file? For example, in log4j we can specify a max file size, after which the file rotates.

I am looking for a similar solution for Parquet files. Is there a max file size option available when writing a file?

I have a few workarounds, but none of them is good. If I want to limit files to 64 MB, one option is to repartition the data and write it to a temp location, and then merge the files together using the file sizes in the temp location. But getting the correct file size is difficult.
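For concreteness, a minimal sketch of that workaround, assuming a spark-shell-style session where `sc` and `sqlContext` are in scope; `df`, `tempPath`, and `finalPath` are placeholder names, not from the original post:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Write once to a temp location, then size the final output at roughly 64 MB per file.
    df.write.parquet(tempPath)
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val tempBytes = fs.getContentSummary(new Path(tempPath)).getLength
    val numParts = (tempBytes / (64L * 1024 * 1024)).toInt.max(1)
    sqlContext.read.parquet(tempPath).repartition(numParts).write.parquet(finalPath)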

Recommended answer

It's impossible for Spark to control the size of Parquet files, because the DataFrame in memory needs to be encoded and compressed before being written to disk. Before this process finishes, there is no way to estimate the actual file size on disk.

So my solution is:

  • Write the DataFrame to HDFS, df.write.parquet(path)
  • Get the directory size and calculate the number of files

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val dirSize = fs.getContentSummary(new Path(path)).getLength   // total bytes on disk
val fileNum = (dirSize / (512L * 1024 * 1024)).toInt.max(1)    // let's say 512 MB per file

  • Read the directory and re-write to HDFS

    val df = sqlContext.read.parquet(path)         // read back from disk, not the original df
    df.coalesce(fileNum).write.parquet(newPath)    // newPath is the new output directory
    

    Do NOT reuse the original df, otherwise it will trigger your job twice: writing a DataFrame does not cache it, so transforming the original df again would recompute its entire lineage.

    Delete the old directory and rename the new directory back

    fs.delete(new Path(path), true)                // recursively remove the old output
    fs.rename(new Path(newPath), new Path(path))   // move the coalesced output into place
    

  • This solution has the drawback that it needs to write the data twice, which doubles the disk IO, but for now this is the only solution.
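Putting the steps together, a minimal end-to-end sketch could look like the following. It is a sketch under the same assumptions as the snippets above (a spark-shell-style session where `sc` and `sqlContext` are in scope); the helper name `coalesceOutput`, the `sizePerFileMB` parameter, and the `"_coalesced"` suffix are illustrative choices, not part of the original answer:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Rewrite the Parquet output under `path` into files of roughly `sizePerFileMB` each.
    // Error handling is omitted for brevity.
    def coalesceOutput(path: String, sizePerFileMB: Int = 512): Unit = {
      val fs = FileSystem.get(sc.hadoopConfiguration)
      val newPath = path + "_coalesced"   // temporary sibling directory

      // 1. Estimate how many files of the target size the existing data needs.
      val dirSize = fs.getContentSummary(new Path(path)).getLength
      val fileNum = (dirSize / (sizePerFileMB.toLong * 1024 * 1024)).toInt.max(1)

      // 2. Re-read the data from disk (not the original DataFrame) and rewrite it coalesced.
      sqlContext.read.parquet(path).coalesce(fileNum).write.parquet(newPath)

      // 3. Swap the coalesced output into place.
      fs.delete(new Path(path), true)
      fs.rename(new Path(newPath), new Path(path))
    }

As the answer notes, the data still gets written twice, so the extra disk IO is the price paid for evenly sized files.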
