How do you control the size of the output file?


Question

In Spark, what is the best way to control the size of the output files? For example, in log4j we can specify a maximum file size, after which the file rotates.

I am looking for a similar solution for Parquet files. Is there a maximum file size option available when writing a file?

I have a few workarounds, but none of them is good. If I want to limit files to 64 MB, one option is to repartition the data and write it to a temporary location, then merge the files together based on the file sizes in the temporary location. But getting the correct file size is difficult.
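The merge step of that workaround can be sketched as a simple greedy grouping over file sizes. This is only an illustration: `MergePlan` and `groupBySize` are hypothetical names, the file sizes are assumed to be already known (for example from a directory listing), and a file that is larger than the limit on its own simply ends up in a group by itself.

```scala
object MergePlan {
  // Walk the file sizes in order and start a new group whenever adding the
  // next file would push the current group past limitBytes.
  def groupBySize(sizes: Seq[Long], limitBytes: Long): Vector[Vector[Long]] = {
    require(limitBytes > 0, "limitBytes must be positive")
    sizes.foldLeft(Vector.empty[Vector[Long]]) { (groups, size) =>
      groups.lastOption match {
        case Some(last) if last.sum + size <= limitBytes =>
          groups.init :+ (last :+ size) // still fits in the current group
        case _ =>
          groups :+ Vector(size) // first file, or the current group is full
      }
    }
  }
}
```

Each inner vector is one set of temporary files to merge into a single output file. The grouping is only as good as the sizes fed into it, which is exactly the difficulty described above.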

Recommended answer

Spark cannot directly control the size of Parquet files, because the in-memory DataFrame has to be encoded and compressed before it is written to disk. Until that process finishes, there is no reliable way to estimate the actual file size on disk.
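A small experiment shows why the in-memory size says little about the on-disk size. In this sketch, gzip from the JDK stands in for Parquet's encoding and compression (an assumption made purely for illustration, since Parquet uses its own encodings and codecs), and `CompressedSizeDemo` is a hypothetical name.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream
import scala.util.Random

object CompressedSizeDemo {
  // Compress a byte array with gzip and return the compressed length in bytes.
  def gzipSize(data: Array[Byte]): Int = {
    val buffer = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buffer)
    try gzip.write(data) finally gzip.close()
    buffer.size()
  }

  val oneMiB: Int = 1024 * 1024

  // Two inputs with the same in-memory size but very different content.
  val repetitive: Array[Byte] = new Array[Byte](oneMiB) // all zeros
  val random: Array[Byte] = {
    val bytes = new Array[Byte](oneMiB)
    new Random(42).nextBytes(bytes)
    bytes
  }

  val repetitiveSize: Int = gzipSize(repetitive)
  val randomSize: Int = gzipSize(random)
}
```

Both arrays occupy 1 MiB in memory, yet the all-zero array compresses to a small fraction of what the random array needs. The same effect makes the size of a Parquet file depend on the data itself, not only on the row count or the in-memory footprint.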

So my workaround is:

  • Write the DataFrame to HDFS: df.write.parquet(path)
  • Get the directory size and calculate the number of files

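import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val dirSize = fs.getContentSummary(new Path(path)).getLength
// Let's say 512 MB per file; round up and keep at least one file
val fileNum = math.max(1, math.ceil(dirSize.toDouble / (512L * 1024 * 1024)).toInt)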
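The file count is a ceiling division with a floor of one: plain integer division rounds down and yields 0 for a directory smaller than the target size, which is not a usable partition count. A minimal sketch of that arithmetic, with `FileCount` and `targetFileCount` as hypothetical helper names that do not depend on Spark:

```scala
object FileCount {
  // Number of output files needed so that each file is at most about
  // targetBytes, rounding up and never returning fewer than one.
  def targetFileCount(dirSizeBytes: Long, targetBytes: Long): Int = {
    require(dirSizeBytes >= 0, "dirSizeBytes must be non-negative")
    require(targetBytes > 0, "targetBytes must be positive")
    val count = (dirSizeBytes + targetBytes - 1) / targetBytes // ceiling division
    math.max(1L, count).toInt
  }
}
```

For example, a 1300 MB directory with a 512 MB target needs 3 files, while a 100 MB directory still needs 1. The resulting files are only roughly the target size, because coalesce does not guarantee evenly sized partitions.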

  • Read the directory and rewrite it to HDFS

    // Read back what was just written instead of reusing the original df
    val writtenDf = sqlContext.read.parquet(path)
    writtenDf.coalesce(fileNum.toInt).write.parquet(newPath)
    

    Do NOT reuse the original df here; otherwise Spark will run the job that produced it a second time.

  • Delete the old directory and rename the new directory to the original path

    fs.delete(new Path(path), true)
    fs.rename(new Path(newPath), new Path(path))
    

This solution has the drawback that the data has to be written twice, which doubles the disk I/O, but for now this is the only solution.
