When are files "splittable"?

Views: 21 Published: 2021/12/15 19:19:51 Tags: hadoop apache-spark hive hdfs file-format


Problem description


When I'm using Spark, I sometimes run into one huge file in a Hive table, and at other times I need to process many smaller files in a Hive table.

I understand that when tuning Spark jobs, the right approach depends on whether or not the files are splittable. This page from Cloudera says that we should be aware of whether or not the files are splittable:

...For example, if your data arrives in a few large unsplittable files...

How do I know if my file is splittable?
How do I know the number of partitions to use if the file is splittable?
Is it better to err on the side of more partitions if I'm trying to write a piece of code that will work on any HIVE table, i.e. either of the two cases described above?

Solution

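Since Spark accepts Hadoop input files, the Hadoop rules on compression formats apply (the original answer illustrated them with an image at this point, which is not reproduced here).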

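Among the common compression formats, only bzip2 files are splittable out of the box. zlib, gzip, LZ4 and Snappy files are not splittable, and LZO files are not splittable unless the splits can be placed on block boundaries (see EDIT 2).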
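The rule above can be turned into a small lookup, sketched here in plain Java. This is only an illustration under stated assumptions, not how Spark or Hadoop decide: the class name `SplittableByExtension`, the file extensions, and the choice to treat a file with no known compression suffix as uncompressed (and therefore splittable) are all assumptions made for this example. LZO is listed as not splittable because that holds for an LZO file without an index.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: guesses splittability from the file extension only.
class SplittableByExtension {

    // Assumed extension -> splittable table, following the list in the answer.
    private static final Map<String, Boolean> CODEC_SPLITTABLE = Map.of(
            ".bz2", true,       // bzip2: splittable
            ".gz", false,       // gzip
            ".deflate", false,  // zlib
            ".lzo", false,      // LZO without an index
            ".lz4", false,      // LZ4
            ".snappy", false);  // Snappy

    static boolean isSplittable(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Boolean> entry : CODEC_SPLITTABLE.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        // Assumption: no known compression suffix means uncompressed text,
        // which can be split at arbitrary byte offsets.
        return true;
    }
}
```

For example, `SplittableByExtension.isSplittable("part-0000.gz")` returns `false`, while the same call with `"part-0000.bz2"` returns `true`. The sketch looks only at the file name, so it says nothing about container formats or about files whose extension does not match their actual codec.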

Regarding your question on partitions: partitioning does not depend on the file format you use. It depends on the content of the file, namely the values of the partition column, such as a date.

EDIT 1: Have a look at this SE question and this working code for reading a zip file in Spark.

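import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

// wholeTextFiles yields one (fileName, content) pair per file, so each file is read whole.
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(args[0]);
JavaRDD<String> lineCounts = fileNameContentsRDD.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> fileNameContent) throws Exception {
        String content = fileNameContent._2();
        // Split on runs of CR/LF characters to count the lines in the file.
        int numLines = content.split("[\r\n]+").length;
        return fileNameContent._1() + ":  " + numLines;
    }
});
List<String> output = lineCounts.collect();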

EDIT 2:

LZO files can be splittable.

LZO files can be split as long as the splits occur on block boundaries

Refer to this article for more details.
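To make "splits occur on block boundaries" concrete, here is a simplified sketch in plain Java. It assumes the start offsets of the compressed blocks are already known, for instance from some index, and it only picks split starts from those offsets. The class name `BlockAlignedSplits`, the method `splitStarts` and the `targetSplitSize` parameter are hypothetical names for this illustration, and this is not the actual LZO input format implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of choosing split points on block boundaries.
class BlockAlignedSplits {

    /**
     * Returns the start offsets of the splits. Every returned offset is one of
     * the given block start offsets, so no split begins in the middle of a block.
     *
     * @param blockOffsets    sorted start offsets of the compressed blocks (assumed known)
     * @param targetSplitSize desired minimum number of bytes per split
     */
    static List<Long> splitStarts(long[] blockOffsets, long targetSplitSize) {
        List<Long> starts = new ArrayList<>();
        long lastStart = 0;
        for (long offset : blockOffsets) {
            // Open a split at the first block, then again at the first block
            // boundary that lies at least targetSplitSize past the last start.
            if (starts.isEmpty() || offset - lastStart >= targetSplitSize) {
                starts.add(offset);
                lastStart = offset;
            }
        }
        return starts;
    }
}
```

With block offsets `{0, 40, 90, 130, 200, 260}` and a target split size of 100, the sketch returns `[0, 130, 260]`: each split starts exactly where a block starts. Without knowledge of the block offsets there is no safe place to cut, which is why a compressed file without such boundaries has to be read as a single split.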
