What are the files generated by Spark when using "saveAsTextFile"?

Question

I run a Spark job and save the output as a text file using the saveAsTextFile method, as specified at https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.RDD

These are the files that are created: [screenshot of the output directory]

Is the .crc file a Cyclic Redundancy Check file, used to check that the content of each generated file is correct?

The _SUCCESS file is always empty. What does this signify?

The files without an extension in the screenshot above contain the actual data from the RDD, but why are many files generated instead of just one?

Recommended answer

Those are files generated by the underlying Hadoop API that Spark calls when you invoke saveAsTextFile().


  • part- files: These are your output data files.

You will have one part- file per partition in the RDD you called saveAsTextFile() on. These files are written out in parallel, up to a certain limit (typically, the number of cores on the workers in your cluster). This means the output is written much faster than it would be if it all went into a single file, assuming your storage layer can handle the bandwidth.

You can check the number of partitions in your RDD, which should tell you how many part- files to expect, as follows:

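# PySpark
# Get the number of partitions of my_rdd.
# On old releases that lack getNumPartitions(),
# my_rdd._jrdd.splits().size() serves the same purpose.
my_rdd.getNumPartitions()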


  • _SUCCESS file: The presence of an empty _SUCCESS file simply means that the operation completed normally.

  • .crc files: I have not seen the .crc files before, but yes, presumably they are checks on the part- files.
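As a rough illustration of the three kinds of files described above, here is a minimal sketch in plain Python (no Spark required) that groups the entries of an output directory by role. The function name summarize_output_dir and the dictionary keys are made up for this example, and it assumes the output directory is on a local filesystem where it can be listed with os.listdir.

```python
import os


def summarize_output_dir(path):
    """Group the entries of a saveAsTextFile() output directory by role."""
    names = sorted(os.listdir(path))

    # Checksum side files: anything ending in ".crc".
    crc_files = [n for n in names if n.endswith(".crc")]

    # Data files: one per RDD partition, named like "part-00000".
    part_files = [
        n for n in names
        if n.startswith("part-") and not n.endswith(".crc")
    ]

    return {
        "part_files": part_files,
        "crc_files": crc_files,
        # The empty _SUCCESS marker is only present if the job completed normally.
        "succeeded": "_SUCCESS" in names,
    }
```

Calling summarize_output_dir on the directory passed to saveAsTextFile() should return as many entries in "part_files" as the RDD had partitions. If "succeeded" is False, the job probably did not finish writing its output.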
