Too many open files in Spark aborting Spark job


Question

In my application I am reading 40 GB of text data spread across 188 files. I split these files and create one XML file per line in Spark using a pair RDD. For 40 GB of input this produces many millions of small XML files, which is my requirement. Everything works fine, but when Spark saves the files to S3 it throws an error and the job fails.

This is the exception I get:

Caused by: java.nio.file.FileSystemException: /mnt/s3/emrfs-2408623010549537848/0000000000: Too many open files
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
    at java.nio.file.Files.newByteChannel(Files.java:361)
    at java.nio.file.Files.createFile(Files.java:632)
    at com.amazon.ws.emr.hadoop.fs.files.TemporaryFiles.create(TemporaryFiles.java:70)
    at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.openNewPart(MultipartUploadOutputStream.java:493)
    ... 21 more

ApplicationMaster host: 10.97.57.198
ApplicationMaster RPC port: 0
queue: default
start time: 1542344243252
final status: FAILED
tracking URL: http://ip-10-97-57-234.tr-fr-nonprod.aws-int.thomsonreuters.com:20888/proxy/application_1542343091900_0001/
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1542343091900_0001 finished with failed status

and also:

com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: D33581CA9A799F64; S3 Extended Request ID: /SlEplo+lCKQRVVH+zHiop0oh8q8WqwnNykK3Ga6/VM2HENl/eKizbd1rg4vZD1BZIpp8lk6zwA=), S3 Extended Request ID: /SlEplo+lCKQRVVH+zHiop0oh8q8WqwnNykK3Ga6/VM2HENl/eKizbd1rg4vZD1BZIpp8lk6zwA=

Here is my code:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object TestAudit {

  def main(args: Array[String]): Unit = {

    val inputPath = args(0)
    val output = args(1)
    val noOfHashPartitioner = args(2).toInt

    //val conf = new SparkConf().setAppName("AuditXML").setMaster("local")
    val conf = new SparkConf().setAppName("AuditXML")
    val sc = new SparkContext(conf)

    val input = sc.textFile(inputPath)

    // Each input line is "<fileName>|<fileContent>"; key every line by its target file name.
    val pairedRDD = input.map { row =>
      val split = row.split("\\|")
      val fileName = split(0)
      val fileContent = split(1)
      (fileName, fileContent)
    }

    // Writes each value to a file named after its key, dropping the key from the output.
    class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String]
    }

    // Note: the output path and the partition count (10000) are hardcoded here;
    // `output` and `noOfHashPartitioner` are read from args but not used.
    pairedRDD
      .partitionBy(new HashPartitioner(10000))
      .saveAsHadoopFile(
        "s3://a205381-tr-fr-development-us-east-1-trf-auditabilty//AUDITOUTPUT",
        classOf[String],
        classOf[String],
        classOf[RddMultiTextOutputFormat],
        classOf[GzipCodec])
  }
}
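For context, a job like this would typically be submitted along these lines; the jar name, input/output paths, and partition count below are placeholders rather than values from the question, and --master yarn only reflects the EMR/YARN setup implied by the logs:

spark-submit \
  --class TestAudit \
  --master yarn \
  --deploy-mode cluster \
  test-audit.jar \
  s3://my-input-bucket/textfiles/ \
  s3://my-output-bucket/AUDITOUTPUT/ \
  10000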

I even tried reducing the number of HashPartitioner partitions, but it still does not work.

Answer

Every process on a Unix system has a limit on the number of open files (file descriptors). Because your data is large and is split into many sub-files (internally, by Spark), your process hits this limit and fails. You can increase the number of file descriptors for each user as follows:

Edit the file /etc/security/limits.conf and add (or modify):

*         hard    nofile      500000
*         soft    nofile      500000
root      hard    nofile      500000
root      soft    nofile      500000

This sets the nofile limit (the number of file descriptors) to 500000 for every user as well as for the root user.

The changes are applied after restarting.
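You can then verify the limit that is actually in effect (a minimal check; the PID below is hypothetical):

ulimit -Sn        # soft limit for the current shell
ulimit -Hn        # hard limit for the current shell

# For an already-running process, such as a YARN NodeManager:
cat /proc/12345/limits | grep "open files"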

Alternatively, you can set the number of file descriptors for a specific process via LimitNOFILE. For example, if you run Spark jobs on YARN and the YARN daemons are started by systemd, you can add LimitNOFILE=128000 to the YARN systemd units (ResourceManager and NodeManager) to raise the file descriptor limit for those processes to 128000.
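As a sketch of that approach, assuming the NodeManager really is managed by systemd (the unit name hadoop-yarn-nodemanager.service is an assumption and may differ on your distribution; on EMR the daemons may not be systemd units at all), a drop-in override would look like:

# /etc/systemd/system/hadoop-yarn-nodemanager.service.d/limits.conf   (hypothetical unit name)
[Service]
LimitNOFILE=128000

Then reload systemd and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart hadoop-yarn-nodemanager    # unit name is an assumption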

Related articles:

  • 3 Methods to Change the Number of Open File Limit in Linux
  • Limits on the number of file descriptors
