Dealing with a large gzipped file in Spark

Problem description

I have a large (about 85 GB compressed) gzipped file from s3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances each with a 100 GB EBS volume). I am aware that gzip is a non-splittable file format, and I've seen it suggested that one should repartition the compressed file because Spark initially gives an RDD with one partition. However, after doing

scala> val raw = spark.read.format("com.databricks.spark.csv").
     | options(Map("delimiter" -> "\\t", "codec" -> "org.apache.hadoop.io.compress.GzipCodec")).
     | load("s3://path/to/file.gz").
     | repartition(sc.defaultParallelism * 3)
raw: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: string, _c1: string ... 48 more fields]
scala> raw.count()

and taking a look at the Spark application UI, I still see only one active executor (the other 14 are dead) with one task, and the job never finishes (or at least I've not waited long enough for it to).

  • What is going on here? Can someone help me understand how Spark works in this example?
  • Should I be using a different cluster configuration?
  • Unfortunately, I have no control over the compression format, but is there another way of dealing with such a file?

Recommended answer

If the file format is not splittable, then there's no way to avoid reading the file in its entirety on one core. In order to parallelize work, you have to know how to assign chunks of work to different computers. In the gzip case, suppose you divide it up into 128 MB chunks. The nth chunk depends on the (n-1)th chunk's position information to know how to decompress, which in turn depends on the (n-2)th chunk, and so on down to the first.

If you want to parallelize, you need to make this file splittable. One way is to unzip it and process it uncompressed; alternatively, you can unzip it, split it into several files (one for each parallel task you want), and gzip each of them, as sketched below.
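
As a rough sketch of the second option, not taken from the original answer: one could pay the single-core read once in Spark itself, then rewrite the data as many independently gzipped part files so that every later job reads them in parallel. The output path and the partition multiplier below are made-up examples, and the snippet assumes Spark 2.x's built-in CSV source and the spark-shell session from the question (spark, sc).

// Minimal sketch (assumptions noted above): read the single .gz file once...
val raw = spark.read
  .option("delimiter", "\t")
  .csv("s3://path/to/file.gz")            // still one partition: gzip is not splittable

// ...then shuffle it out into many separately gzipped part files.
raw
  .repartition(sc.defaultParallelism * 3) // the single-core read and this shuffle happen once
  .write
  .option("compression", "gzip")          // each part file is compressed independently
  .option("delimiter", "\t")
  .csv("s3://path/to/split/")             // hypothetical output location

// Later jobs read the split copy instead, getting one partition per part file:
val parallel = spark.read.option("delimiter", "\t").csv("s3://path/to/split/")

The first pass is still bound to one core because of the gzip format; the point is to pay that cost a single time so that everything downstream can run across all executors.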
