Dealing with a large gzipped file in Spark

Problem description

I have a large (about 85 GB compressed) gzipped file from s3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances each with a 100 GB EBS volume). I am aware that gzip is a non-splittable file format, and I've seen it suggested that one should repartition the compressed file because Spark initially gives an RDD with one partition. However, after doing

scala> val raw = spark.read.format("com.databricks.spark.csv").
     | options(Map("delimiter" -> "\t", "codec" -> "org.apache.hadoop.io.compress.GzipCodec")).
     | load("s3://path/to/file.gz").
     | repartition(sc.defaultParallelism * 3)
raw: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: string, _c1: string ... 48 more fields]
scala> raw.count()

and taking a look at the Spark application UI, I still see only one active executor (the other 14 are dead) with one task, and the job never finishes (or at least I've not waited long enough for it to).

  • What is going on here? Can someone help me understand how Spark is working in this example?
  • Should I be using a different cluster configuration?
  • Unfortunately, I have no control over the compression format, but is there an alternative way of dealing with such a file?

Recommended answer

If the file format is not splittable, then there's no way to avoid reading the file in its entirety on one core. In order to parallelize work, you have to know how to assign chunks of work to different computers. In the gzip case, suppose you divide it up into 128 MB chunks. The nth chunk needs position information from the (n-1)th chunk to know how to decompress, which in turn depends on the (n-2)th chunk, and so on down to the first.
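
As a small illustration of that constraint, here is a spark-shell sketch (the bucket path is a placeholder) showing where the serial bottleneck sits: the gzipped input arrives as a single partition, and repartition only takes effect after that one task has decompressed the whole file.

// Minimal sketch (spark-shell, Spark 2.x); s3://bucket/path/to/file.gz is a placeholder path.
val df = spark.read.
  option("delimiter", "\t").
  csv("s3://bucket/path/to/file.gz")

// gzip is not splittable, so the initial scan is a single partition / single task:
df.rdd.getNumPartitions        // expect 1 for a single .gz input

// repartition() only adds a shuffle *after* that one task has read and decompressed everything;
// it cannot make the read itself parallel.
val wide = df.repartition(spark.sparkContext.defaultParallelism * 3)
wide.rdd.getNumPartitions      // many partitions, but only downstream of the serial read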

If you want to parallelize, you need to make this file splittable. One way is to unzip it and process it uncompressed, or you can unzip it, split it into several files (one file for each parallel task you want), and gzip each file.
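
A rough sketch of that second option, done inside Spark itself: pay the single-threaded decompression cost once and immediately rewrite the data as many smaller gzipped part files. The paths, the tab delimiter, and the 400-way split below are placeholders, not values from the original question.

// Rough sketch: read once (serially), rewrite as many small gzip part files for later jobs.
val raw = spark.read.
  option("delimiter", "\t").
  csv("s3://bucket/path/to/file.gz")            // serial read: one task decompresses the whole file

raw.repartition(400).                           // fan out once the data has been read
  write.
  option("delimiter", "\t").
  option("compression", "gzip").                // each output part becomes its own small gzip file
  csv("s3://bucket/path/to/file-split/")

// Later jobs read the split copy with one task per part file:
val parallel = spark.read.
  option("delimiter", "\t").
  csv("s3://bucket/path/to/file-split/")

The one-off rewrite still pays the single-threaded decompression cost, but every job after it gets as many input tasks as there are part files.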
