Dealing with a large gzipped file in Spark

Problem description

I have a large (about 85 GB compressed) gzipped file from s3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances each with a 100 GB EBS volume). I am aware that gzip is a non-splittable file format, and I've seen it suggested that one should repartition the compressed file because Spark initially gives an RDD with one partition. However, after doing

scala> val raw = spark.read.format("com.databricks.spark.csv").
     | options(Map("delimiter" -> "\t", "codec" -> "org.apache.hadoop.io.compress.GzipCodec")).
     | load("s3://path/to/file.gz").
     | repartition(sc.defaultParallelism * 3)
raw: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: string, _c1: string ... 48 more fields]
scala> raw.count()

and taking a look at the Spark application UI, I still see only one active executor (the other 14 are dead) with one task, and the job never finishes (or at least I've not waited long enough for it to).

  • What is going on here? Can someone help me understand how Spark is working in this example?
  • Should I be using a different cluster configuration?
  • Unfortunately, I have no control over the compression format, but is there an alternative way of dealing with such a file?

Recommended answer

If the file format is not splittable, then there's no way to avoid reading the file in its entirety on one core. In order to parallelize work, you have to know how to assign chunks of work to different computers. In the gzip case, suppose you divide it up into 128 MB chunks. The nth chunk needs position information from the (n-1)th chunk to know how to decompress, which in turn depends on the (n-2)th chunk, and so on down to the first.
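
As a small illustration of that constraint, here is a spark-shell sketch (the bucket path is a placeholder) showing where the serial bottleneck sits: the gzipped input arrives as a single partition, and repartition only takes effect after that one task has decompressed the whole file.

// Minimal sketch (spark-shell, Spark 2.x); s3://bucket/path/to/file.gz is a placeholder path.
val df = spark.read.
  option("delimiter", "\t").
  csv("s3://bucket/path/to/file.gz")

// gzip is not splittable, so the initial scan is a single partition / single task:
df.rdd.getNumPartitions        // expect 1 for a single .gz input

// repartition() only adds a shuffle *after* that one task has read and decompressed everything;
// it cannot make the read itself parallel.
val wide = df.repartition(spark.sparkContext.defaultParallelism * 3)
wide.rdd.getNumPartitions      // many partitions, but only downstream of the serial read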

If you want to parallelize, you need to make this file splittable. One way is to unzip it and process it uncompressed, or you can unzip it, split it into several files (one file for each parallel task you want), and gzip each file.
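
A rough sketch of that second option, done inside Spark itself: pay the single-threaded decompression cost once and immediately rewrite the data as many smaller gzipped part files. The paths, the tab delimiter, and the 400-way split below are placeholders, not values from the original question.

// Rough sketch: read once (serially), rewrite as many small gzip part files for later jobs.
val raw = spark.read.
  option("delimiter", "\t").
  csv("s3://bucket/path/to/file.gz")            // serial read: one task decompresses the whole file

raw.repartition(400).                           // fan out once the data has been read
  write.
  option("delimiter", "\t").
  option("compression", "gzip").                // each output part becomes its own small gzip file
  csv("s3://bucket/path/to/file-split/")

// Later jobs read the split copy with one task per part file:
val parallel = spark.read.
  option("delimiter", "\t").
  csv("s3://bucket/path/to/file-split/")

The one-off rewrite still pays the single-threaded decompression cost, but every job after it gets as many input tasks as there are part files.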
