Dealing with a large gzipped file in Spark

Problem description

I have a large (about 85 GB compressed) gzipped file from s3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances each with a 100 GB EBS volume). I am aware that gzip is a non-splittable file format, and I've seen it suggested that one should repartition the compressed file because Spark initially gives an RDD with one partition. However, after doing

scala> val raw = spark.read.format("com.databricks.spark.csv").
     | options(Map("delimiter" -> "\\t", "codec" -> "org.apache.hadoop.io.compress.GzipCodec")).
     | load("s3://path/to/file.gz").
     | repartition(sc.defaultParallelism * 3)
raw: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: string, _c1: string ... 48 more fields]
scala> raw.count()

and taking a look at the Spark application UI, I still see only one active executor (the other 14 are dead) with one task, and the job never finishes (or at least I've not waited long enough for it to).

  • What is going on here? Can someone help me understand how Spark works in this example?
  • Should I be using a different cluster configuration?
  • Unfortunately, I have no control over the compression format, but is there another way of dealing with such a file?

Recommended answer

If the file format is not splittable, then there's no way to avoid reading the file in its entirety on one core. In order to parallelize work, you have to know how to assign chunks of work to different computers. In the gzip case, suppose you divide it up into 128 MB chunks. The nth chunk depends on the (n-1)th chunk's position information to know how to decompress, which in turn depends on the (n-2)th chunk, and so on down to the first.

If you want to parallelize, you need to make this file splittable. One way is to unzip it and process it uncompressed; alternatively, you can unzip it, split it into several files (one for each parallel task you want), and gzip each of them, as sketched below.
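
As a rough sketch of the second option, not taken from the original answer: one could pay the single-core read once in Spark itself, then rewrite the data as many independently gzipped part files so that every later job reads them in parallel. The output path and the partition multiplier below are made-up examples, and the snippet assumes Spark 2.x's built-in CSV source and the spark-shell session from the question (spark, sc).

// Minimal sketch (assumptions noted above): read the single .gz file once...
val raw = spark.read
  .option("delimiter", "\t")
  .csv("s3://path/to/file.gz")            // still one partition: gzip is not splittable

// ...then shuffle it out into many separately gzipped part files.
raw
  .repartition(sc.defaultParallelism * 3) // the single-core read and this shuffle happen once
  .write
  .option("compression", "gzip")          // each part file is compressed independently
  .option("delimiter", "\t")
  .csv("s3://path/to/split/")             // hypothetical output location

// Later jobs read the split copy instead, getting one partition per part file:
val parallel = spark.read.option("delimiter", "\t").csv("s3://path/to/split/")

The first pass is still bound to one core because of the gzip format; the point is to pay that cost a single time so that everything downstream can run across all executors.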
