How does Spark read a large file (petabytes) when the file cannot fit in Spark's main memory?

Problem Description

What will happen for large files in these cases?

1) Spark gets the data's location from the NameNode. Will Spark stop at that point because the data size is too large, according to the information from the NameNode?

2) Spark partitions the data according to the datanode block size, but all of the data cannot be stored in main memory. Here we are not using StorageLevel. So what will happen here?

3) Spark partitions the data; some of it will be stored in main memory, and once the data in main memory has been processed, Spark will load the remaining data from disk.

Recommended Answer

First of all, Spark only starts reading in the data when an action (like count, collect or write) is called. Once an action is called, Spark loads the data in partitions - the number of concurrently loaded partitions depends on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task. Note that all concurrently loaded partitions have to fit into memory, or you will get an OOM.
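
As a concrete illustration of this lazy-read behaviour, here is a minimal PySpark sketch; the session name and the HDFS path are hypothetical placeholders, not part of the original question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-read-demo").getOrCreate()

    # Nothing is read yet: this only records the source in the logical plan.
    df = spark.read.text("hdfs:///data/huge_dataset")

    # The read happens here, when an action is called; partitions are loaded
    # concurrently, roughly one per available core, and each becomes one task.
    print(df.count())
    print("partitions:", df.rdd.getNumPartitions())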

Assuming that you have several stages, Spark then runs the transformations from the first stage on the loaded partitions only. Once it has applied the transformations on the data in the loaded partitions, it stores the output as shuffle-data and then reads in more partitions. It then applies the transformations on these partitions, stores the output as shuffle-data, reads in more partitions and so forth until all data has been read.
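
A sketch of that stage-by-stage processing: the per-partition transformations below run as partitions are read, the reduceByKey introduces a shuffle boundary, and the shuffle output feeds the next stage. The paths are the same hypothetical placeholders as in the previous sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.textFile("hdfs:///data/huge_dataset")

    word_counts = (rdd.flatMap(lambda line: line.split())   # stage 1: applied per partition as it is read
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b))      # shuffle boundary -> stage 2

    # The action triggers the whole pipeline; intermediate shuffle data is
    # written by stage 1 and read back by stage 2.
    word_counts.saveAsTextFile("hdfs:///data/word_counts_out")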

If you apply no transformation but only do, for instance, a count, Spark will still read in the data in partitions, but it will not store any data in your cluster, and if you do the count again it will read in all the data once again. To avoid reading in data several times, you can call cache or persist, in which case Spark will try to store the data in your cluster. On cache (which is the same as persist(StorageLevel.MEMORY_ONLY)) it will store all partitions in memory - if it doesn't fit in memory you will get an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK) it will store as much as it can in memory and put the rest on disk. If the data doesn't fit on disk either, the OS will usually kill your workers.
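
A minimal sketch of the cache/persist behaviour just described, using the RDD API (the path is again a hypothetical placeholder):

    from pyspark.sql import SparkSession
    from pyspark.storagelevel import StorageLevel

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.textFile("hdfs:///data/huge_dataset")

    # cache() is equivalent to persist(StorageLevel.MEMORY_ONLY) for RDDs;
    # MEMORY_AND_DISK spills partitions that don't fit in memory to disk.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    rdd.count()   # first action reads the data and stores the partitions
    rdd.count()   # second action reuses the stored partitions instead of re-reading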

Note that Spark has its own little memory management system. Some of the memory that you assign to your Spark job is used to hold the data being worked on, and some of the memory is used for storage if you call cache or persist.
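
For reference, these are the knobs that control that split in recent Spark versions; the setting names are real Spark configuration keys, but the values below are purely illustrative:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("memory-config-demo")
             .config("spark.executor.memory", "8g")           # total heap per executor
             .config("spark.memory.fraction", "0.6")          # share of heap for execution + storage
             .config("spark.memory.storageFraction", "0.5")   # part of that pool reserved for cached data
             .getOrCreate())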

I hope this explanation helps :)
