What does in-memory data storage mean in the context of Apache Spark?

Question

I have read that Apache Spark stores data in-memory. However, Apache Spark is meant for analyzing huge volumes of data (a.k.a. big data analytics). In this context, what does in-memory data storage really mean? Is the data that it can store limited by the RAM available? How does its data storage compare with that of Apache Hadoop, which uses HDFS?

Answer

In Hadoop, data is persisted to disk between steps, so a typical multi-step job ends up looking something like this:

hdfs -> read & map -> persist -> read & reduce -> hdfs -> read & map -> persist -> read & reduce -> hdfs

This is a brilliant design, and it makes perfect sense when you're batch-processing files that fit the map-reduce pattern well. But for some workloads it can be extremely slow; iterative algorithms are hit especially hard. You've spent time building some data structure (a graph, for instance), and all you want to do in each step is update a score. Persisting the entire graph to disk and reading it back in every step will slow your job down.

Spark uses a more general engine that supports cyclic data flows, and it will try to keep things in memory between job steps. What this means is that if you can come up with a data structure and a partitioning strategy such that your data doesn't get shuffled around between steps, you can update it efficiently without serialising everything and writing it to disk in between. That's why Spark's front page has a chart showing a 100x speedup on logistic regression.
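
As a rough illustration, here is a minimal Scala sketch of such an iterative job; the input path, the score-update rule, and all names are made up for the example. The point is that the edge RDD is built once, cached, and then reused across iterations instead of being re-read from HDFS every time:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch"))

    // Build the (src, dst) edge structure once and ask Spark to keep it in memory.
    val edges = sc.textFile("hdfs:///data/edges.txt")                 // hypothetical input path
      .map { line => val Array(src, dst) = line.split("\\s+"); (src, dst) }
      .cache()

    // Give every node an initial score of 1.0 (purely illustrative).
    var scores = edges.keys.distinct().map(node => (node, 1.0))

    for (_ <- 1 to 10) {
      // Each iteration reuses the cached edges; only the scores are recomputed.
      scores = edges.join(scores)
        .map { case (_, (dst, score)) => (dst, score * 0.85) }        // made-up update rule
        .reduceByKey(_ + _)
    }

    scores.saveAsTextFile("hdfs:///data/scores")                      // hypothetical output path
    sc.stop()
  }
}
```

With `.cache()` the edges stay in executor memory after the first pass; without it, every iteration would go back to the HDFS read.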

If you write a Spark job that just computes a value from each input line in your dataset and writes it back to disk, Hadoop and Spark will be pretty much equal in terms of performance (start-up time is faster in Spark, but that hardly matters when you spend hours processing data in a single step).
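
For contrast, a single-pass job looks something like the sketch below (typed into spark-shell, where `sc` is predefined; the paths are hypothetical). Nothing is reused, so keeping data in memory buys you little here:

```scala
// One pass over the data: read, derive one value per line, write back out.
sc.textFile("hdfs:///data/input.txt")
  .map(line => s"${line.length}\t$line")
  .saveAsTextFile("hdfs:///data/line-lengths")
```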

If Spark cannot hold an RDD in memory between steps, it will spill it to disk, much like Hadoop does. But remember that Spark isn't a silver bullet, and there will be corner cases where you have to fight Spark's in-memory nature causing OutOfMemory problems, where Hadoop would simply have written everything to disk.
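
If you want to be explicit about that behaviour, the RDD API lets you pick a storage level yourself; a minimal sketch (again in spark-shell, with a hypothetical path):

```scala
import org.apache.spark.storage.StorageLevel

// Keep as many partitions in memory as fit, and spill the rest to local disk
// instead of dropping them and recomputing later.
val big = sc.textFile("hdfs:///data/huge-input.txt")
  .map(line => (line.take(8), line))          // illustrative key/value split
  .persist(StorageLevel.MEMORY_AND_DISK)
```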

I personally like to think of it this way: in your cluster of 500 machines with 64 GB of RAM each, Hadoop was created to batch-process your 500 TB job faster by distributing the disk reads and writes. Spark exploits the fact that 500 * 64 GB = 32 TB worth of memory can likely solve quite a few of your other problems entirely in memory!
