What is RDD in Spark?
Problem Description
The definition says:

RDD is an immutable distributed collection of objects.

I don't quite understand what this means. Is it like data (partitioned objects) stored on a hard disk? If so, how can an RDD hold user-defined classes (written in Java, Scala, or Python)?
From this link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch03.html it mentions:
Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in their driver program.
I am really confused about RDDs in general, and about how they relate to Spark and Hadoop. Can someone please help?
An RDD is, essentially, Spark's representation of a set of data, spread across multiple machines, with an API that lets you act on it. An RDD can come from any data source, e.g. text files, a database via JDBC, etc.
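Both ways of creating an RDD that the Learning Spark quote mentions can be sketched in Scala. This is a minimal sketch, not code from the original post; the application name, the `local[*]` master, and the input path `data.txt` are assumptions for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationDemo {
  def main(args: Array[String]): Unit = {
    // Run locally using all cores; on a real cluster the master URL
    // would point at YARN, Mesos, or a standalone Spark master instead.
    val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Way 1: distribute an in-memory collection from the driver program.
    val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // Way 2: load an external dataset (the path here is hypothetical).
    val fromFile = sc.textFile("data.txt")

    // An RDD holds objects, not raw disk blocks: transformations run
    // your (user-defined) code on each element, in parallel.
    val doubled = fromCollection.map(_ * 2)
    println(doubled.collect().mkString(", ")) // prints: 2, 4, 6, 8, 10

    sc.stop()
  }
}
```

This also answers the "user-defined classes" part of the question: the elements of an RDD are ordinary JVM (or Python) objects, so `parallelize` works just as well on a `Seq` of your own case classes.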
The formal definition is:
RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
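Each clause of that definition corresponds to a concrete call in the RDD API. A hedged Scala sketch, assuming an existing `SparkContext` named `sc` (the word list here is made up for illustration):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Assumption: `sc` is an already-created SparkContext.
val words = sc.parallelize(Seq("spark", "rdd", "spark", "hadoop"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// "explicitly persist intermediate results in memory"
counts.persist(StorageLevel.MEMORY_ONLY)

// "control their partitioning to optimize data placement"
val repartitioned = counts.partitionBy(new HashPartitioner(4))

// "manipulate them using a rich set of operators"
val frequent = repartitioned.filter { case (_, n) => n > 1 }

// Fault tolerance comes from lineage: if a partition is lost, Spark
// recomputes it from the chain of transformations above rather than
// relying on data replication.
println(frequent.collect().toSeq) // ("spark", 2) is the only pair with n > 1
```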
If you want the full details on what an RDD is, read one of the core Spark academic papers, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.