Difference between MapReduce split and Spark partition


Question

I wanted to ask: is there any significant difference in data partitioning when working with Hadoop/MapReduce and Spark? They both work on HDFS (TextInputFormat), so in theory it should be the same.

Are there any cases where the procedure of data partitioning can differ? Any insights would be very helpful to my study.

Thanks

Answer

Is there any significant difference in data partitioning when working with Hadoop/MapReduce and Spark?

Spark supports all Hadoop I/O formats, as it uses the same Hadoop InputFormat APIs along with its own formatters. So by default, Spark input partitions work the same way as Hadoop/MapReduce input splits. The data size in a partition is configurable at run time, and Spark provides transformations like repartition, coalesce, and repartitionAndSortWithinPartitions that give you direct control over the number of partitions being computed.
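As a rough sketch (plain Python, not Spark's actual source), Hadoop's FileInputFormat derives the split size from the HDFS block size, bounded by the configured minimum and maximum split sizes; assuming that formula, you can estimate how many input splits (and hence how many default Spark partitions) a file produces:

```python
import math

def compute_split_size(block_size, min_size=1, max_size=float("inf")):
    # Hadoop FileInputFormat rule: splitSize = max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def estimate_num_splits(file_size, block_size, min_size=1, max_size=float("inf")):
    # One split per splitSize-worth of file data (ignoring the small last-split slack).
    split_size = compute_split_size(block_size, min_size, max_size)
    return math.ceil(file_size / split_size)

# A 1 GiB file on 128 MiB HDFS blocks -> 8 input splits,
# i.e. 8 default partitions when Spark reads it via TextInputFormat.
print(estimate_num_splits(1024 ** 3, 128 * 1024 * 1024))  # 8
```

Raising the minimum split size (e.g. mapreduce.input.fileinputformat.split.minsize) merges blocks into fewer, larger splits; lowering the maximum produces more, smaller ones.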

Are there any cases where their procedure of data partitioning can differ?

Apart from the Hadoop I/O APIs, Spark does have some other intelligent I/O formats (e.g. the Databricks CSV reader and NoSQL DB connectors) which directly return a Dataset/DataFrame (higher-level abstractions on top of RDDs) that are Spark-specific.

Key points about Spark partitioning when reading data from non-Hadoop sources:

  • The maximum size of a partition is ultimately determined by the connector:
    • for S3, the property is like fs.s3n.block.size or fs.s3.block.size;
    • for Cassandra, it is spark.cassandra.input.split.size_in_mb;
    • for Mongo, it is spark.mongodb.input.partitionerOptions.partitionSizeMB.
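The sizes above are illustrated with a minimal sketch below: given a connector's maximum-partition-size setting, the partition count follows from dividing the input size by it. The property names are the real ones listed above, but the values and the helper function are hypothetical, for illustration only:

```python
import math

# Illustrative settings (values are assumptions, not defaults);
# S3 is in bytes, Cassandra and Mongo are in megabytes.
connector_max_partition = {
    "fs.s3.block.size": 64 * 1024 * 1024,
    "spark.cassandra.input.split.size_in_mb": 64,
    "spark.mongodb.input.partitionerOptions.partitionSizeMB": 64,
}

def partitions_for(data_size_bytes, max_partition_bytes):
    # Each partition holds at most max_partition_bytes of input data.
    return math.ceil(data_size_bytes / max_partition_bytes)

# 1 GiB of Cassandra data with a 64 MiB split size -> 16 partitions.
split_mb = connector_max_partition["spark.cassandra.input.split.size_in_mb"]
print(partitions_for(1024 ** 3, split_mb * 1024 * 1024))  # 16
```

The takeaway: unlike HDFS sources, where the block size drives the default partitioning, each connector exposes its own knob, so the same dataset can yield a different partition count depending on which source it is read from.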

  Read more: link1

