How to split the input file in Apache Spark

Problem description

Suppose I have an input file of size 100MB. It contains a large number of points (lat-long pairs) in CSV format. What should I do in order to split the input file into ten 10MB files in Apache Spark, or how do I customize the split?

Note: I want to process a subset of the points in each mapper.

Recommended answer

Spark's abstraction doesn't provide an explicit split of the data. However, you can control the parallelism in several ways.

Assuming you use YARN, an HDFS file is automatically split into HDFS blocks, and those blocks are processed concurrently when a Spark action runs.
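You can also ask textFile for a minimum number of partitions when reading, which roughly corresponds to splitting the 100MB input into ten ~10MB pieces. A minimal sketch in Scala (the HDFS path and application name are assumptions, not from the question):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("split-input"))

    // The second argument of textFile is a *minimum* number of partitions;
    // each partition becomes its own task, i.e. its own "mapper".
    val lines = sc.textFile("hdfs:///data/points.csv", minPartitions = 10)

    // If you need an exact partition count, repartition after loading.
    val tenWays = lines.repartition(10)
    println(s"partitions: ${tenWays.partitions.length}")

A partition plays the role of a mapper's input split here, so ten partitions gives you roughly the ten subsets asked about.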

Apart from HDFS parallelism, consider using a partitioner with a PairRDD. A PairRDD is an RDD of key-value pairs, and a partitioner manages the mapping from a key to a partition. The default partitioner reads spark.default.parallelism. The partitioner helps control the distribution of the data as well as its locality in PairRDD-specific operations, e.g., reduceByKey.
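Continuing the sketch above (the choice of key, an integer latitude band, is purely illustrative), this is how a partitioner and an explicit partition count can be applied to a PairRDD:

    import org.apache.spark.HashPartitioner

    // Parse "lat,lon" lines into a PairRDD keyed by an integer latitude band.
    val points = lines.map { line =>
      val Array(lat, lon) = line.split(",").map(_.trim.toDouble)
      (lat.toInt, (lat, lon))
    }

    // Spread the keys over 10 partitions; each partition is one task's subset of points.
    val partitioned = points.partitionBy(new HashPartitioner(10))

    // PairRDD operations such as reduceByKey also take a partition count directly.
    val pointsPerBand = points.mapValues(_ => 1L).reduceByKey(_ + _, numPartitions = 10)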

Take a look at the following documentation about Spark data parallelism.

http://spark.apache.org/docs/1.2.0/tuning.html
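
For completeness, spark.default.parallelism itself can be set either on the SparkConf or when submitting the job. A sketch only; the application and jar names below are assumptions:

    // Programmatically, before creating the SparkContext:
    val conf = new SparkConf()
      .setAppName("split-input")
      .set("spark.default.parallelism", "10")

    // Or on the command line:
    //   spark-submit --conf spark.default.parallelism=10 --class SplitInput split-input.jar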
