Apache Spark: batch processing of files


Problem description

I have a directory/subdirectory layout set up on HDFS, and I'd like to preprocess all the files before loading them into memory at once. I basically have big files (1 MB) that, once processed, will be more like 1 KB, after which I'd do sc.wholeTextFiles to get started with my analysis.

How do I loop over each file (*.xml) in my directories/subdirectories, perform an operation (let's say, for the example's sake, keep the first line), and then dump the result back to HDFS (as a new file, say with a .xmlr extension)?

Recommended answer

I'd recommend you simply use sc.wholeTextFiles and preprocess the files with transformations, then save all of them back as a single compressed sequence file (you can refer to my guide on doing so: http://0x0fff.com/spark-hdfs-integration/).
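Here is a minimal sketch of that first approach in Scala, assuming hypothetical paths (hdfs:///data/input, hdfs:///data/output) and using the question's "keep the first line" operation as the preprocessing step:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.GzipCodec

object PreprocessXmlFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("preprocess-xml"))

    // wholeTextFiles yields (path, fullContent) pairs; this glob matches
    // *.xml one subdirectory level down -- adjust it for deeper trees.
    val files = sc.wholeTextFiles("hdfs:///data/input/*/*.xml")

    // The preprocessing step from the question: keep only the first line.
    val firstLines = files.mapValues(content => content.takeWhile(_ != '\n'))

    // Persist everything as a compressed sequence file, keyed by the
    // original file path.
    firstLines.saveAsSequenceFile("hdfs:///data/output", Some(classOf[GzipCodec]))

    sc.stop()
  }
}

Note that saveAsSequenceFile writes one part file per partition; if you want literally a single output file you can coalesce(1) before saving, at the cost of write parallelism.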

Another option might be to write a MapReduce job that processes a whole file at a time and saves the results to a sequence file, as proposed above: https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/SmallFilesToSequenceFileConverter.java. It is the example described in the book 'Hadoop: The Definitive Guide'; take a look at it.

In both cases you would be doing almost the same thing: both Spark and Hadoop bring up a single process (a Spark task or a Hadoop mapper) to process each file, so in general both approaches work with the same logic. I'd recommend starting with the Spark one, since it is simpler to implement given that you already have a cluster running Spark.
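To then pick the analysis back up from the saved output, a short sketch (again assuming the hypothetical hdfs:///data/output path from above) can read the sequence file back into the same (path, content) shape that sc.wholeTextFiles produces:

// Text keys and values are converted back to Scala Strings by
// Spark's implicit WritableConverters.
val preprocessed = sc.sequenceFile[String, String]("hdfs:///data/output")

preprocessed.take(5).foreach { case (path, line) =>
  println(s"$path -> $line")
}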
