Apache Spark: batch processing of files


Problem description

I have a directory/subdirectory layout set up on HDFS, and I'd like to preprocess all the files before loading them into memory at once. I basically have big files (1 MB) that, once processed, will be more like 1 KB, after which I'd do sc.wholeTextFiles to get started with my analysis.

How do I loop over each file (*.xml) in my directories/subdirectories, perform an operation (let's say, for the example's sake, keep the first line), and then dump the result back to HDFS (as a new file, say with a .xmlr extension)?

Recommended answer

I'd recommend you simply use sc.wholeTextFiles and preprocess the files with transformations, then save all of them back as a single compressed sequence file (you can refer to my guide on doing so: http://0x0fff.com/spark-hdfs-integration/).
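Here is a minimal sketch of that first approach in Scala, assuming hypothetical paths (hdfs:///data/input, hdfs:///data/output) and using the question's "keep the first line" operation as the preprocessing step:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.GzipCodec

object PreprocessXmlFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("preprocess-xml"))

    // wholeTextFiles yields (path, fullContent) pairs; this glob matches
    // *.xml one subdirectory level down -- adjust it for deeper trees.
    val files = sc.wholeTextFiles("hdfs:///data/input/*/*.xml")

    // The preprocessing step from the question: keep only the first line.
    val firstLines = files.mapValues(content => content.takeWhile(_ != '\n'))

    // Persist everything as a compressed sequence file, keyed by the
    // original file path.
    firstLines.saveAsSequenceFile("hdfs:///data/output", Some(classOf[GzipCodec]))

    sc.stop()
  }
}

Note that saveAsSequenceFile writes one part file per partition; if you want literally a single output file you can coalesce(1) before saving, at the cost of write parallelism.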

Another option might be to write a MapReduce job that processes a whole file at a time and saves the results to a sequence file, as proposed above: https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/SmallFilesToSequenceFileConverter.java. It is the example described in the book 'Hadoop: The Definitive Guide'; take a look at it.

In both cases you would be doing almost the same thing: both Spark and Hadoop bring up a single process (a Spark task or a Hadoop mapper) to process each file, so in general both approaches work with the same logic. I'd recommend starting with the Spark one, since it is simpler to implement given that you already have a cluster running Spark.
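To then pick the analysis back up from the saved output, a short sketch (again assuming the hypothetical hdfs:///data/output path from above) can read the sequence file back into the same (path, content) shape that sc.wholeTextFiles produces:

// Text keys and values are converted back to Scala Strings by
// Spark's implicit WritableConverters.
val preprocessed = sc.sequenceFile[String, String]("hdfs:///data/output")

preprocessed.take(5).foreach { case (path, line) =>
  println(s"$path -> $line")
}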
