How to use Hadoop InputFormats in Apache Spark?
Question
I have a class ImageInputFormat in Hadoop which reads images from HDFS. How can I use my InputFormat in Spark? Here is my ImageInputFormat:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ImageInputFormat extends FileInputFormat<Text, ImageWritable> {

    @Override
    public ImageRecordReader createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new ImageRecordReader();
    }

    // Images are binary blobs, so never split a file across records.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
The SparkContext has a method called hadoopFile. It accepts classes implementing the interface org.apache.hadoop.mapred.InputFormat, and its description says "Get an RDD for a Hadoop file with an arbitrary InputFormat". Note, however, that hadoopFile is for the old mapred API; the ImageInputFormat above extends the new mapreduce FileInputFormat, so the matching method is newAPIHadoopFile, which accepts an org.apache.hadoop.mapreduce.InputFormat. Also have a look at the Hadoop Datasets section of the Spark documentation.
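A minimal sketch of the driver code, in Java to match the InputFormat above. The class name ImageLoader and the path hdfs:///images are hypothetical; ImageInputFormat and ImageWritable are the classes from the question and are assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ImageLoader {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("ImageLoader"));

        // ImageInputFormat extends the new (mapreduce) API, so use
        // newAPIHadoopFile; hadoopFile expects the old mapred API.
        JavaPairRDD<Text, ImageWritable> images = sc.newAPIHadoopFile(
                "hdfs:///images",      // hypothetical input path
                ImageInputFormat.class,
                Text.class,
                ImageWritable.class,
                new Configuration());

        System.out.println("Loaded " + images.count() + " images");
        sc.stop();
    }
}

If the driver is written in Scala instead, the equivalent call is sc.newAPIHadoopFile[Text, ImageWritable, ImageInputFormat]("hdfs:///images").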