MapReduce XML input format: building a custom input format
Question
If the input files are in XML format, I shouldn't use TextInputFormat, because TextInputFormat assumes that each line of the input file is one record, and the Mapper is called once per line with the key-value pair for that line.
So I think we need a custom input format to scan XML datasets.
I am new to Hadoop MapReduce. Is there an article, link, or video that shows the steps to build a custom input format?
Thanks
Recommended answer
Problem: Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. So how do we work with a file format like XML that is not inherently splittable?
Solution: MapReduce has no built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system that provides an XML InputFormat.
So there is no need to write a custom input format, since the Mahout library already provides one. I am not sure whether you are going to read or write XML, but both are described in the link above.
Please take a look. Also, XmlInputFormat extends TextInputFormat.
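To make the splitting problem concrete, here is a minimal sketch of the general idea behind a tag-delimited XML reader. It is not Mahout's actual code, and it uses no Hadoop classes. The class XmlRecordScanner and its methods are hypothetical names made up for this illustration. The assumption is that a record belongs to the split in which its start tag begins, and that the reader may keep reading past the end of the split until it finds the matching end tag, so every record is emitted exactly once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Hadoop or Mahout. */
final class XmlRecordScanner {

    /** Returns the index of the first occurrence of pattern at or after from, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i + pattern.length <= data.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns every record whose start tag begins inside [splitStart, splitEnd).
     * A record may extend past splitEnd; scanning continues until its end tag.
     */
    static List<String> readSplit(byte[] data, int splitStart, int splitEnd,
                                  String startTag, String endTag) {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int recordStart = indexOf(data, start, pos);
            if (recordStart < 0 || recordStart >= splitEnd) {
                break; // the next record, if any, belongs to a later split
            }
            int endTagStart = indexOf(data, end, recordStart + start.length);
            if (endTagStart < 0) {
                break; // truncated record: no end tag in the data
            }
            int recordEnd = endTagStart + end.length;
            records.add(new String(data, recordStart, recordEnd - recordStart,
                                   StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<people><person><name>Ann</name></person>"
                   + "<person><name>Bob</name></person></people>";
        byte[] data = xml.getBytes(StandardCharsets.UTF_8);
        int boundary = 20; // arbitrary split point inside the first record
        System.out.println(readSplit(data, 0, boundary, "<person>", "</person>"));
        System.out.println(readSplit(data, boundary, data.length, "<person>", "</person>"));
    }
}
```

In this example the split boundary falls in the middle of the first person element, yet the first split still returns that record whole and the second split returns only the second record. The sketch matches tags as literal byte sequences, so it does not handle start tags with attributes or nested elements of the same name. A real input format would also have to read from a stream rather than an in-memory array.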