MapReduce XML input format: building a custom format

Question

If the input files are in XML format, I shouldn't use TextInputFormat, because TextInputFormat assumes that each line of the input file is one record, and the Mapper is called once per line with the key/value pair for that line.
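To make the concern concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of what line-oriented record splitting does to a multi-line XML record. The class name `LineSplitDemo`, the method `lineRecords`, and the sample XML are hypothetical and only for illustration; the method mimics the one-line-per-record assumption rather than calling TextInputFormat itself.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; the XML sample and all names are illustrative only.
class LineSplitDemo {
    // Mimics a line-oriented input format: every line becomes one record.
    static List<String> lineRecords(String input) {
        return Arrays.asList(input.split("\n"));
    }

    public static void main(String[] args) {
        String xml = "<record>\n  <id>1</id>\n</record>\n"
                   + "<record>\n  <id>2</id>\n</record>\n";
        // Each call to map() would see only one of these fragments,
        // never a complete <record>...</record> element.
        for (String line : lineRecords(xml)) {
            System.out.println("map() input: " + line);
        }
    }
}
```

Each logical record here spans three lines, so a per-line mapper would receive fragments such as `<record>` or `  <id>1</id>` instead of a whole element.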

So I think we need a custom input format to scan the XML datasets.

I'm new to Hadoop MapReduce. Is there an article, link, or video that shows the steps for building a custom input format?

Thanks, 娜特

Recommended answer

Problem: Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. So how do we work with a file format like XML that isn't inherently splittable?

Solution: MapReduce doesn't contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.
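The general idea behind a tag-delimited XML input format can be sketched without any Hadoop or Mahout dependency. The code below is not Mahout's implementation; it is a simplified illustration under stated assumptions. Each split emits only the records whose start tag begins inside its byte range, and it reads past the end of the split to finish the last record. The class `XmlSplitScanner`, its methods, and the sample data are hypothetical. As far as I know, Mahout's XmlInputFormat is configured in a similar spirit with literal start and end tag strings (the `xmlinput.start` and `xmlinput.end` job properties), but check the version you use.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Mahout's code: a tag-delimited record scanner.
class XmlSplitScanner {
    // Index of the first occurrence of pattern in data at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    // Returns the records whose start tag begins in [splitStart, splitEnd).
    // The last record may extend past splitEnd; scanning continues to its end tag.
    static List<String> readSplit(byte[] data, int splitStart, int splitEnd,
                                  String startTag, String endTag) {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int s = indexOf(data, start, pos);
            if (s < 0 || s >= splitEnd) break;   // next record belongs to a later split
            int e = indexOf(data, end, s + start.length);
            if (e < 0) break;                    // unterminated record: stop scanning
            int recordEnd = e + end.length;
            records.add(new String(data, s, recordEnd - s, StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "<root><rec>a</rec><rec>b</rec><rec>c</rec></root>"
                .getBytes(StandardCharsets.UTF_8);
        int mid = data.length / 2;  // an arbitrary split point that falls inside a record
        System.out.println(readSplit(data, 0, mid, "<rec>", "</rec>"));
        System.out.println(readSplit(data, mid, data.length, "<rec>", "</rec>"));
    }
}
```

Because ownership of a record is decided by where its start tag begins, two adjacent splits never emit the same record twice, even though the split boundary falls in the middle of one. This sketch matches tags as literal byte sequences, so it does not handle start tags with attributes or nested elements of the same name.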

So there is no need to write a custom input format, since the Mahout library already provides one. I'm not sure whether you are going to read or write XML, but both are described in the link above.

Also note that XmlInputFormat extends TextInputFormat.
