MapReduce XML input format: building a custom format

Published: 2020/5/5 15:50:44. Tags: hadoop, xml-parsing, mapreduce

Question

If the input files are in XML format, I shouldn't use TextInputFormat, because TextInputFormat assumes that each line of the input file is one record, and the Mapper is called once per line with the key/value pair for that line.

So I think we need a custom input format to scan XML datasets.

I am new to Hadoop MapReduce. Is there an article, link, or video that shows the steps to build a custom input format?

Thanks, Nat

Recommended answer

Problem: Processing a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. So how do we work with a file format such as XML that is not inherently splittable?

Solution: MapReduce doesn't contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.

So there is no need to write a custom input format, since the Mahout library already provides one. I am not sure whether you are going to read or write XML, but both are described in the link above.

Also note that XmlInputFormat extends TextInputFormat.
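
To make the splitting problem concrete, here is a minimal sketch of the tag-scanning idea that an XML input format can use in place of a synchronization marker. It uses only the Java standard library and is not Mahout's or Hadoop's actual code: the class `XmlRecordScanner`, its methods, and the sample `<item>` data are all hypothetical names chosen for illustration. The assumption is that each record is delimited by a fixed start tag and end tag. A reader for one split skips forward to the first start tag inside its byte range and reads each record through to its end tag, even when that end tag lies beyond the split boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Hadoop or Mahout. */
final class XmlRecordScanner {

    /** Index of the first occurrence of pattern in data at or after from, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length && match; j++) {
                match = data[i + j] == pattern[j];
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns every record whose start tag begins inside [splitStart, splitEnd).
     * A record that begins inside the split is read up to its end tag even if
     * that end tag lies past splitEnd, so adjacent splits never share a record.
     */
    static List<String> readRecords(byte[] data, int splitStart, int splitEnd,
                                    String startTag, String endTag) {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int recordStart = indexOf(data, start, pos);
            if (recordStart < 0 || recordStart >= splitEnd) {
                break; // the next record, if any, belongs to a later split
            }
            int endTagAt = indexOf(data, end, recordStart + start.length);
            if (endTagAt < 0) {
                break; // unterminated record: stop rather than guess
            }
            int recordEnd = endTagAt + end.length;
            records.add(new String(data, recordStart, recordEnd - recordStart,
                    StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = ("<items>\n<item>\n  <id>1</id>\n</item>\n"
                + "<item>\n  <id>2</id>\n</item>\n</items>\n")
                .getBytes(StandardCharsets.UTF_8);
        int boundary = 20; // deliberately falls in the middle of the first record
        System.out.println(readRecords(data, 0, boundary, "<item>", "</item>"));
        System.out.println(readRecords(data, boundary, data.length, "<item>", "</item>"));
    }
}
```

In a real job this scanning is done by the input format's record reader. Mahout's XmlInputFormat, in commonly circulated versions, takes the two tags from the `xmlinput.start` and `xmlinput.end` configuration keys; check the version you use before relying on those names.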
