hadoop job to split xml files

Problem description

I've got 1000's of files to process. Each file consists of 1000's of XML files concatenated together.

I'd like to use Hadoop to split each XML file separately. What would be a good way of doing this using Hadoop?

NOTES: I am a total Hadoop newbie. I plan on using Amazon EMR.

Recommended answer

Check out Mahout's XmlInputFormat. It's a shame that this is in Mahout and not in the core distribution.

Are the XML files that are concatenated at least in the same format? If so, set START_TAG_KEY and END_TAG_KEY to the root tag of your files. Each file will show up as one Text record in the map. Then you can use your favorite Java XML parser to finish the job.
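
To make that concrete, here is a minimal sketch of a map-only job wired up that way. It assumes Mahout's XmlInputFormat is on the classpath (note that its package has moved between Mahout releases), and the `<record>`/`</record>` tags are placeholders for whatever root tag your concatenated documents actually share:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Package varies by Mahout release; in older releases this class lives at
// org.apache.mahout.classifier.bayes.XmlInputFormat.
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class SplitConcatenatedXml {

    // XmlInputFormat hands each map() call one complete start-tag..end-tag
    // span as a Text value; the key is the byte offset of the match.
    public static class XmlSplitMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text xmlDoc, Context context)
                throws IOException, InterruptedException {
            // xmlDoc is one whole embedded XML document. Parse or transform
            // it here with any Java XML parser (StAX, SAX, DOM); this sketch
            // just writes it back out, one document per output record.
            context.write(NullWritable.get(), xmlDoc);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "<record>"/"</record>" are placeholders; use the actual root tag
        // shared by your concatenated documents.
        conf.set(XmlInputFormat.START_TAG_KEY, "<record>");
        conf.set(XmlInputFormat.END_TAG_KEY, "</record>");

        Job job = Job.getInstance(conf, "split concatenated xml");
        job.setJarByClass(SplitConcatenatedXml.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(XmlSplitMapper.class);
        job.setNumReduceTasks(0); // map-only: just split and write out
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Running it with zero reducers means each extracted document is written straight to the output, which is all that splitting requires; on EMR you can run the resulting jar as a custom JAR step.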
