Parsing millions of small XML files
Question
I have 10 million small XML files (300KB-500KB). I'm using Mahout's XML input format in MapReduce to read the data, and a SAX parser to parse it, but processing is very slow. Will using compression (LZO) of the input files help increase performance? Each folder contains 80-90k XML files, and when I start the process it runs a mapper for each file. Is there any way to reduce the number of mappers?

Answer

You can follow one of the three approaches quoted in this article, such as Hadoop Archive Files (HAR). I have found article 1 and article 2, which list multiple solutions (I have removed some non-generic alternatives from these articles):
CombineFileInputFormat: The CombineFileInputFormat is an abstract class provided by Hadoop that merges small files at MapReduce read time. The merged files are not persisted to disk; instead, the process reads multiple files and merges them "on the fly" for consumption by a single map task.
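As a rough illustration of that idea, here is a minimal sketch (not code from the cited articles) of a CombineFileInputFormat subclass that packs many small XML files into each split. The nested WholeFileRecordReader is a hypothetical helper that emits one (path, raw bytes) record per file, which a mapper could then feed to a SAX parser:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class CombineXmlInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // treat each small XML file as a single, unsplittable record
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader instantiates one WholeFileRecordReader
        // per file inside the combined split.
        return new CombineFileRecordReader<>(
                (CombineFileSplit) split, context, WholeFileRecordReader.class);
    }

    // Hypothetical reader: loads one whole file from the combined split as one record.
    public static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private final CombineFileSplit split;
        private final TaskAttemptContext context;
        private final int index;          // which file within the combined split
        private boolean processed = false;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();

        // This three-argument constructor is required by CombineFileRecordReader.
        public WholeFileRecordReader(CombineFileSplit split,
                                     TaskAttemptContext context,
                                     Integer index) {
            this.split = split;
            this.context = context;
            this.index = index;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) { }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path path = split.getPath(index);
            int length = (int) split.getLength(index);
            byte[] contents = new byte[length];
            try (FSDataInputStream in =
                     path.getFileSystem(context.getConfiguration()).open(path)) {
                IOUtils.readFully(in, contents, 0, length);
            }
            key.set(path.toString());          // key: the file path
            value.set(contents, 0, length);    // value: the raw XML bytes
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```

In the driver, job.setInputFormatClass(CombineXmlInputFormat.class) together with FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024) should cap each combined split at roughly 128 MB, so one mapper processes hundreds of small files instead of exactly one.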