Parsing millions of small XML files


Problem Description


I have 10 million small XML files (300KB-500KB). I'm using Mahout's XML input format in MapReduce to read the data, and I'm using a SAX parser for parsing. But processing is very slow. Will using compression (LZO) on the input files help improve performance? Each folder contains 80-90k XML files, and when I start the process it runs a mapper for each file. Is there any way to reduce the number of mappers?

Solution

You can follow one of the three approaches as quoted in this article:

1. Hadoop Archive File (HAR)
2. Sequence Files (see the packing sketch below)
3. HBase
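
For the sequence-file option, a minimal packing sketch is shown below. It assumes the small XML files sit in a single HDFS directory, and it stores each filename as the key and the raw file bytes as the value; the class name, paths, and the choice of block compression are illustrative assumptions, not part of the original answer.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical one-off packing step: consolidates the small XML files in one
// directory into a single block-compressed SequenceFile (filename -> bytes).
public class XmlToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]); // e.g. an 80-90k-file folder
        Path outFile = new Path(args[1]); // one large .seq file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue; // skip subdirectories
                }
                // Safe cast: the files are only 300KB-500KB each.
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // Key = original filename, value = raw XML bytes, so the
                // original names survive consolidation.
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
    }
}
```

A downstream MapReduce job can then read the packed file with SequenceFileInputFormat, receiving one (filename, contents) record per map call instead of spawning one mapper per file.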

I have found article 1 and article 2, which list multiple solutions (I have removed some non-generic alternatives from these articles):

1. Change the ingestion process/interval: change the logic at the source level to reduce the large number of small files and try to generate a small number of big files instead.
2. Batch file consolidation: when small files are unavoidable, file consolidation is the most common solution. With this option you periodically run a simple consolidating MapReduce job to read all of the small files in a folder and rewrite them into fewer, larger files.
3. Sequence files: when there is a requirement to maintain the original filenames, a very common approach is to use sequence files (as sketched above). In this solution, the filename is stored as the key in the sequence file and the file contents are stored as the value.
4. HBase: instead of writing files to disk, write them to the HBase memory store.
5. Using a CombineFileInputFormat: the CombineFileInputFormat is an abstract class provided by Hadoop that merges small files at MapReduce read time. The merged files are not persisted to disk. Instead, the process reads multiple files and merges them "on the fly" for consumption by a single map task (a driver sketch follows this list).
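
As a sketch of option 5, the driver below uses CombineTextInputFormat, the concrete line-oriented subclass that ships with Hadoop; to keep whole XML documents as records you would instead wrap an XML-aware record reader (such as Mahout's) in a custom CombineFileInputFormat subclass. The 128 MB split cap, class names, and pass-through mapper are illustrative assumptions, not from the original answer.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver illustrating the CombineFileInputFormat idea: many
// small files are grouped into ~128 MB input splits at read time, so one
// map task consumes many files instead of one.
public class CombineSmallFilesDriver {

    // Placeholder mapper; in the question's setup this is where the SAX
    // parsing of each record would happen.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-xml");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Group small files into larger splits, capped at 128 MB each.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));

        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only job for this sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With this setup the number of map tasks is driven by the split-size cap rather than by the file count, which directly addresses the question of reducing the number of mappers.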
