“内存不足";使用perl解析大型(100 Mb)XML文件时 [英] "Out of memory" while parsing large (100 Mb) XML file using perl


Question

I get an "Out of memory" error while parsing a large (100 Mb) XML file with the following script:

use strict;
use warnings;
use XML::Twig;

my $twig=XML::Twig->new();
my $data = XML::Twig->new
             ->parsefile("divisionhouserooms-v3.xml")
               ->simplify( keyattr => []);

my @good_division_numbers = qw( 30 31 32 35 38 );

foreach my $property ( @{ $data->{DivisionHouseRoom}}) {

    my $house_code = $property->{HouseCode};
    print $house_code, "\n";

    my $amount_of_bedrooms = 0;

    foreach my $division ( @{ $property->{Divisions}->{Division} } ) {

        next unless grep { $_ eq $division->{DivisionNumber} } @good_division_numbers;
        $amount_of_bedrooms += $division->{DivisionQuantity};
    }

    open my $fh, ">>", "Result.csv" or die $!;
    print $fh join("\t", $house_code, $amount_of_bedrooms), "\n";
    close $fh;
}

What can I do to fix this error?

Answer

Quoting XML::Twig's documentation:

One of the strengths of XML::Twig is that it let you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor being often around 10).

To do this you can define handlers, that will be called once a specific element has been completely parsed. In these handlers you can access the element and process it as you see fit (...)
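As a minimal sketch of that handler mechanism (element and file names taken from the question, not a definitive implementation), a handler can be attached to each DivisionHouseRoom element and the twig purged once the element has been handled:

use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        # called each time a DivisionHouseRoom element has been fully parsed
        DivisionHouseRoom => sub {
            my ( $t, $elt ) = @_;
            print $elt->first_child_text('HouseCode'), "\n";
            $t->purge;    # release everything parsed so far
        },
    },
);

$twig->parsefile('divisionhouserooms-v3.xml');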


The code posted in the question isn't making use of the strength of XML::Twig at all (using the simplify method doesn't make it much better than XML::Simple).

What's missing from the code are the 'twig_handlers' or 'twig_roots', which essentially cause the parser to focus on relevant portions of the XML document memory-efficiently.

It's difficult to say without seeing the XML whether processing the document chunk-by-chunk or only selected parts is the way to go, but either approach should solve this issue.
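For the "selected parts only" variant, twig_roots can be keyed on paths so that only the small elements of interest are ever built. A rough sketch, using the element names visible in the question's code and assuming HouseCode appears before the Divisions block inside each DivisionHouseRoom:

use strict;
use warnings;
use XML::Twig;

my $current_house;    # HouseCode of the DivisionHouseRoom currently being parsed

my $twig = XML::Twig->new(
    twig_roots => {
        # only these two kinds of elements are built; everything else is skipped
        'DivisionHouseRoom/HouseCode' => sub {
            my ( $t, $elt ) = @_;
            $current_house = $elt->text;
            $t->purge;
        },
        'Divisions/Division' => sub {
            my ( $t, $elt ) = @_;
            print join( "\t",
                $current_house,
                $elt->first_child_text('DivisionNumber'),
                $elt->first_child_text('DivisionQuantity') ), "\n";
            $t->purge;
        },
    },
);

$twig->parsefile('divisionhouserooms-v3.xml');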

So the code should look something like the following (chunk-by-chunk demo):

use strict;
use warnings;
use XML::Twig;
use List::Util 'sum';   # To make life easier
use Data::Dump 'dump';  # To see what's going on

my %bedrooms;           # Data structure to store the wanted info

my $xml = XML::Twig->new (
                          twig_roots => {
                                          DivisionHouseRoom => \&count_bedrooms,
                                        }
                         );

$xml->parsefile( 'divisionhouserooms-v3.xml');

sub count_bedrooms {

    my ( $twig, $element ) = @_;

    my @divParents = $element->children( 'Divisions' );
    my $id = $element->first_child_text( 'HouseCode' );

    for my $divParent ( @divParents ) {
        my @divisions = $divParent->children( 'Division' );
        my $total = sum map { $_->text } @divisions;
        $bedrooms{$id} = $total;
    }

    $element->purge;   # Free up memory
}

dump \%bedrooms;
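If the tab-separated Result.csv output from the question is still wanted, the collected %bedrooms hash can be written out once parsing has finished (and the @good_division_numbers filter from the question could be reapplied inside count_bedrooms). A sketch, appended after the dump call above:

# Write the collected totals in the same tab-separated layout
# the question's code produced, one house code per line.
open my $out, '>', 'Result.csv' or die "Can't open Result.csv: $!";
for my $house_code ( sort keys %bedrooms ) {
    print {$out} join( "\t", $house_code, $bedrooms{$house_code} ), "\n";
}
close $out;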

这篇关于“内存不足";使用perl解析大型(100 Mb)XML文件时的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆