Text::Balanced and multiline XML

Question

I seem to be a bit lost.

I need to parse a large (about 100 MB) and quite ugly XML file. If I use parsefile, it returns an error (junk after document element), but it happily parses smaller elements of the file.

So I decided to break the file into elements and parse them separately. Since parsing XML with regular expressions is discouraged (I tried it anyway, but I got duplicated results), I tried Text::Balanced.

Something like

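use Text::Balanced qw/extract_tagged/;

# FILE is assumed to be an already-opened filehandle
while (<FILE>) {
    # In scalar context, extract_tagged returns the matched element, tags included
    my $result = extract_tagged($_, "<tag>");
    print $result if defined $result;
}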

works just fine, so I can extract tagged entries that fit on one line. With something bigger, however,

use Text::Balanced qw/extract_tagged/;
use File::Slurp;

# Slurp the whole file into one string
my $text = read_file("file");
my $result = extract_tagged($text, "<tag>");
print $result;

does not work. It reads the file, but it cannot find a tagged item in it.

So the question is: how do I extract everything between given tags without XML::Parser? And I really, really need to avoid chomping the input if possible.
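
One possible reason the slurped version finds nothing, assuming the file does not start with <tag>: extract_tagged only looks for the opening tag at the current position, after skipping a prefix that defaults to optional whitespace. The following is a minimal sketch that passes a custom prefix pattern to skip ahead to the next <tag>. The sample input is made up, since the real file is not shown.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical multi-line input standing in for the real file.
my $text = <<'XML';
<root>
<tag>
  first
</tag>
junk
<tag>second</tag>
</root>
XML

my @found;
while (1) {
    # List context: returns the extracted element and the remainder,
    # and leaves the contents of $text unchanged.
    my ($extracted, $remainder) = extract_tagged(
        $text,
        '<tag>',               # opening tag
        undef,                 # closing tag is derived from the opening tag
        '(?s).*?(?=<tag>)',    # prefix: skip everything up to the next <tag>
    );
    last unless defined $extracted && length $extracted;
    push @found, $extracted;
    $text = $remainder;        # continue after the element just extracted
}

print "$_\n---\n" for @found;
```

No chomping is needed, because newlines inside the element are kept as they are. On a 100 MB string this may be slow, so treat it as an illustration rather than a tuned solution.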

P.S. Searching returns regex guides, heredoc how-tos, and anything but what I am looking for.

P.P.S. I'm a moron; I had been trying to parse an invalid file. I am still curious how to chop up a file if the parser fails, though.

bvr's answer was close: it does retrieve some data, but not if the top-level tag is missing.

Recommended answer

For broken XML, I would try setting the recover option of XML::LibXML. It makes the parser ignore parsing errors and continue.
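
A minimal sketch of what that could look like, assuming XML::LibXML is installed. The broken input string is made up for illustration, and how much of the tree survives depends on the kind of damage and on the underlying libxml2 version.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical broken input: the second <item> is never closed.
my $xml = '<root><item>one</item><item>two</root>';

# recover => 1 keeps parsing after errors and warns about them;
# recover => 2 does the same without the warnings.
# For a file, pass location => $path instead of string => $xml.
my $doc = XML::LibXML->load_xml(string => $xml, recover => 2);

# Work with whatever part of the tree could be recovered.
my @items = defined $doc ? $doc->findnodes('//item') : ();
print $_->textContent, "\n" for @items;
```

It is worth checking the recovered tree against the source before trusting it, since the parser has to guess where unclosed elements end.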
