HTML::PullParser splits up text element randomly


Question

I'm using the Perl module HTML::PullParser. I noticed that it sometimes splits up a text element at what is, as far as I can tell, a random point.

For example, if I have an HTML file test.html with the content

<html>
...
<FONT STYLE="font-family:Times New Roman" SIZE="2">THE QUICK BROWN FOX</FONT>
...
</html>

and my Perl code looks something like

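use strict;
use warnings;
use HTML::PullParser;
my $html = HTML::PullParser->new(file => 'test.html', text => '"T", text')
    or die "Can't open test.html: $!";
while (my $token = $html->get_token) {
    print "$token->[1]\n";    # element 0 is the literal "T", element 1 is the text
}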

then sometimes I get back

THE QUICK BROWN FOX    # correctly parsed

But at other times I get

THE QUICK
 BROWN FOX

where the text element is parsed into two separate tokens. Yet at other times, depending on the other content of the HTML file, I get

THE QUICK BROWN
 FOX

where the breaking point is different. This behavior is extremely annoying, and I tried my best to isolate the problem. It looks like it depends on the file as a whole (i.e. if I delete the rest of the file so that only that element is left, the text is parsed correctly). However, I'm not able to identify which part of the rest of the file causes this. Has anyone had a similar experience and knows how to get around the issue? Thanks!

UPDATE: the occurrence of this errant behavior is also NOT dependent on a single section of HTML code elsewhere in the file. I was able to isolate two sections of HTML code that precede that text element: when both of them are present, the error occurs, but when either one is present without the other, the problem goes away. I'm absolutely confused and annoyed.

Recommended answer

HTML::PullParser is a subclass of HTML::Parser. HTML::Parser has an unbroken_text attribute that controls whether it emits text events as soon as possible, or whether it buffers text until the parser knows that no more text is coming. The default is to generate text events as soon as possible, so one run of text can arrive as several tokens. A $p->unbroken_text(1) call should make it buffer :)
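
As a minimal sketch of that suggestion, the loop from the question can be wrapped in a small helper that turns unbroken_text on before reading tokens. The helper name text_tokens and the in-memory sample document are assumptions made for illustration (the question reads test.html via the file option instead); unbroken_text, get_token and the doc option are the modules' own.

```perl
use strict;
use warnings;
use HTML::PullParser;

# Sample input held in a string (an assumption for this sketch);
# the question parses test.html with file => 'test.html' instead.
my $doc = '<html><FONT STYLE="font-family:Times New Roman" SIZE="2">'
        . 'THE QUICK BROWN FOX</FONT></html>';

# Hypothetical helper: return the text of every text token in a document.
sub text_tokens {
    my ($html_ref) = @_;

    my $p = HTML::PullParser->new(
        doc  => $html_ref,       # reference to the string to parse
        text => '"T", text',     # same argspec as in the question
    ) or die "Cannot create parser";

    # Buffer consecutive text so each run arrives as one token.
    $p->unbroken_text(1);

    my @texts;
    while (my $token = $p->get_token) {
        push @texts, $token->[1];    # element 1 is the text itself
    }
    return @texts;
}

print "$_\n" for text_tokens(\$doc);
```

The call has to happen before the first get_token, since tokens are produced as the input is consumed. Because HTML::PullParser hands options it does not recognize itself to the HTML::Parser constructor, passing unbroken_text => 1 directly to new should have the same effect.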
