Keeping file offsets while parsing HTML with the DOM?


Question

I want to modify the src attributes of <img> tags in not-too-malformed HTML (WordPress posts). I know I could take the easy way out and use regexes, but I'm afraid people in blue furry suits will come to haunt me in my sleep.

If I read the HTML with a DOM parser and modify the <img> tags, I'm afraid I won't be able to reconstruct the post exactly as it was (with only my modification applied), because the DOM parser will probably do too much cleanup and may remove essential data. A SAX parser probably can't handle invalid XML, so that won't work either.

So, is there a middle way: a DOM parser that also knows where each element started in the source, so that I can do string replacements or something similar from there? I know some nodes in the DOM tree will not exist in the source document (<b>Some <i>bizarre</b> formatting</i> will probably trigger this), but does that mean it is always impossible? I see that a DOMNode::getLineNo() method was added in PHP 5.3, but I'm using 5.2.x.

Recommended answer

If PHP's DOM writes results that are "too clean", you could check whether the string-based SimpleHTMLDOM is more lenient.
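To judge how much cleanup PHP's DOM actually performs on a given post, a round trip can be compared against the original string. The following is only a minimal sketch: the sample markup in `$post` and the `https://cdn.example.com/` prefix are made up for illustration and are not from the question.

```php
<?php
// Hypothetical sample post content (assumption, not from the question).
$post = '<p>Intro<img src="a.png" alt=x><b>Some <i>bizarre</b> formatting</i>';

// Collect parser warnings about sloppy markup instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($post);

// Rewrite every src attribute; the CDN prefix is a made-up example value.
foreach ($doc->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', 'https://cdn.example.com/' . $img->getAttribute('src'));
}

$roundTripped = $doc->saveHTML();
libxml_clear_errors();

// Print the serialized document next to the original to see what the
// parser added, reordered, or quoted differently.
echo $post, "\n\n", $roundTripped, "\n";
var_dump($roundTripped === $post);
```

If the differences go beyond the intended src change (added wrapper elements, re-nested tags, requoted attributes), writing the DOM output back over the post is probably not acceptable for this use case.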

However, with formatting as bizarre as in your example, I would never entirely trust a parser to get it "right". Try it out, though; maybe it simply skips such constructs.

The DOM library's DOMNode class has a getLineNo() method. I don't entirely see how it would help, though, since it gives only a line number, with no offset within the line to go with it. I'm not sure whether that helps your use case.
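For what it's worth, here is a rough sketch of how a line number alone might be turned into a guess at a byte offset. It assumes PHP 5.3 or later (so it does not run on 5.2.x), and it assumes that the first "<img" at or after the start of the reported line belongs to the node in question, which breaks down as soon as one line holds several images. The sample markup and the `$found` array are made up for illustration.

```php
<?php
// Hypothetical sample content (assumption, not from the question).
$post = "<p>First line</p>\n<p><img src=\"a.png\"></p>\n<p>Text <img src=\"b.png\"></p>\n";

// Keep warnings about sloppy markup out of the output.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($post);

$sourceLines = explode("\n", $post);
$found = array();

foreach ($doc->getElementsByTagName('img') as $img) {
    $lineNo = $img->getLineNo(); // line number as reported by the parser
    if ($lineNo < 1 || !isset($sourceLines[$lineNo - 1])) {
        continue; // no usable line information for this node
    }

    // Byte offset at which the reported line starts in the original string.
    $lineStart = 0;
    for ($i = 0; $i < $lineNo - 1; $i++) {
        $lineStart += strlen($sourceLines[$i]) + 1; // +1 for the "\n"
    }

    // Guess: the first "<img" at or after the line start is this node.
    $offset = stripos($post, '<img', $lineStart);

    $found[] = array(
        'src'    => $img->getAttribute('src'),
        'line'   => $lineNo,
        'offset' => $offset,
    );
}
libxml_clear_errors();

print_r($found);
```

Any offset obtained this way should be checked against the original string (for example, that the src value found at that position matches the one the DOM reports) before it is used for a replacement.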
