Error Tolerant HTML/XML/SGML parsing in PHP

Question

I have a bunch of legacy documents that are HTML-like. That is, they look like HTML, but have additional made-up tags that aren't part of HTML.

<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>

I need to parse these files. PHP is the only tool available. The documents don't come close to being well-formed XML.

My original thought was to use the loadHTML methods on PHP's DOMDocument. However, these methods choke on the made-up HTML tags, and will refuse to parse the string/file.

$oDom = new DOMDocument();
$oDom->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
// gives us:
// DOMDocument::loadHTML() [function.loadHTML]: Tag pseud-template invalid in Entity, line: 1 occured in ....

The only solution I've been able to come up with is to pre-process the files with string replacement functions that will remove the invalid tags and replace them with a valid HTML tag (maybe a span with an id of the tag name).

Is there a more elegant solution? A way to let DOMDocument know about additional tags to consider as valid? Is there a different, robust HTML parsing class/object out there for PHP?

(If it's not obvious, I don't consider regular expressions a valid solution here.)

Update: The information in the fake tags is part of the goal here, so something like Tidy isn't an option. Also, I'm after something that does some level, if not all, of the well-formedness cleanup for me, which is why I was looking at DOMDocument's loadHTML method in the first place.

Solution

You can suppress the warnings with libxml_use_internal_errors while loading the document. For example:

libxml_use_internal_errors(true);
$doc = new DomDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false);

If, for some reason, you need access to the warnings, use libxml_get_errors.
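
To make the two functions concrete, here is a minimal sketch that combines them. It assumes the same sample markup as in the question; the variable names ($html, $previous, $errors, $fakeTags) are only illustrative. It also calls libxml_clear_errors so the collected errors don't pile up across loads, and restores the previous error-handling setting rather than forcing it to false. Whether an unknown tag is reported as an error at all may depend on the libxml2 version PHP is built against, so treat the error list as informational.

```php
<?php
// Sample markup from the question, containing a made-up tag.
$html = '<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>';

// Collect libxml errors internally instead of raising PHP warnings.
// The call returns the previous setting so it can be restored later.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Each entry is a LibXMLError object (level, code, line, message, ...).
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// The made-up element is kept in the DOM tree and can be queried as usual.
$fakeTags = $doc->getElementsByTagName('pseud-template');
foreach ($fakeTags as $node) {
    echo $node->nodeName, ' => ', $node->textContent, "\n";
}
```

The point of the sketch is that the parser still builds a tree containing the unknown element, so the information inside the fake tags stays reachable through the normal DOM methods.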
