How do you parse and process HTML/XML in PHP?


Question

What are good options for parsing HTML/XML in PHP in order to extract information from it?

Answer

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you ever need to switch programming languages, chances are you will already know how to use that language's DOM API.

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching or browsing Stack Overflow.

A basic usage example and a general conceptual overview are available in other answers.
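As a minimal sketch of the two claims above (tolerating broken HTML and running XPath queries), the following loads markup with unclosed tags and extracts link targets. The sample markup and the variable names are made up for illustration.

```php
<?php
// Deliberately broken HTML: the <p> and <li> elements are never closed.
$html = '<html><body><p>Intro<ul>'
      . '<li><a href="/a">First</a>'
      . '<li><a href="/b">Second</a>'
      . '</ul></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);   // loadHTML() uses libxml's HTML parser module

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query the repaired tree with XPath.
$xpath = new DOMXPath($dom);
$links = [];
foreach ($xpath->query('//li/a[@href]') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($links);   // ['/a' => 'First', '/b' => 'Second']
```

The libxml_use_internal_errors() call is optional, but without it every repair the parser makes on real-world markup tends to surface as a PHP warning.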

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of a way to trigger the HTML Parser Module from XMLReader, so parsing broken HTML with it is likely to be less robust than with DOM, where you can explicitly tell it to use libxml's HTML Parser Module.

A basic usage example is available in another answer.
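To illustrate the cursor model, here is a small sketch that walks a document node by node and collects values as the reader passes them. The catalog/book/title structure is an invented sample, not something taken from the manual.

```php
<?php
$xml = <<<XML
<catalog>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</catalog>
XML;

$reader = new XMLReader();
$reader->XML($xml);          // for files, open($path) streams from disk instead

$titles = [];
$currentId = null;

// read() advances the cursor to the next node and returns false at the end.
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;            // skip whitespace, text and closing tags
    }
    if ($reader->name === 'book') {
        $currentId = $reader->getAttribute('id');
    } elseif ($reader->name === 'title') {
        // readString() returns the text content of the current element.
        $titles[$currentId] = $reader->readString();
    }
}
$reader->close();

print_r($titles);   // ['b1' => 'First', 'b2' => 'Second']
```

Only the current node is held in memory at any time, which is why this style suits very large documents.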

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
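A rough sketch of the push model, for comparison: instead of asking for the next node, you register handlers and the parser calls them. The sample XML and the closure-based bookkeeping are illustrative assumptions.

```php
<?php
$xml = '<catalog>'
     . '<book id="b1"><title>First</title></book>'
     . '<book id="b2"><title>Second</title></book>'
     . '</catalog>';

$titles  = [];
$inTitle = false;
$buffer  = '';

$parser = xml_parser_create();
// Element names are folded to upper case by default; keep them as written.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attrs) use (&$inTitle, &$buffer) {
        if ($name === 'title') {
            $inTitle = true;
            $buffer  = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$inTitle, &$buffer, &$titles) {
        if ($name === 'title') {
            $titles[] = $buffer;
            $inTitle  = false;
        }
    }
);

// Text may arrive in several chunks, so it is accumulated in a buffer.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inTitle, &$buffer) {
        if ($inTitle) {
            $buffer .= $data;
        }
    }
);

if (xml_parse($parser, $xml, true) !== 1) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);   // ['First', 'Second']
```

All state has to be tracked by hand across callbacks, which is what makes this harder to work with than the XMLReader loop.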

SimpleXML

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.

A basic usage example is available, and there are lots of additional examples in the PHP Manual.
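A short sketch of both points, assuming a small hand-written XHTML snippet: well-formed input is navigated with property and array syntax, while input with an unclosed tag is rejected.

```php
<?php
$xhtml = '<html><body><ul>'
       . '<li><a href="/a">First</a></li>'
       . '<li><a href="/b">Second</a></li>'
       . '</ul></body></html>';

$doc = simplexml_load_string($xhtml);
if ($doc === false) {
    throw new RuntimeException('Input is not well-formed XML');
}

// Child elements are properties, attributes use array access.
$links = [];
foreach ($doc->body->ul->li as $li) {
    $links[(string) $li->a['href']] = (string) $li->a;
}

// The same list with unclosed <li> tags cannot be loaded at all.
$previous = libxml_use_internal_errors(true);   // silence the parse warnings
$broken   = simplexml_load_string('<ul><li>First<li>Second</ul>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($broken);   // bool(false)
```

The explicit (string) casts matter: without them the array would hold SimpleXMLElement objects rather than plain strings.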


3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDOM

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library. The library is written in PHP5 and provides additional Command Line Interface (CLI).

This is described as "abandonware and buggy: use at your own risk" but does appear to be minimally maintained.

laminas-dom

The Laminas\Dom component (formerly Zend_DOM) provides tools for working with DOM documents and structures. Currently, we offer Laminas\Dom\Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

This package is considered feature-complete, and is now in security-only maintenance mode.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.


3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

PHP Simple HTML DOM Parser

  • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.

PHP Html Parser

PHPHtmlParser is a simple, flexible HTML parser which allows you to select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape HTML, whether it's valid or not! This project was originally supported by sunra/php-simple-html-dom-parser, but the support seems to have stopped, so this project is my adaptation of his previous work.

Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems get worse with nested loops in particular. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 April 2016.


HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser. Note that these are written in PHP, so they suffer from slower performance and increased memory usage compared to a compiled extension written in a lower-level language.

HTML5DOMDocument

HTML5DOMDocument extends the native DOMDocument library. It fixes some bugs and adds some new functionality.

  • Preserves html entities (DOMDocument does not)
  • Preserves void tags (DOMDocument does not)
  • Allows inserting HTML code that moves the correct parts to their proper places (head elements are inserted in the head, body elements in the body)
  • Allows querying the DOM with CSS selectors (currently available: *, tagname, tagname#id, #id, tagname.classname, .classname, tagname.classname.classname2, .classname.classname2, tagname[attribute-selector], [attribute-selector], div, p, div p, div > p, div + p, and p ~ ul.)
  • Adds support for element->classList.
  • Adds support for element->innerHTML.
  • Adds support for element->outerHTML.

HTML5

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.

HTML5 provides the following features.

  • An HTML5 serializer
  • Support for PHP namespaces
  • Composer support
  • Event-based (SAX-like) parser
  • A DOM tree builder
  • Interoperability with QueryPath
  • Runs on PHP 5.3.0 or newer


Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules anew for each RegEx you write. RegEx are fine in some cases, but it really depends on your use-case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job at this.
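To make the brittleness concrete, here is a small sketch comparing a typical copy-and-paste pattern with a DOM lookup on two equivalent anchors. The pattern, the sample markup and the hrefViaDom() helper are all hypothetical, written only for this comparison.

```php
<?php
// A typical snippet: assumes href is the first attribute and uses double quotes.
$pattern = '/<a href="([^"]*)">/';

$variantA = '<a href="/home">Home</a>';
// Same link, but with an extra attribute and single quotes.
$variantB = "<a class=\"nav\" href='/home'>Home</a>";

$regexA = preg_match($pattern, $variantA);   // 1: matches
$regexB = preg_match($pattern, $variantB);   // 0: silently finds nothing

// Hypothetical helper: let the HTML parser deal with quoting and attribute order.
function hrefViaDom(string $html): ?string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $a = $dom->getElementsByTagName('a')->item(0);
    return $a instanceof DOMElement ? $a->getAttribute('href') : null;
}

$domA = hrefViaDom($variantA);   // '/home'
$domB = hrefViaDom($variantB);   // '/home'
```

The pattern could of course be extended to cover the second variant, but each such extension is one more HTML rule that has to be taught to the expression by hand.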

Also see Parsing Html The Cthulhu Way


Books

If you want to spend some money, have a look at

I am not affiliated with PHP Architect or the authors.
