How do you parse and process HTML/XML in PHP?


Question

How can one parse HTML/XML and extract information from it?

Recommended answer

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
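
As a minimal sketch of the point above, the snippet below loads deliberately broken markup with DOMDocument and pulls out link targets with an XPath query. The sample markup and variable names are made up for illustration; only the DOMDocument, DOMXPath and libxml_* calls are part of PHP itself.

```php
<?php
// Deliberately broken markup: no <html>/<body>, unclosed <li> and <a> tags.
$html = '<ul><li><a href="/docs">Docs<li><a href="/faq">FAQ</ul>';

$dom = new DOMDocument();

// Collect libxml's parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$dom->loadHTML($html); // loadHTML() uses libxml's HTML parser, which tolerates broken markup
libxml_clear_errors();

// Query every href attribute that sits on an <a> element.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attribute) {
    $hrefs[] = $attribute->nodeValue;
}

print_r($hrefs);
```

The same XPath expression would work unchanged against a document loaded with loadXML(), which is part of the appeal of learning the DOM API once.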

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.

A basic usage example can be found in "Grabbing the href attribute of an A element", and a general conceptual overview can be found in "DOMDocument in php".

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML Parser Module.
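
To make the cursor idea concrete, here is a small sketch of pull parsing with XMLReader. The catalog XML is invented for the example; the loop simply advances node by node and reacts only to the element it cares about.

```php
<?php
// Hypothetical sample document, used only for this illustration.
$xml = '<catalog>'
     . '<book id="1"><title>First</title></book>'
     . '<book id="2"><title>Second</title></book>'
     . '</catalog>';

$reader = new XMLReader();
$reader->XML($xml); // parse from a string; open() would read from a file or URL instead

$titles = [];
// read() moves the cursor to the next node and returns false at the end of the stream.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        // readString() returns the text content of the current element.
        $titles[] = $reader->readString();
    }
}
$reader->close();

print_r($titles);
```

Because only the current node is held in memory, this style suits large XML files better than building a full DOM tree.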

A basic usage example can be found in "Getting all the values from h1 tags using PHP".

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
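
For comparison with the pull parser, the following is a rough sketch of the push (SAX) style: you register handlers and the parser calls them as it meets each event. The sample XML and the $inTitle state flag are assumptions made for the example, and the need for that flag hints at why this style is harder to work with.

```php
<?php
// Hypothetical sample document, used only for this illustration.
$xml = '<catalog>'
     . '<book><title>First</title></book>'
     . '<book><title>Second</title></book>'
     . '</catalog>';

$titles  = [];
$inTitle = false; // state we must track ourselves between events

$parser = xml_parser_create();
// Element names are folded to upper case by default; turn that off.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attributes) use (&$inTitle, &$titles) {
        if ($name === 'title') {
            $inTitle  = true;
            $titles[] = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$inTitle) {
        if ($name === 'title') {
            $inTitle = false;
        }
    }
);

// Character data may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inTitle, &$titles) {
        if ($inTitle) {
            $titles[count($titles) - 1] .= $data;
        }
    }
);

// xml_parse() returns 1 on success and 0 on failure.
if (xml_parse($parser, $xml, true) !== 1) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);
```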

SimpleXML

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.
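
A short sketch of both halves of that statement, using invented sample markup: well-formed XML becomes an object you can walk with property access and foreach, while markup that is not well-formed makes the loader return false.

```php
<?php
// Collect libxml errors internally so the failing case below emits no warnings.
libxml_use_internal_errors(true);

// Hypothetical well-formed sample document.
$xml = '<catalog>'
     . '<book id="1"><title>First</title></book>'
     . '<book id="2"><title>Second</title></book>'
     . '</catalog>';

$catalog = simplexml_load_string($xml);
if ($catalog === false) {
    throw new RuntimeException('Sample XML is not well-formed');
}

$titles = [];
foreach ($catalog->book as $book) {
    // Attributes use array syntax, child elements use property syntax.
    $titles[] = $book['id'] . ': ' . $book->title;
}
print_r($titles);

// Typical HTML-style markup with an unclosed <br> is not well-formed XML.
$broken = simplexml_load_string('<p>unclosed<br></p>');
var_dump($broken); // false: SimpleXML refuses the document
libxml_clear_errors();
```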

3rd-Party Libraries (libxml-based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDOM

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery (not updated for years)

phpQuery is a server-side, chainable, CSS3-selector-driven Document Object Model (DOM) API based on the jQuery JavaScript library. It is written in PHP5 and provides an additional Command Line Interface (CLI).

See also: https://github.com/electrolinux/phpquery

Zend_Dom

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.


3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

PHP Simple HTML DOM Parser

  • An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
  • Require PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.

PHPHtmlParser

PHPHtmlParser is a simple, flexible HTML parser which allows you to select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape HTML, whether it's valid or not! This project was originally supported by sunra/php-simple-html-dom-parser, but the support seems to have stopped, so this project is my adaptation of his previous work.

Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.

  • A universal tokenizer and HTML/XML/RSS DOM Parser
    • Ability to manipulate elements and their attributes
    • Supports invalid HTML and UTF8
  • Minify CSS and Javascript
  • Sort attributes, change character case, correct indentation, etc.
  • Parsing documents using callbacks based on current character/token
  • Operations separated in smaller functions for easy overriding

Never used it. Can't tell if it's any good.

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like

html5lib

Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled "How-To for html 5 parsing" that is worth checking out.

If you don't feel like programming PHP, you can also use Web services. In general, I found very little utility for these, but that's just me and my use cases.

ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.


Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.
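
The brittleness is easy to reproduce. In the sketch below, a pattern of the kind often found in web snippets matches one piece of markup but misses an equivalent link once an attribute is added and the quote style changes, while a DOM lookup handles both. The pattern, the sample markup and the firstHref() helper are all hypothetical and exist only for this illustration.

```php
<?php
// A typical "quick" pattern: assumes href is the first attribute and uses double quotes.
$pattern = '/<a href="([^"]*)">/';

$original = '<a href="/docs">Docs</a>';
$changed  = '<a class="nav" href=\'/docs\'>Docs</a>'; // same link, slightly different markup

$regexOriginal = preg_match($pattern, $original, $m) === 1 ? $m[1] : null;
$regexChanged  = preg_match($pattern, $changed, $m) === 1 ? $m[1] : null;

// Hypothetical helper: return the href of the first <a> element, or null if there is none.
function firstHref(string $html): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $anchor = $dom->getElementsByTagName('a')->item(0);
    if ($anchor === null || !$anchor->hasAttribute('href')) {
        return null;
    }
    return $anchor->getAttribute('href');
}

var_dump($regexOriginal);       // the regex finds the link in the original markup
var_dump($regexChanged);        // but returns null after the harmless markup change
var_dump(firstHref($original)); // the parser finds the link in both cases
var_dump(firstHref($changed));
```

The regex could of course be extended to cope with this particular change, but every such variation has to be anticipated by hand, whereas the parser already knows the attribute syntax.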

HTML parsers already know the syntactical rules of HTML. A regular expression knows none of them, so you have to teach those rules to each new RegEx you write. RegEx are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.

Also see "Parsing Html The Cthulhu Way".

If you want to spend some money, have a look at

I am not affiliated with PHP Architect or the authors.
