PHP Native DOMDocument and Simple DOM Parser - is there a size limit?

Question

I need to parse the contents of an HTML document (produced by Microsoft Word): traverse the DOM to get the information I need, then output it as CSV. Hardly brain surgery, I know.

Now, as PHP isn't really my thing and I have a tight schedule, I was going to use the PHP Simple HTML DOM Parser from http://simplehtmldom.sourceforge.net/

I noticed that my script isn't working. After some trial and error I realised that this is due to the size of the HTML files produced by Word (they are 3 MB and have as many as 30,000 lines of HTML!). I assume there is a file size limit on what can be parsed with the PHP Simple HTML DOM Parser, and perhaps with the native PHP DOMDocument API too? If so, does anyone know what this limit is? I've been googling for 40 minutes now with no success.

Maybe I should just use Node.js?

Solution

PHP's "native" DOMDocument and its little sister SimpleXMLElement do not have a hard-coded size limit, but they are limited by the memory you allow PHP to use (see the PHP memory limit, memory_limit).
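To see which limit applies to a given script, the configured value can be read at runtime. The following is a minimal sketch; the helper name shorthandToBytes is hypothetical and only handles the common K/M/G shorthand suffixes.

```php
<?php
// Hypothetical helper: convert a php.ini shorthand value such as "128M" to bytes.
function shorthandToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '-1') {
        return -1; // -1 means "no limit"
    }
    $number = (int) $value; // leading digits, e.g. 128 from "128M"
    switch (strtoupper(substr($value, -1))) {
        case 'G':
            return $number * 1024 ** 3;
        case 'M':
            return $number * 1024 ** 2;
        case 'K':
            return $number * 1024;
        default:
            return $number; // plain byte count
    }
}

// Report the limit for the current script and what is in use right now.
$limit = ini_get('memory_limit');
printf("memory_limit = %s (%d bytes)\n", $limit, shorthandToBytes($limit));
printf("current usage = %d bytes\n", memory_get_usage());
```

If the limit turns out to be the bottleneck, ini_set('memory_limit', '512M') raises it for the current script where the hosting configuration permits; the value 512M is only an example.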

Also, you must not assume that loading a 100 MB XML or HTML file will consume an equal amount of memory. Most often it takes much less memory than the file size (a fifth or a tenth, for example). The ratio depends on the XML, so there is no fixed factor X; you need to measure it yourself if you want precise figures.

The file size you give in your question, 3 MB, is rather small, I'd say. Maybe not small for an HTML file on the internet, but small for the libxml-based PHP extensions. You can find out how much memory PHP uses when loading that file with memory_get_usage().
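A measurement along those lines could look like the sketch below. It builds a synthetic, Word-like document of roughly 3 MB as a stand-in for the real file (the markup and the function name measureLoad are assumptions for illustration). Note that memory_get_usage() reports what PHP's own allocator hands out, so memory allocated inside libxml itself may not be fully reflected in these numbers.

```php
<?php
// Hypothetical helper: load an HTML string with DOMDocument and report cost.
function measureLoad(string $html): array
{
    $before = memory_get_usage();
    $start = microtime(true);

    // Word HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [
        'loaded'       => $loaded,
        'paragraphs'   => $doc->getElementsByTagName('p')->length,
        'bytes'        => strlen($html),
        'memory_delta' => memory_get_usage() - $before,
        'peak'         => memory_get_peak_usage(),
        'seconds'      => microtime(true) - $start,
    ];
}

// Synthetic input: 30,000 paragraphs, roughly 3 MB of markup.
$paragraph = '<p class="MsoNormal"><span style="font-size:11pt">'
    . 'Lorem ipsum dolor sit amet</span></p>' . "\n";
$html = '<html><body>' . str_repeat($paragraph, 30000) . '</body></html>';

print_r(measureLoad($html));
```

Replacing the synthetic string with file_get_contents() on the real Word export gives figures for the actual document.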

If you have really large files (normally XML or XHTML rather than HTML), let's say 1.5 gigabytes, parsing with DOMDocument will take a lot of lead time. In that case XMLReader lets you parse the document without loading it into memory completely. It is no silver bullet, because you still pay the parse time, but you can better control what to parse and which parts to skip, which gives you more room for optimization in PHP userland.
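As a rough illustration of the streaming approach, the sketch below walks an XML file node by node with XMLReader and counts elements of one name, without ever building a tree. The function name countElements and the rows/row/cell sample document are made up for the example.

```php
<?php
// Hypothetical helper: count elements named $name in an XML file by streaming it.
function countElements(string $path, string $name): int
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    $count = 0;
    // read() advances one node at a time; only the current node is held in memory.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            $count++;
        }
    }
    $reader->close();

    return $count;
}

// Demo with a small temporary file standing in for a very large one.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $path,
    '<rows>' . str_repeat('<row><cell>x</cell></row>', 1000) . '</rows>'
);
echo countElements($path, 'row'), "\n"; // prints 1000
unlink($path);
```

The loop body is where the control mentioned above comes in: parts of the document that are not needed can simply be skipped instead of being turned into objects.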


The PHP Simple HTML DOM Parser library does not impose a specific size limit either. However, it is not a binary extension of PHP but lives in PHP userland, so you need to understand what exactly that library does (see simple_html_dom.php in the HEAD revision). If you review the code you can see it is a parser written purely in PHP. This is because it was originally written for PHP 4, where DOMDocument with DOMDocument::loadHTML did not exist yet.

As you can imagine, a PHP extension can manage memory much better than a library written in PHP code, especially for tree structures such as an HTML document object model. (That is not inherently true, but making a userland implementation this memory-efficient takes a lot of work and a good design, which is neither easy to create nor easy to maintain.)

However, for many years now it has not been necessary to use that library. Many PHP users do not know that, and they find outdated code examples that use this once-popular library. The PHP Simple HTML DOM Parser even still gets suggested from time to time here on Stack Overflow.

So the best suggestion I can give is: unless you need to write PHP 4 compatible code, do not use that library at all and do not worry about its limits. Instead, port your code to DOMDocument::loadHTML().
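A port could look roughly like the following sketch. The question does not say which parts of the Word document are needed, so this assumes, purely for illustration, that the content sits in table rows; the function name htmlTableToRows and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: extract the cell text of every table row from an HTML string.
function htmlTableToRows(string $html): array
{
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td|./th', $tr) as $cell) {
            // Collapse the whitespace Word scatters through its markup.
            $cells[] = trim(preg_replace('/\s+/', ' ', $cell->textContent));
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }

    return $rows;
}

// Small Word-like sample standing in for the real export.
$html = '<html><body><table>
<tr><th>Name</th><th>Qty</th></tr>
<tr><td><p class="MsoNormal">Widget</p></td><td>3</td></tr>
</table></body></html>';

// Write the rows as CSV to standard output.
$out = fopen('php://output', 'w');
foreach (htmlTableToRows($html) as $row) {
    fputcsv($out, $row, ',', '"', '\\');
}
fclose($out);
```

DOMXPath keeps the traversal declarative: changing the two query strings is enough to target paragraphs, headings, or whatever structure the real document uses.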
