Using PHP's DOMDocument::preserveWhiteSpace = false and still getting whitespace


Question

I'm scraping this page:

http://kat.ph/search/example/?field=seeders&sorder=desc

in this way:

...
curl_setopt( $curl, CURLOPT_URL, $url );
$header = array (
    'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding:gzip,deflate,sdch',
    'Accept-Language:en-US,en;q=0.8',
    'Cache-Control:max-age=0',
    'Connection:keep-alive',
    'Host:kat.ph',
    'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
);
curl_setopt( $curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19'); 
curl_setopt( $curl, CURLOPT_HTTPHEADER, $header ); 
curl_setopt( $curl, CURLOPT_REFERER, 'http://kat.ph' ); 
curl_setopt( $curl, CURLOPT_ENCODING, 'gzip,deflate,sdch' ); 
curl_setopt( $curl, CURLOPT_AUTOREFERER, true ); 
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, 1 ); 
curl_setopt( $curl, CURLOPT_TIMEOUT, 10 );

$html = curl_exec( $curl );
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
@$dom->loadHTML( $html );

The request has to mimic a browser for the site to respond properly, hence the cURL setup.

But I still get DOMNodes of type #text.

Any ideas why this is happening and how to avoid it?

Recommended answer

It looks like setting the preserveWhiteSpace property to false simply sets the libxml2 XML_PARSE_NOBLANKS flag, which is not always reliable, as this thread suggests. Specifically, when parsing without a DTD, as in this case, the parser keeps empty (whitespace-only) text nodes under some circumstances, mainly when they are siblings of other non-text nodes.

The thread may be a bit dated, but the behavior still exists as described.
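The answer above does not give a workaround. One possible approach, sketched here as an assumption rather than as the answerer's method, is to strip whitespace-only text nodes yourself after loading, using an XPath query. The helper name removeWhitespaceTextNodes and the sample markup are hypothetical. libxml_use_internal_errors() stands in for the @ suppression used in the question.

```php
<?php
// Hypothetical helper (not part of the DOM extension): removes every text
// node that contains only whitespace from an already-loaded document.
function removeWhitespaceTextNodes(DOMDocument $dom): int
{
    $xpath = new DOMXPath($dom);
    // normalize-space() yields an empty string for whitespace-only text.
    $blank = $xpath->query('//text()[normalize-space(.) = ""]');
    // Copy the result into an array so removal does not disturb iteration.
    $nodes = iterator_to_array($blank);
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
    return count($nodes);
}

// Sample markup (an assumption; the scraped page itself is not reproduced).
$html = "<html><body><ul>\n  <li>one</li>\n  <li>two</li>\n</ul></body></html>";

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
libxml_use_internal_errors(true);   // collect parser warnings instead of using @
$dom->loadHTML($html);
libxml_clear_errors();

// How many nodes get removed depends on the libxml2 version in use.
$removed = removeWhitespaceTextNodes($dom);

// After cleanup, the list's children are element nodes only.
$ul = $dom->getElementsByTagName('ul')->item(0);
foreach ($ul->childNodes as $child) {
    echo $child->nodeName, "\n";
}
```

Note that this also removes whitespace-only text inside elements such as pre, where it may be significant. If that matters, narrow the XPath expression to the subtree you actually traverse.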
