Trouble fetching some title from a webpage

Views: 169 Published: 2020/10/13 3:20:28 Tags: php curl web-scraping domdocument

Problem description

I've written a script in PHP to scrape the title, visible as hair fall shamboo, from a webpage. When I execute the script below, I get the following error:

Notice: Trying to get property 'nodeValue' of non-object in C:\xampp\htdocs\runcode\testfile.php on line 16.

Link to the website

Script I've tried with:

<?php
function get_content($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    return $htmlContent;
}
$link = "https://www.purplle.com/search?q=hair%20fall%20shamboo";
$xml = get_content($link);
$dom = @DOMDocument::loadHTML($xml);
$xpath = new DOMXPath($dom);
$title = $xpath->query('//h1[@class="br-hdng"]/span')->item(0)->nodeValue;
echo "{$title}";
?>

My expected output is:

hair fall shamboo

The XPath I used in the script above seems to be correct. Here is the relevant portion of the HTML in which the title can be found:

<h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo</span></h1>

P.S.: The title I wish to parse gets loaded dynamically. As I'm new to PHP, I can't tell whether my approach is correct. If it isn't, what should I do?

The following are scripts I've written in two other languages, and both work like magic.

I got success using javascript:

const puppeteer = require('puppeteer');
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://www.purplle.com/search?q=hair%20fall%20shamboo");
            let urls = await page.evaluate(() => {
                let items = document.querySelector('h1.br-hdng span');
                return items.innerText;
            })
            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run().then(console.log).catch(console.error);

Again, I got success using python:

import requests_html
with requests_html.HTMLSession() as session:
    r = session.get('https://www.purplle.com/search?q=hair%20fall%20shamboo')
    r.html.render()
    item = r.html.find("h1.br-hdng span",first=True).text
    print(item)

So what's wrong with php?

Recommended answer

It could very well be that there are more issues with your code than I have covered in this answer, but the most prominent issue that I see is the following:

DOMDocument::loadHTML() is not a static method, but an instance method (which returns a boolean). You should first create an instance of DOMDocument and then call loadHTML() on that instance:

$dom = new DOMDocument;
$dom->loadHTML($xml);

However, since you have suppressed errors with the @ operator on that particular line, you are not receiving a warning about this. And although it's very commonly seen that the error suppressor operator @ is used to suppress HTML validation errors, like this, you should look into using libxml_use_internal_errors()¹ instead, as this does not suppress general PHP errors.

$dom = new DOMDocument;
$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML($xml);
libxml_use_internal_errors($oldSetting);
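
As a small illustration of why libxml_use_internal_errors() is preferable to @, the sketch below collects the parser messages instead of discarding them. The sample markup and the made-up tag are assumptions for demonstration only, and whether this particular markup produces any message depends on the installed libxml version.

```php
<?php
// Illustrative markup (an assumption): the unknown tag may be reported by libxml.
$html = '<html><body><h1 class="br-hdng"><span>hair fall shamboo</span></h1>'
      . '<madeuptag></madeuptag></body></html>';

$dom = new DOMDocument;
$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML($html);

// Collect whatever libxml recorded while parsing, then empty its buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with public message and line properties.
    printf("libxml (line %d): %s\n", $error->line, trim($error->message));
}
```

Because only libxml messages are buffered this way, ordinary PHP notices and warnings on the same lines remain visible.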

As a final note:
It's possible to load a DOM document from a URL directly (without the need for cURL) with DOMDocument::loadHTMLFile(), if your PHP installation is configured to allow loading of URLs via the configuration setting allow_url_fopen. Be aware though that this setting is often disabled for security reasons, so use it with care, if you plan on using it.
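
A minimal sketch of that approach, assuming a hypothetical helper named load_dom_from() (not part of PHP or of the original answer). It checks allow_url_fopen before attempting a remote load and returns null when loading is not possible.

```php
<?php
// Hypothetical helper: load a DOM directly from a URL or a local file path.
function load_dom_from(string $source): ?DOMDocument
{
    // Remote sources require allow_url_fopen; local paths do not.
    $isRemote = preg_match('#^https?://#i', $source) === 1;
    if ($isRemote && !filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return null;
    }

    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTMLFile($source);
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    // loadHTMLFile() returns false on failure.
    return $loaded ? $dom : null;
}
```

Calling load_dom_from() with the URL from the question would return either a DOMDocument or null. Unlike the cURL version, this sketch does not set a custom User-Agent header.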

Here's a simple test-case which should work as expected:

<?php
$html = '
<html>
<head>
    <title>DOMDocument test-case</title>
</head>
<body>
    <div class="dummy-container">
        <h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo</span></h1>
    </div>
</body>
</html>';
$dom = new DOMDocument;
$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML( $html );
libxml_use_internal_errors($oldSetting);
$xpath = new DOMXPath( $dom );
$title = $xpath->query( '//h1[@class="br-hdng"]/span' )->item( 0 )->nodeValue;
echo $title;

See this example interpreted online at 3v4l.org.

You should replace the contents of $html with the output of your get_content() call. If it doesn't work, then either:

there's something wrong with fetching the HTML with cURL (do var_dump( $html ); before loading into DOMDocument, for instance, to see the contents you retrieved), or...

perhaps you are working inside a namespace, in which case you should prepend a backslash before DOMDocument and DOMXPath, i.e.: new \DOMDocument; and new \DOMXPath( $dom );.
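
Both checks can be folded into one defensive helper. The sketch below assumes a hypothetical function named extract_title() (not from the original answer). It uses the backslash-prefixed class names, so it also works inside a namespace, and it returns null instead of triggering the "non-object" notice when the node is missing.

```php
<?php
// Hypothetical helper: return the title text, or null when the node is absent.
function extract_title(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing was fetched, so there is nothing to parse
    }

    $dom = new \DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    $xpath = new \DOMXPath($dom);
    $nodes = $xpath->query('//h1[@class="br-hdng"]/span');

    // query() returns false for an invalid expression; length is 0 when nothing matched.
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->nodeValue);
}

// Stand-in for the output of get_content(); the markup is taken from the question.
$html = '<html><body><h1 class="br-hdng"><span class="pr dib">hair fall shamboo</span></h1></body></html>';
$title = extract_title($html);
echo $title === null ? "Title node not found in the fetched HTML\n" : $title . "\n";
```

If this returns null for the HTML fetched from the real page, the h1 element is probably missing from the server response. The question notes that the title is loaded dynamically, which would explain that.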

¹ LibXML is the XML library that DOMDocument uses to parse XML/HTML documents.
