Trouble fetching some title from a webpage
Problem description
I've written a script in PHP to scrape a title, visible as hair fall shamboo, from a webpage. When I execute the script below, I get the following error:
Notice: Trying to get property 'nodeValue' of non-object in C:\xampp\htdocs\runcode\testfile.php on line 16.
Script I've tried with:
<?php
function get_content($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_exec($ch);
$htmlContent = curl_exec($ch);
curl_close($ch);
return $htmlContent;
}
$link = "https://www.purplle.com/search?q=hair%20fall%20shamboo";
$xml = get_content($link);
$dom = @DOMDocument::loadHTML($xml);
$xpath = new DOMXPath($dom);
$title = $xpath->query('//h1[@class="br-hdng"]/span')->item(0)->nodeValue;
echo "{$title}";
?>
My expected output is:
hair fall shamboo
Although the xpath I used within my above script seems to be correct, I pasted here the relevant portion of html elements within which the title can be found:
<h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>
PostScript: The title I wish to parse gets loaded dynamically. As I'm new to PHP, I don't understand whether the way I tried is correct. If not, what should I do then?
The following are scripts I've created in two different languages, and I found them working like magic.
I got success using javascript:
const puppeteer = require('puppeteer');
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://www.purplle.com/search?q=hair%20fall%20shamboo");
            let urls = await page.evaluate(() => {
                let items = document.querySelector('h1.br-hdng span');
                return items.innerText;
            })
            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run().then(console.log).catch(console.error);
Again, I got success using python:
import requests_html
with requests_html.HTMLSession() as session:
    r = session.get('https://www.purplle.com/search?q=hair%20fall%20shamboo')
    r.html.render()
    item = r.html.find("h1.br-hdng span",first=True).text
    print(item)
So what's wrong with the PHP version?
Recommended answer
It could very well be that there are more issues with your code than I have covered in this answer, but the most prominent issue that I see is the following:
DOMDocument::loadHTML() is not a static method, but an instance method (which returns a boolean). You should first create an instance of DOMDocument and then call loadHTML() on that instance:
$dom = new DOMDocument;
$dom->loadHTML($xml);
However, since you have suppressed errors with the @ operator on that particular line, you are not receiving a warning about this. And although the error suppression operator @ is very commonly used to suppress HTML validation errors like this, you should look into using libxml_use_internal_errors() [1] instead, as this does not suppress general PHP errors.
$dom = new DOMDocument;
$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML($xml);
libxml_use_internal_errors($oldSetting);
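With internal error handling switched on, the parser messages are collected rather than emitted, so they can still be inspected when needed. The following is a minimal sketch of that idea, assuming you want the messages as plain strings; loadHtmlCollectingErrors() is a hypothetical helper name, not something from the original code, while libxml_get_errors() and libxml_clear_errors() are standard PHP functions.

```php
<?php
// Hypothetical helper (not part of the original answer): parse an HTML string
// and return both the document and the libxml messages collected meanwhile.
function loadHtmlCollectingErrors(string $html): array
{
    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError object with public properties.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Empty the error buffer and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    return [$loaded ? $dom : null, $messages];
}

[$dom, $messages] = loadHtmlCollectingErrors(
    '<h1 class="br-hdng"><span>hair fall shamboo</span></h1>'
);
echo count($messages), " parser message(s) collected\n";
```

Whether any messages are reported for a given input depends on the libxml version in use, so the count printed here may differ between installations.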
As a final note:
It's possible to load a DOM document from a URL directly (without the need for cURL) with DOMDocument::loadHTMLFile(), if your PHP installation is configured to allow loading of URLs via the configuration setting allow_url_fopen. Be aware though that this setting is often disabled for security reasons, so use it with care if you plan on using it.
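As a rough illustration of that note, the sketch below checks allow_url_fopen before handing a remote location to loadHTMLFile(). The helper name loadDomFromLocation() and the choice to return null on failure are assumptions made for this example. It is demonstrated with a local temporary file so that it runs without network access.

```php
<?php
// Hypothetical helper: load a DOM from a file path or URL via loadHTMLFile(),
// refusing http(s) locations when allow_url_fopen is switched off.
function loadDomFromLocation(string $location): ?DOMDocument
{
    $isRemote = (bool) preg_match('#^https?://#i', $location);
    $urlsAllowed = filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
    if ($isRemote && !$urlsAllowed) {
        return null; // remote loading is disabled in this PHP configuration
    }

    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTMLFile($location);
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    return $loaded ? $dom : null;
}

// Demonstration with a local temporary file instead of a live URL.
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents(
    $path,
    '<html><body><h1 class="br-hdng"><span>hair fall shamboo</span></h1></body></html>'
);
$dom = loadDomFromLocation($path);
$heading = $dom->getElementsByTagName('span')->item(0)->nodeValue;
echo $heading, "\n";
unlink($path);
```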
Here's a simple test-case which should work as expected:
<?php
$html = '
<html>
<head>
<title>DOMDocument test-case</title>
</head>
<body>
<div class="dummy-container">
<h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>
</div>
</body>';
$dom = new DOMDocument;
$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML( $html );
libxml_use_internal_errors($oldSetting);
$xpath = new DOMXPath( $dom );
$title = $xpath->query( '//h1[@class="br-hdng"]/span' )->item( 0 )->nodeValue;
echo $title;
You should replace the contents of $html with the output of your get_content() call. If it doesn't work, then either:
- there's something wrong with fetching the HTML with cURL (do var_dump( $html ); before loading into DOMDocument, for instance, to see the contents you retrieved), or...
- perhaps you are working inside a namespace, in which case you should prepend a backslash before DOMDocument and DOMXPath, i.e.: new \DOMDocument; and new \DOMXPath( $dom );.
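Whichever of these applies, the notice quoted in the question appears when item(0) returns null because the query matched nothing. That could also happen if, as the question's postscript suggests, the heading is injected by JavaScript after the page loads, since cURL does not execute scripts. A small guard makes that case explicit instead of raising a notice. This is only a sketch; firstNodeText() is a hypothetical helper name introduced for illustration.

```php
<?php
// Hypothetical helper: return the text of the first node matching an XPath
// query, or null when the query fails or matches nothing.
function firstNodeText(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

$dom = new DOMDocument;
$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML(
    '<html><body><h1 class="br-hdng"><span class="pr dib">hair fall shamboo</span></h1></body></html>'
);
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);

$xpath = new DOMXPath($dom);
$found = firstNodeText($xpath, '//h1[@class="br-hdng"]/span');
$missing = firstNodeText($xpath, '//h2[@class="no-such-heading"]/span');

echo $found ?? 'title not found', "\n";
echo $missing ?? 'title not found', "\n";
```

If the guard reports that the title was not found for the real page, dumping the fetched HTML as suggested above should show whether the h1 element is present in the server response at all.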
1. LibXML is the XML library that DOMDocument uses to parse XML/HTML documents.