Trouble fetching some title from a webpage


Problem description

I've written a PHP script to scrape the title, shown as "hair fall shamboo", from a webpage. When I run the script below, I get the following error:


Notice: Trying to get property 'nodeValue' of non-object in C:\xampp\htdocs\runcode\testfile.php on line 16.

Link to that site: https://www.purplle.com/search?q=hair%20fall%20shamboo

Script I've tried with:

<?php
    function get_content($url){
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_exec($ch);
        $htmlContent = curl_exec($ch);
        curl_close($ch);
        return $htmlContent;
    }
    $link = "https://www.purplle.com/search?q=hair%20fall%20shamboo"; 
    $xml = get_content($link);
    $dom = @DOMDocument::loadHTML($xml);
    $xpath = new DOMXPath($dom);
    $title = $xpath->query('//h1[@class="br-hdng"]/span')->item(0)->nodeValue;
    echo "{$title}";
?>

My expected output is:

hair fall shamboo

Although the XPath I used in the script above seems to be correct, here is the relevant portion of the HTML in which the title can be found:

<h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>

P.S.: The title I wish to parse is loaded dynamically. As I'm new to PHP, I don't know whether my approach is correct. If it isn't, what should I do?


The following scripts, which I wrote in two other languages, work like magic.

I got success using JavaScript:

const puppeteer = require('puppeteer');
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://www.purplle.com/search?q=hair%20fall%20shamboo");
            let urls = await page.evaluate(() => {
                let items = document.querySelector('h1.br-hdng span');
                return items.innerText;
            });
            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run().then(console.log).catch(console.error);

Again, I got success using Python:

import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.purplle.com/search?q=hair%20fall%20shamboo')
    r.html.render()
    item = r.html.find("h1.br-hdng span",first=True).text
    print(item)

So what's wrong with PHP?

Recommended answer

It could very well be that there are more issues with your code than I cover in this answer, but the most prominent issue I see is the following:

DOMDocument::loadHTML() is not a static method but an instance method (which returns a boolean). Older PHP versions tolerated the static call with a low-level diagnostic, while PHP 8.0 and later throw an Error. You should first create an instance of DOMDocument and then call loadHTML() on that instance:

$dom = new DOMDocument;
$dom->loadHTML($xml);

However, since you suppressed errors with the @ operator on that line, you are not receiving a warning about this. Although the error suppression operator @ is commonly used like this to silence HTML validation errors, you should look into using libxml_use_internal_errors() (see note 1) instead, as it does not suppress general PHP errors.

$dom = new DOMDocument;
$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML($xml);
libxml_use_internal_errors($oldSetting);
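
To see what libxml actually complained about rather than silently discarding it, the collected messages can be read back with libxml_get_errors() before they are cleared. The following is only a minimal sketch; the helper name load_html_collecting_errors() is an assumption made for illustration and is not part of the original answer:

```php
<?php
// Hypothetical helper (the name is an assumption): parse an HTML string and
// return the document together with the libxml messages raised while parsing.
function load_html_collecting_errors(string $html): array
{
    $dom = new DOMDocument;

    // Collect libxml errors internally instead of emitting PHP warnings.
    $oldSetting = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError object exposing line, code and message.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Empty the error buffer and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    return ['dom' => $loaded ? $dom : null, 'errors' => $messages];
}

$result = load_html_collecting_errors('<h1 class="br-hdng"><span>hair fall shamboo</span></h1>');
echo count($result['errors']), " libxml message(s)\n";
```

Unlike the @ operator, this keeps ordinary PHP errors visible and still lets you inspect the parser's complaints when the markup is suspect.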

As a final note:
It's possible to load a DOM document directly from a URL (without cURL) with DOMDocument::loadHTMLFile(), if your PHP installation allows loading URLs via the allow_url_fopen configuration setting. Be aware, though, that this setting is often disabled for security reasons, so use it with care if you plan on using it.
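
As a rough illustration of that note, the sketch below checks allow_url_fopen before handing a URL to loadHTMLFile(). The helper name load_dom_from_location() is hypothetical, and the check is only one plausible way to guard the call:

```php
<?php
// Hypothetical helper (the name is an assumption): load a document straight
// from a local path or a URL. Remote URLs need allow_url_fopen to be enabled.
function load_dom_from_location(string $location): ?DOMDocument
{
    $isRemote = (bool) preg_match('#^https?://#i', $location);
    $urlsAllowed = filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
    if ($isRemote && !$urlsAllowed) {
        return null; // URL wrappers are disabled in this installation
    }

    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTMLFile($location);
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    return $loaded ? $dom : null;
}

// Example call (performs a network request, so it is left commented out):
// $dom = load_dom_from_location('https://www.purplle.com/search?q=hair%20fall%20shamboo');
```

Note that loadHTMLFile() gives you no control over headers such as the user agent, which is one reason to stay with cURL when the site expects a browser-like request.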

Here's a simple test case which should work as expected:

<?php

$html = '
<html>
<head>
  <title>DOMDocument test-case</title>
</head>
<body>
  <div class="dummy-container">
    <h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>
  </div>
</body>
</html>';

$dom = new DOMDocument;

$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML( $html );
libxml_use_internal_errors($oldSetting);

$xpath = new DOMXPath( $dom );
$title = $xpath->query( '//h1[@class="br-hdng"]/span' )->item( 0 )->nodeValue;
echo $title;

See this example interpreted online at 3v4l.org

You should replace the contents of $html with the output of your get_content() call. If it doesn't work, then either:


  1. there's something wrong with fetching the HTML with cURL (do var_dump( $html ); before loading it into DOMDocument, for instance, to see the contents you retrieved), or...

  2. perhaps you are working inside a namespace, in which case you should prepend a backslash to DOMDocument and DOMXPath, i.e.: new \DOMDocument; and new \DOMXPath( $dom );.
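
Whichever of these turns out to be the cause, the notice itself appears because item(0) returns null when the XPath query matches nothing. A minimal sketch of a guard, assuming a hypothetical helper named extract_title() that returns null instead of triggering the notice:

```php
<?php
// Hypothetical helper (the name is an assumption): return the heading text,
// or null when the fetched markup does not contain the expected node.
function extract_title(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing was fetched, so there is nothing to parse
    }

    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//h1[@class="br-hdng"]/span');

    // item(0) is null when nothing matched; reading nodeValue from null is
    // what raises the "property of non-object" notice.
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->nodeValue);
}

$title = extract_title('<h1 class="br-hdng"><span class="pr dib">hair fall shamboo<!----></span></h1>');
echo $title ?? 'title element not found in the fetched HTML', "\n";
```

If this prints the fallback message for the output of get_content(), the element is simply absent from the HTML that cURL received, which is worth checking with var_dump() as suggested in point 1.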






1. LibXML is the XML library that DOMDocument uses to parse XML/HTML documents.
