How do I account for missing xPaths and keep my data uniform when scraping a website using the DOMXPath query method?

Views: 32  Published: 2021/6/6 18:46:13  php mysql xpath web-scraping domxpath


Problem description

I am attempting to scrape a website using the DOMXPath query method. I have successfully scraped the profile URLs of all 20 News Anchors from this page.

$url = "http://www.sandiego6.com/about-us/meet-our-team";
// XPath for the href of each profile link on the team page
$query = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHTMLFile($url); // @ suppresses warnings from malformed markup
$xpath = new DOMXPath($html);
$nodelist = $xpath->query($query);

// Collect the profile URLs
$profileurl = array();
foreach ($nodelist as $n) {
    $profileurl[] = $n->nodeValue;
}

I then used the URLs in the resulting array to scrape data from each News Anchor's bio page.

$imgurl = array();
// $elementCount was not defined in the snippet; assumed to be the number of profile URLs
$elementCount = count($profileurl);
for ($z = 0; $z < $elementCount; $z++) {
    $html = new DOMDocument();
    @$html->loadHTMLFile($profileurl[$z]);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//img[@class='photo fn']/@src");

    foreach ($nodelist as $n) {
        $imgurl[] = $n->nodeValue;
    }
}

Each News Anchor profile page has 6 xPaths I need to scrape (the $imgurl array holds the results of one of them). I then send the scraped data to MySQL.

So far, everything works great, except when I try to get the Twitter URL from each profile, because this element isn't present on every News Anchor profile page. As a result, MySQL receives 5 columns with 20 full rows and 1 column (twitterurl) with only 18 rows of data. Those 18 rows don't line up with the other data, because a profile whose xPath doesn't exist seems to be skipped.

How do I account for missing xPaths? While looking for an answer, I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." With that in mind, if there is no nodeValue, how can I programmatically recognize when an xPath matches nothing and fill that iteration with a default value before the loop moves on to the next iteration?
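One possible way to detect the missing case, offered as a minimal sketch rather than a definitive answer: `DOMXPath::query()` returns an empty `DOMNodeList` when nothing matches, so a `foreach` over it never runs and nothing is appended. Checking the list's `length` property makes it possible to store a default instead. The helper name `firstValueOrDefault` and the inline markup are illustrative assumptions, not part of the original post.

```php
<?php
// Hypothetical helper: first match of an XPath query, or a default when
// the query matches nothing.
function firstValueOrDefault(DOMXPath $xpath, string $query, ?string $default = null): ?string
{
    $nodes = $xpath->query($query);
    // query() yields false for a malformed expression and an empty
    // DOMNodeList when nothing matches; both are treated as "missing".
    if ($nodes === false || $nodes->length === 0) {
        return $default;
    }
    return $nodes->item(0)->nodeValue;
}

// Inline stand-ins for two profile pages; only the first has a Twitter link.
$pages = [
    '<div id="bio"><div></div><div><p>a</p><p>b</p>'
        . '<p><a href="https://twitter.com/example">t</a></p></div></div>',
    '<div id="bio"><div></div><div><p>a</p><p>b</p></div></div>',
];

$twitterurl = [];
foreach ($pages as $markup) {
    $html = new DOMDocument();
    @$html->loadHTML($markup);
    $xpath = new DOMXPath($html);
    // Exactly one entry is appended per page, so indexes stay aligned.
    $twitterurl[] = firstValueOrDefault($xpath, "//*[@id='bio']/div[2]/p[3]/a/@href");
}
```

Because every page contributes exactly one entry, `$twitterurl` keeps the same length and order as the list of profiles, with `null` marking the pages that have no Twitter link.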

Here's the query for the Twitter URLs:

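$twitterurl = array();
// $elementCount was not defined in the snippet; assumed to be the number of profile URLs
$elementCount = count($profileurl);
for ($z = 0; $z < $elementCount; $z++) {
    $html = new DOMDocument();
    @$html->loadHTMLFile($profileurl[$z]);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//*[@id='bio']/div[2]/p[3]/a/@href");

    // When the xPath matches nothing, this loop body never runs,
    // so no entry is added for that profile
    foreach ($nodelist as $n) {
        $twitterurl[] = $n->nodeValue;
    }
}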
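An alternative that keeps the data uniform by construction is to build one row per profile instead of one array per column. The sketch below assumes PHP's DOM extension and uses `DOMXPath::evaluate()` with the XPath `string()` function, which yields an empty string for an empty node-set. The inline markup and the `$pages` and `$fields` names are hypothetical; only the two XPath expressions come from the question.

```php
<?php
// Hypothetical inline markup standing in for two fetched profile pages.
$pages = [
    '<div id="bio"><div><img class="photo fn" src="/img/a.jpg"></div>'
        . '<div><p>1</p><p>2</p><p><a href="https://twitter.com/example">tw</a></p></div></div>',
    '<div id="bio"><div><img class="photo fn" src="/img/b.jpg"></div>'
        . '<div><p>1</p><p>2</p></div></div>',
];

// Column name => XPath; both expressions are the ones used in the question.
$fields = [
    'imgurl'     => "//img[@class='photo fn']/@src",
    'twitterurl' => "//*[@id='bio']/div[2]/p[3]/a/@href",
];

$rows = [];
foreach ($pages as $markup) {
    $html = new DOMDocument();
    @$html->loadHTML($markup);
    $xpath = new DOMXPath($html);

    $row = [];
    foreach ($fields as $column => $query) {
        // string() of an empty node-set is '', so a missing element never
        // shortens the row; '' is mapped to null here as the default.
        $value = $xpath->evaluate("string($query)");
        $row[$column] = ($value === '') ? null : $value;
    }
    $rows[] = $row; // one complete row per profile
}
```

Each element of `$rows` then carries every column for a single profile, so it could be bound to one prepared INSERT statement per profile, and a missing Twitter link would simply become a NULL in that row.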
