How do I account for missing xPaths and keep my data uniform when scraping a website using the DOMXPath query method?


Problem description

I am attempting to scrape a website using the DOMXPath query method. I have successfully scraped the profile URLs of all 20 News Anchors from this page.

$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

// Load the team page; @ suppresses warnings from malformed HTML.
$html = new DOMDocument();
@$html->loadHTMLFile($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query($xPath);

// Collect the profile URL from each matched href attribute.
$profileurl = array();
foreach ($nodelist as $n) {
    $profileurl[] = $n->nodeValue;
}

I then used each URL in the resulting array to scrape data from that News Anchor's bio page.

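// Assumption: $elementCount is not defined in the original snippet;
// it is taken here to be the number of profile URLs.
$elementCount = count($profileurl);

$imgurl = array();
for ($z = 0; $z < $elementCount; $z++) {
    $html = new DOMDocument();
    @$html->loadHTMLFile($profileurl[$z]);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//img[@class='photo fn']/@src");

    foreach ($nodelist as $n) {
        $imgurl[] = $n->nodeValue;
    }
}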

Each News Anchor profile page has 6 xPaths I need to scrape (the $imgurl array is one of them). I am then sending this scraped data to MySQL.

So far, everything works, except when I try to get the Twitter URL from each profile, because this element isn't present on every News Anchor profile page. As a result, MySQL receives 5 columns with 20 full rows and 1 column (twitterurl) with only 18 rows of data. Those 18 rows do not line up with the other data, because when the xPath doesn't match, that profile seems to be skipped.

How do I account for missing xPaths? While looking for an answer, I found someone's statement that "The nodeValue can never be null because without a value, the node wouldn't exist." Given that, if there is no nodeValue, how can I programmatically recognize when an xPath doesn't exist and fill that iteration with a default value before the loop moves on to the next iteration?

Here's the query for the Twitter URLs:

$twitterurl = array();
for ($z = 0; $z < $elementCount; $z++) {
    $html = new DOMDocument();
    @$html->loadHTMLFile($profileurl[$z]);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//*[@id='bio']/div[2]/p[3]/a/@href");

    // When the query matches nothing, this loop body never runs,
    // so no entry is added for that profile.
    foreach ($nodelist as $n) {
        $twitterurl[] = $n->nodeValue;
    }
}

Recommended answer

Since the Twitter node appears either zero times or once on each page, replace the foreach with a single assignment that always appends exactly one entry per profile:

$twitterurl[] = $nodelist->length ? $nodelist->item(0)->nodeValue : NULL;
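In the context of a full loop, the same idea might look like the sketch below. It is only an illustration: the helper name firstValueOrDefault, the inline sample markup, and the simplified xPath are assumptions made so the example runs without fetching the real site.

```php
<?php
// Hypothetical helper: return the first match's value,
// or a default when the query matches nothing.
function firstValueOrDefault(DOMXPath $xpath, string $query, $default = null)
{
    $nodelist = $xpath->query($query);
    // query() returns false for a malformed expression,
    // otherwise a DOMNodeList that may be empty.
    if ($nodelist === false || $nodelist->length === 0) {
        return $default;
    }
    return $nodelist->item(0)->nodeValue;
}

// Hypothetical stand-ins for two fetched bio pages;
// the second one has no Twitter link.
$pages = [
    '<html><body><div id="bio"><a href="https://twitter.com/anchor1">Twitter</a></div></body></html>',
    '<html><body><div id="bio"><p>No social links</p></div></body></html>',
];

$twitterurl = [];
foreach ($pages as $markup) {
    $html = new DOMDocument();
    @$html->loadHTML($markup);
    $xpath = new DOMXPath($html);
    // Exactly one entry is appended per page, so indexes stay aligned.
    $twitterurl[] = firstValueOrDefault($xpath, "//*[@id='bio']//a/@href");
}
```

Because one value is appended per page whether or not the link exists, $twitterurl ends up with the same number of entries as the other arrays.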

That will keep the contents in sync. You will, however, have to handle NULL values in the query you use to insert the rows into the database.
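One possible way to handle the NULL values is a prepared statement, since a PHP null bound to a placeholder is stored as SQL NULL. The sketch below is an assumption-heavy illustration: the table name anchors, its columns, and the sample data are hypothetical, and an in-memory SQLite database stands in for the MySQL connection so the example is self-contained (it requires the pdo_sqlite extension).

```php
<?php
// Hypothetical sample data: the second profile has no Twitter URL.
$profileurl = ['http://example.com/a', 'http://example.com/b'];
$twitterurl = ['https://twitter.com/a', null];

// Assumption: in-memory SQLite stands in for the real MySQL connection.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// The twitterurl column must allow NULL for this to work.
$pdo->exec('CREATE TABLE anchors (profileurl TEXT NOT NULL, twitterurl TEXT NULL)');

$stmt = $pdo->prepare('INSERT INTO anchors (profileurl, twitterurl) VALUES (?, ?)');
foreach ($profileurl as $i => $url) {
    // The arrays share indexes, so row $i gets its own Twitter URL or NULL.
    $stmt->execute([$url, $twitterurl[$i]]);
}

// Count the rows that were stored without a Twitter URL.
$missing = $pdo->query('SELECT COUNT(*) FROM anchors WHERE twitterurl IS NULL')->fetchColumn();
```

With a MySQL connection the same prepare and execute calls should apply, provided the twitterurl column is declared nullable.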
