XPath Query & HTML - Find Specific HREFs Within Anchor Tags


Question

I've got the required HTML data loaded in a DOMDocument and a DOMXPath.

But I need to access and retrieve the href values of certain <a> tags. These are the criteria:

  1. href contains: some-site.vendor.com/jobs/[#idnumber]/job (e.g. some-site.vendor.com/jobs/23094/job)

  2. href does NOT contain: some-site.vendor.com/jobs/search?search=pr2

  3. href does NOT contain: some-site.vendor.com/jobs/intro

  4. href does NOT contain: www.someothersite.com/

  5. href does NOT contain: media.someothersite.com/

  6. href does NOT contain: javascript:void(0)

Either of these (similar) queries fetches everything but 4-6, which is a good thing:

$joblinks = $xpath->query('//a[@href[contains(., "https://some-site.vendor.com/jobs/")]]');    
$joblinks = $xpath->query('//a[@href[contains(., "job")]]');

Ultimately, however, I need to access all the anchor tags whose hrefs look like #1 and assign the actual href values to a variable/array. Here's what I'm doing:

$payload = fetchRemoteData(SPEC_SOURCE_URL);

// suppress warning(s) due to malformed markup
libxml_use_internal_errors(true);

// load the fetched contents
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($payload);

// parse and cache the required data elements
$xpath = new DOMXPath($dom);

//$joblinks = $xpath->query('//a[@href[contains(., "some-site.vendor.com/jobs/")]]');
$joblinks = $xpath->query('//a[@href[contains(., "job")]]');
foreach($joblinks as $joblink) {
    var_dump(trim($joblink->nodeValue)); // dump hrefs here!
}
echo "\n";

This is really beating me up. I'm close, but I just can't seem to tweak the query correctly and/or access the actual href values. My humblest apologies if I've not followed the protocol for this kind of question...

ANY/ALL help would be greatly appreciated! Thanks SO MUCH in advance!

Recommended answer

I would not suggest doing this solely with XPath. First of all, you have both a whitelist and a blacklist. It's not really clear what you want, so I assume the rules can change over time.

So what you can do is first select all the href attributes in question and return the nodes. That's what XPath is very good at, so let's use XPath:

if (!$links = $xpath->query('//a/@href')) {
    throw new Exception('XPath query failed.');
}

You now have the usual DOMNodeList in $links, and it contains zero or more DOMAttr nodes, since those are what we selected. These now need the filtering you're looking for.
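As a small illustration of reading the values out of those DOMAttr nodes, here is a minimal sketch. The sample markup in $html is made up for the example; it stands in for the fetched page. It also shows why the loop in the question prints link text instead of hrefs: nodeValue on an <a> element is its text content, so the attribute has to be read with getAttribute('href').

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<p><a href="https://some-site.vendor.com/jobs/23094/job">Job</a>'
      . '<a href="javascript:void(0)">Menu</a><a name="top">No href</a></p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Route 1: select the attributes themselves.
$hrefs = array();
foreach ($xpath->query('//a/@href') as $attr) {
    // $attr is a DOMAttr; its value is the raw href string.
    $hrefs[] = trim($attr->value);
}

// Route 2: select the <a> elements and read the attribute from each.
$fromElements = array();
foreach ($xpath->query('//a[@href]') as $a) {
    // nodeValue here would be the link text ("Job"), not the href.
    $fromElements[] = trim($a->getAttribute('href'));
}
```

Both arrays end up holding the same two raw href strings; the anchor without an href is skipped by either expression.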

So you have some criteria you want to match. You were verbose but not very specific about how that should work. You have a positive match but also negative matches, and in neither case do you say what should happen when a link does not match. So I take a shortcut here: you write yourself a function that returns true or false depending on whether an "href" string matches the criteria:

function is_valid_href($href) {

    // do whatever you see fit ...

    // placeholder so the stub is valid PHP: return true for a match, false otherwise
    return true;
}

So the problem of telling whether an href is valid or not is now solved. Best of all, you can change it later.
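To make that concrete, here is one way such a validator could look. This is only a sketch: the whitelist pattern and the blacklist entries are read off the six criteria in the question, and the name is_valid_href_sketch is invented so it does not clash with the stub.

```php
<?php
// Sketch only: the rules below are taken from the question's criteria.
function is_valid_href_sketch($href) {
    $href = (string) $href; // also accepts null or objects with __toString()

    // Blacklist: substrings that disqualify a link (criteria 2-6).
    $blacklist = array(
        'some-site.vendor.com/jobs/search',
        'some-site.vendor.com/jobs/intro',
        'www.someothersite.com/',
        'media.someothersite.com/',
        'javascript:',
    );
    foreach ($blacklist as $needle) {
        if (strpos($href, $needle) !== false) {
            return false;
        }
    }

    // Whitelist: criterion 1, a numeric id between /jobs/ and /job.
    return preg_match('~some-site\.vendor\.com/jobs/\d+/job(?:[/?#]|$)~', $href) === 1;
}
```

With a whitelist this strict the blacklist is mostly redundant, but keeping both makes it easy to loosen one without touching the other when the rules change.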

What remains is to integrate that with the links, which means getting every link into its normalized, absolute form. This needs some more data processing, and there are several different types of URL normalization to consider.

So we create another function that encapsulates href normalization, base resolution and validation. If the href is wrong it just returns null, otherwise it returns the normalized href:

function normalize_href($href, $base) {

    // do whatever is needed ...

    // placeholder so the stub is valid PHP: return null for a bad href,
    // otherwise the normalized "href string"
    return $href;
}
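For illustration, a simplified version of such a function might look like the sketch below. It is deliberately not a full RFC 3986 resolver (no dot-segment removal, no userinfo handling), it only keeps http and https links, and the name normalize_href_sketch is invented for the example. It is a stand-in for the idea, not the implementation used in this answer.

```php
<?php
// Simplified sketch: resolve $href against an absolute $base URL.
// Returns the absolute href string, or null if the href is unusable.
function normalize_href_sketch($href, $base) {
    $href = trim((string) $href);
    if ($href === '' || $href[0] === '#') {
        return null; // empty or fragment-only links carry no target
    }
    // Explicit scheme: keep http(s), drop javascript:, mailto:, and so on.
    if (preg_match('~^([a-z][a-z0-9+.-]*):~i', $href, $m)) {
        return in_array(strtolower($m[1]), array('http', 'https'), true) ? $href : null;
    }
    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // the base must be an absolute URL
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path = isset($parts['path']) ? $parts['path'] : '/';

    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href; // protocol-relative
    }
    if ($href[0] === '/') {
        return $root . $href; // root-relative
    }
    if ($href[0] === '?') {
        return $root . $path . $href; // query-only: keep the base path
    }
    // Path-relative: replace everything after the last slash of the base path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}
```

For anything beyond a quick filter, a dedicated URL library is the safer choice, since correct reference resolution has many more edge cases than this sketch covers.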

Let's put this together. In my case I even make the href a Net_URL2 instance so the validator can benefit from it.

Naturally, if you wrap this up into closures or some classes, it gets a nicer interface. You could also consider making the XPath expression a parameter:

// get all href
if (!$links = $xpath->query('//a/@href')) {
    throw new Exception('XPath query failed.');
}

// set a base URL
$base = 'https://stackoverflow.com/questions/9894956/xpath-query-html-find-specific-hrefs-within-anchor-tags';

/**
 * @return bool
 */
function is_valid_href($href) {    
    ...
}

/**
 * @return href
 */
function normalize_href($href, $base) {
    ...
}

$joblinks = array();
foreach ($links as $attr) {
    $href = normalize_href($attr->nodeValue, $base);
    if (is_valid_href($href)) {
        $joblinks[] = $href;
    }
}

// your result is in:
var_dump($joblinks);
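The closure-based interface mentioned above, with the XPath expression as a parameter, could be sketched roughly as follows. The helper name collect_hrefs, the sample markup and the two stand-in callbacks are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical helper: runs $expression and returns the hrefs that survive
 * $normalize (returns string or null) and $validate (returns bool).
 */
function collect_hrefs(DOMXPath $xpath, $expression, callable $normalize, callable $validate) {
    $nodes = $xpath->query($expression);
    if ($nodes === false) {
        throw new Exception('XPath query failed.');
    }
    $result = array();
    foreach ($nodes as $attr) {
        $href = $normalize($attr->nodeValue);
        if ($href !== null && $validate($href)) {
            $result[] = $href;
        }
    }
    return $result;
}

// Usage with sample markup and stand-in callbacks (both made up here).
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<a href=" /jobs/23094/job ">A</a><a href="/jobs/intro">B</a>');

$joblinks = collect_hrefs(
    new DOMXPath($dom),
    '//a/@href',
    function ($href) { $href = trim($href); return $href === '' ? null : $href; },
    function ($href) { return preg_match('~/jobs/\d+/job$~', $href) === 1; }
);
```

Passing the normalizer and validator in as callables keeps the loop itself unchanged when the matching rules or the expression change.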

I've run an example on this website, and the result is:

array(122) {
  [0]=>
  object(Net_URL2)#129 (8) {
    ["_options":"Net_URL2":private]=>
    array(5) {
      ["strict"]=>
      bool(true)
      ["use_brackets"]=>
      bool(true)
      ["encode_keys"]=>
      bool(true)
      ["input_separator"]=>
      string(1) "&"
      ["output_separator"]=>
      string(1) "&"
    }
    ["_scheme":"Net_URL2":private]=>
    string(4) "http"
    ["_userinfo":"Net_URL2":private]=>
    bool(false)
    ["_host":"Net_URL2":private]=>
    string(17) "stackexchange.com"
    ["_port":"Net_URL2":private]=>
    bool(false)
    ["_path":"Net_URL2":private]=>
    string(1) "/"
    ["_query":"Net_URL2":private]=>
    bool(false)
    ["_fragment":"Net_URL2":private]=>
    bool(false)
  }
  [1]=> 

  ...

  [121]=>
  object(Net_URL2)#250 (8) {
    ["_options":"Net_URL2":private]=>
    array(5) {
      ["strict"]=>
      bool(true)
      ["use_brackets"]=>
      bool(true)
      ["encode_keys"]=>
      bool(true)
      ["input_separator"]=>
      string(1) "&"
      ["output_separator"]=>
      string(1) "&"
    }
    ["_scheme":"Net_URL2":private]=>
    string(4) "http"
    ["_userinfo":"Net_URL2":private]=>
    bool(false)
    ["_host":"Net_URL2":private]=>
    string(22) "blog.stackoverflow.com"
    ["_port":"Net_URL2":private]=>
    bool(false)
    ["_path":"Net_URL2":private]=>
    string(30) "/2009/06/attribution-required/"
    ["_query":"Net_URL2":private]=>
    bool(false)
    ["_fragment":"Net_URL2":private]=>
    bool(false)
  }
}
