XPath Query & HTML - Find Specific HREFs Within Anchor Tags

Views: 29 | Published: 2021/10/2 19:44:49 | Tags: php, xpath


Question

I've got the required HTML data loaded in a DOMDocument and a DOMXPath.

But I need to access and retrieve the href values in certain <a> tags. The criteria are as follows:

1. href contains: some-site.vendor.com/jobs/[#idnumber]/job (e.g. some-site.vendor.com/jobs/23094/job)
2. href does not contain: some-site.vendor.com/jobs/search?search=pr2
3. href does not contain: some-site.vendor.com/jobs/intro
4. href does not contain: www.someothersite.com/
5. href does not contain: media.someothersite.com/
6. href does not contain: javascript:void(0)

Either of these (similar) queries fetches everything but 4-6, which is a good thing:

$joblinks = $xpath->query('//a[@href[contains(., "https://some-site.vendor.com/jobs/")]]');    
$joblinks = $xpath->query('//a[@href[contains(., "job")]]');

Ultimately, however, I need to access all the anchor tags containing hrefs like #1, and assign the actual href values within them to a variable/array. Here's what I'm doing:

$payload = fetchRemoteData(SPEC_SOURCE_URL);

// suppress warning(s) due to malformed markup
libxml_use_internal_errors(true);

// load the fetched contents
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($payload);

// parse and cache the required data elements
$xpath = new DOMXPath($dom);

//$joblinks = $xpath->query('//a[@href[contains(., "some-site.vendor.com/jobs/")]]');
$joblinks = $xpath->query('//a[@href[contains(., "job")]]');
foreach($joblinks as $joblink) {
    var_dump(trim($joblink->nodeValue)); // dump hrefs here!
}
echo "\n";

This is really beating me up - I'm close, but I just can't seem to tweak the query correctly and/or access the actual href values. My humblest apologies if I've not followed protocol of any sort for this question...

ANY/ALL help would be greatly appreciated! Thanks SO MUCH in advance!

Recommended Answer

I would not suggest doing this solely with XPath. First of all, you have both a whitelist and a blacklist. It's not really clear what you want, so I assume this can change over time.

So what you can do is first select all the href attributes in question and return the nodes. That's what XPath is very good at, so let's use XPath:

if (!$links = $xpath->query('//a/@href')) {
    throw new Exception('XPath query failed.');
}

You now have the usual DOMNodeList in $links, and it contains zero or more DOMAttr nodes, since that is what we selected. These now need the filtering you're looking for.
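As a minimal sketch of what that node list looks like in practice (the sample markup is an assumption; only the URLs come from the question), each DOMAttr exposes the href string through its value property:

```php
<?php
// Assumed sample markup; the URLs are borrowed from the question's criteria.
$html = '<p><a href="https://some-site.vendor.com/jobs/23094/job">Job</a>'
      . '<a href="javascript:void(0)">Menu</a><a name="top">No href</a></p>';

libxml_use_internal_errors(true); // tolerate malformed markup
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$hrefs = array();
foreach ($xpath->query('//a/@href') as $attr) {
    // Each item is a DOMAttr; its value (also available as nodeValue)
    // is the raw href string, not the link text.
    $hrefs[] = $attr->value;
}
print_r($hrefs);
```

The anchor without an href attribute is simply not selected, and nothing has been filtered yet.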

So you have some criteria you want to match. You were verbose, but not very specific about how that should work. You have a positive match but also negative matches, and in neither case do you say what should happen when a link fits none of them. So I take a shortcut here: write yourself a function that returns true or false depending on whether an "href" string matches the criteria:

function is_valid_href($href) {

    // do whatever you see fit ...

    return true; // or false, when $href does not match your criteria
}

So the problem of telling whether an href is valid or not has been solved. Best thing: you can change it later.
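For illustration only, here is one way such a validator might look for the criteria listed in the question. The name is_job_href, the regular expression, and the blacklist array are my assumptions, not something the question pins down:

```php
<?php
// Hypothetical validator built from the question's criteria; adjust to taste.
function is_job_href($href) {
    $href = (string) $href;

    // Whitelist: .../jobs/<numeric id>/job on the vendor host.
    if (!preg_match('~some-site\.vendor\.com/jobs/\d+/job~', $href)) {
        return false;
    }

    // Blacklist: kept explicit so it can change independently of the whitelist.
    $blocked = array(
        '/jobs/search',
        '/jobs/intro',
        'www.someothersite.com/',
        'media.someothersite.com/',
        'javascript:',
    );
    foreach ($blocked as $needle) {
        if (strpos($href, $needle) !== false) {
            return false;
        }
    }
    return true;
}
```

With these rules the whitelist alone already rejects most of the blacklisted links; the explicit list only matters once the whitelist is loosened.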

So all that's needed before integrating this with the links is to get every link into its normalized and absolute form. This means more data processing; see the different types of URL normalization for more details.

So we create another function that encapsulates href normalization, base resolution and validation. If the href is wrong, it just returns null; otherwise it returns the normalized href:

function normalize_href($href, $base) {

    // do whatever is needed ...

    return $href; // the normalized href string, or null if the href is wrong
}
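As one possible reading of that skeleton, the sketch below resolves an href against a base URL using only parse_url. It is a simplified stand-in under stated assumptions, not a full RFC 3986 resolver: it does not collapse "../" segments, and the name normalize_href_sketch is hypothetical.

```php
<?php
// Hypothetical, simplified stand-in for normalize_href(); common cases only.
function normalize_href_sketch($href, $base) {
    $href = trim((string) $href);

    // Nothing usable: empty values and fragment-only links.
    if ($href === '' || $href[0] === '#') {
        return null;
    }

    // Already has a scheme: accept only http(s), which rejects
    // javascript:, mailto: and similar.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return preg_match('~^https?://~i', $href) ? $href : null;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // the base itself must be absolute
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = isset($parts['path']) ? $parts['path'] : '/';

    if (strpos($href, '//') === 0) {   // protocol-relative: //host/path
        return $parts['scheme'] . ':' . $href;
    }
    if ($href[0] === '/') {            // root-relative: /path
        return $origin . $href;
    }
    if ($href[0] === '?') {            // query-only: keep the base path
        return $origin . $basePath . $href;
    }

    // Path-relative: resolve against the directory of the base path.
    $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);
    return $origin . $dir . $href;
}
```

A dedicated URL library (the answer uses Net_URL2) handles the remaining edge cases and is the safer choice for real data.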

Let's put this together. In my case I even make the href a Net_URL2 instance so the validator can benefit from it.

Naturally, if you wrap this up into closures or some classes, it gets a nicer interface. You could also consider making the XPath expression a parameter:

// get all href
if (!$links = $xpath->query('//a/@href')) {
    throw new Exception('XPath query failed.');
}

// set a base URL
$base = 'https://stackoverflow.com/questions/9894956/xpath-query-html-find-specific-hrefs-within-anchor-tags';

/**
 * @return bool
 */
function is_valid_href($href) {
    // ...
}

/**
 * @return href
 */
function normalize_href($href, $base) {
    // ...
}

$joblinks = array();
foreach ($links as $attr) {
    $href = normalize_href($attr->nodeValue, $base);
    if (is_valid_href($href)) {
        $joblinks[] = $href;
    }
}

// your result is in:
var_dump($joblinks);

I've run an example on this website, and the result is:

array(122) {
  [0]=>
  object(Net_URL2)#129 (8) {
    ["_options":"Net_URL2":private]=>
    array(5) {
      ["strict"]=>
      bool(true)
      ["use_brackets"]=>
      bool(true)
      ["encode_keys"]=>
      bool(true)
      ["input_separator"]=>
      string(1) "&"
      ["output_separator"]=>
      string(1) "&"
    }
    ["_scheme":"Net_URL2":private]=>
    string(4) "http"
    ["_userinfo":"Net_URL2":private]=>
    bool(false)
    ["_host":"Net_URL2":private]=>
    string(17) "stackexchange.com"
    ["_port":"Net_URL2":private]=>
    bool(false)
    ["_path":"Net_URL2":private]=>
    string(1) "/"
    ["_query":"Net_URL2":private]=>
    bool(false)
    ["_fragment":"Net_URL2":private]=>
    bool(false)
  }
  [1]=> 

  ...

  [121]=>
  object(Net_URL2)#250 (8) {
    ["_options":"Net_URL2":private]=>
    array(5) {
      ["strict"]=>
      bool(true)
      ["use_brackets"]=>
      bool(true)
      ["encode_keys"]=>
      bool(true)
      ["input_separator"]=>
      string(1) "&"
      ["output_separator"]=>
      string(1) "&"
    }
    ["_scheme":"Net_URL2":private]=>
    string(4) "http"
    ["_userinfo":"Net_URL2":private]=>
    bool(false)
    ["_host":"Net_URL2":private]=>
    string(22) "blog.stackoverflow.com"
    ["_port":"Net_URL2":private]=>
    bool(false)
    ["_path":"Net_URL2":private]=>
    string(30) "/2009/06/attribution-required/"
    ["_query":"Net_URL2":private]=>
    bool(false)
    ["_fragment":"Net_URL2":private]=>
    bool(false)
  }
}
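The earlier remark about closures and passing the XPath expression as a parameter could be sketched roughly like this. The function name collect_hrefs, its signature, the sample markup and the two trivial callbacks are all assumptions made for illustration:

```php
<?php
// Hypothetical wrapper: the XPath expression and both callbacks are parameters.
function collect_hrefs(DOMXPath $xpath, $expression, $normalize, $isValid) {
    $nodes = $xpath->query($expression);
    if ($nodes === false) {
        throw new Exception('XPath query failed.');
    }

    $result = array();
    foreach ($nodes as $attr) {
        // Normalize first; a null result means the href is unusable.
        $href = call_user_func($normalize, $attr->nodeValue);
        if ($href !== null && call_user_func($isValid, $href)) {
            $result[] = $href;
        }
    }
    return $result;
}

// Usage with assumed sample markup and deliberately trivial callbacks.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<a href="/jobs/23094/job">A</a><a href="/jobs/intro">B</a>');
$xpath = new DOMXPath($dom);

$joblinks = collect_hrefs(
    $xpath,
    '//a/@href',
    function ($href) {
        $href = trim($href);
        return $href === '' ? null : $href;
    },
    function ($href) {
        return (bool) preg_match('~/jobs/\d+/job~', $href);
    }
);
var_dump($joblinks);
```

Keeping the rules in callbacks means the whitelist and blacklist can change without touching the loop.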
