XPath Query & HTML - Find Specific HREFs Within Anchor Tags
Question
I've got the required HTML data in a DOMDocument and a DOMXPath.
But I need to access and retrieve the href values in certain <a> tags. The criteria are as follows:
1. href contains: some-site.vendor.com/jobs/[#idnumber]/job (e.g. some-site.vendor.com/jobs/23094/job)
2. href does not contain: some-site.vendor.com/jobs/search?search=pr2
3. href does not contain: some-site.vendor.com/jobs/intro
4. href does not contain: www.someothersite.com/
5. href does not contain: media.someothersite.com/
6. href does not contain: javascript:void(0)
Either of these (similar) queries fetches everything but 4-6, which is a good thing:
$joblinks = $xpath->query('//a[@href[contains(., "https://some-site.vendor.com/jobs/")]]');
$joblinks = $xpath->query('//a[@href[contains(., "job")]]');
Ultimately, however, I need to access all the anchor tags containing hrefs like #1, and assign the actual href values within them to a variable/array. Here's what I'm doing:
$payload = fetchRemoteData(SPEC_SOURCE_URL);
// suppress warning(s) due to malformed markup
libxml_use_internal_errors(true);
// load the fetched contents
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($payload);
// parse and cache the required data elements
$xpath = new DOMXPath($dom);
//$joblinks = $xpath->query('//a[@href[contains(., "some-site.vendor.com/jobs/")]]');
$joblinks = $xpath->query('//a[@href[contains(., "job")]]');
foreach ($joblinks as $joblink) {
    var_dump(trim($joblink->nodeValue)); // dump hrefs here!
}
echo "\n";
This is really beating me up. I'm close, but I just can't seem to tweak the query correctly and/or access the actual href values. My humblest apologies if I've not followed protocol of any sort for this question...
ANY/ALL help would be greatly appreciated! Thanks SO MUCH in advance!
Recommended answer
I would not suggest doing this solely with XPath. First of all, you have a whitelist and a blacklist. It's not really clear what you want, so I assume this can change over time.
So what you can do is first select all the href attributes in question and return the nodes. That's what XPath is very good at, so let's use XPath:
if (!$links = $xpath->query('//a/@href')) {
    throw new Exception('XPath query failed.');
}
You now have the usual DOMNodeList in $links, and it contains zero or more DOMAttr nodes, because those are what we selected. These now need the filtering you're looking for.
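To make that concrete, here is a minimal sketch of reading the href strings out of such a list. The markup is a made-up snippet for illustration only; the point is that each item is a DOMAttr, so its `value` property (or `nodeValue`) holds the raw href string rather than the link text.

```php
<?php
// Hypothetical markup, only for illustration.
$html = '<html><body>'
      . '<a href="https://some-site.vendor.com/jobs/23094/job">Job</a>'
      . '<a href="javascript:void(0)">Menu</a>'
      . '<a name="no-href">Anchor without href</a>'
      . '</body></html>';

// Suppress warnings caused by malformed markup.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Selects the href attribute nodes themselves, not the <a> elements.
$links = $xpath->query('//a/@href');

$hrefs = array();
foreach ($links as $attr) {
    // $attr is a DOMAttr; its value is the raw href string.
    $hrefs[] = $attr->value;
}

print_r($hrefs);
```

The anchor without an href attribute is simply not part of the result, so no extra check for missing attributes is needed.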
So you have some criteria you want to match. Your description is verbose but not very specific about how the matching should work. You have a positive match but also negative matches, and in neither case do you say what should happen otherwise. So I take a shortcut here: you write yourself a function that returns either true or false depending on whether an "href" string matches the criteria:
function is_valid_href($href) {
    // do whatever you see fit ...
    return true; // or false, depending on your criteria
}
So the problem of telling whether an href is valid or not has now been solved. Best of all, you can change it later.
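As an illustration only, one possible body for is_valid_href based on the six criteria in the question might look like the sketch below. The regular expression and the blacklist substrings are assumptions derived from the example hrefs in the question, not something the answer prescribes, and they are expected to change as the requirements do.

```php
<?php
/**
 * Sketch of a validator for the criteria listed in the question.
 * Both the blacklist and the whitelist pattern are assumptions.
 *
 * @param string|null $href
 * @return bool
 */
function is_valid_href($href) {
    if (!is_string($href)) {
        return false; // e.g. null from a failed normalization
    }

    // Blacklist: reject as soon as one of these substrings occurs.
    $blacklist = array(
        'some-site.vendor.com/jobs/search',
        'some-site.vendor.com/jobs/intro',
        'www.someothersite.com/',
        'media.someothersite.com/',
        'javascript:',
    );
    foreach ($blacklist as $needle) {
        if (strpos($href, $needle) !== false) {
            return false;
        }
    }

    // Whitelist: some-site.vendor.com/jobs/<digits>/job
    return preg_match('~some-site\.vendor\.com/jobs/\d+/job~', $href) === 1;
}

var_dump(is_valid_href('https://some-site.vendor.com/jobs/23094/job'));
var_dump(is_valid_href('javascript:void(0)'));
```

Keeping the blacklist as a plain array means new exclusions can be added without touching the matching logic.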
So all that's needed to integrate that with the links is to get every link in its normalized and absolute form. This means more data processing, as there are different types of URL normalization to consider.
So we create another function that encapsulates href normalization, base resolution and validation. If the href is wrong, it just returns null, otherwise the normalized href:
function normalize_href($href, $base) {
    // do whatever is needed ...
    return $href; // the normalized href string, or null if the href is wrong
}
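For readers who want something to start from, here is a rough sketch of such a function built on PHP's parse_url. It is an assumption about what "whatever is needed" could mean, not the answer's actual implementation: it only covers absolute, protocol-relative, root-relative, query-only and simple relative hrefs, and it does not resolve dot segments such as "../".

```php
<?php
/**
 * Sketch: resolve $href against $base and return an absolute URL string,
 * or null if the href cannot be used. Not a full RFC 3986 implementation.
 *
 * @param string $href
 * @param string $base absolute base URL of the document
 * @return string|null
 */
function normalize_href($href, $base) {
    $href = trim((string) $href);
    if ($href === '' || $href[0] === '#') {
        return null; // empty or fragment-only reference
    }

    // href already carries a scheme: accept only http(s).
    if (preg_match('~^([a-z][a-z0-9+.-]*):~i', $href, $m)) {
        $scheme = strtolower($m[1]);
        return ($scheme === 'http' || $scheme === 'https') ? $href : null;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // base URL is not absolute
    }

    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href; // protocol-relative
    }

    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path = isset($parts['path']) ? $parts['path'] : '/';

    if ($href[0] === '/') {
        return $root . $href; // root-relative
    }
    if ($href[0] === '?') {
        return $root . $path . $href; // query-only: keep the base path
    }

    // Relative path: replace the last segment of the base path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

var_dump(normalize_href('/jobs/23094/job', 'https://some-site.vendor.com/jobs/intro'));
```

A dedicated URL library will handle more edge cases than this; the sketch is only meant to show where base resolution and rejection of unusable hrefs fit in.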
Let's put this together. In my case I even make the href a Net_URL2 instance so that the validator can benefit from it.
Naturally, if you wrap this up in closures or some classes, it gets a nicer interface. You could also consider making the XPath expression a parameter as well:
// get all href
if (!$links = $xpath->query('//a/@href')) {
    throw new Exception('XPath query failed.');
}

// set a base URL
$base = 'https://stackoverflow.com/questions/9894956/xpath-query-html-find-specific-hrefs-within-anchor-tags';

/**
 * @return bool
 */
function is_valid_href($href) {
    ...
}

/**
 * @return href
 */
function normalize_href($href, $base) {
    ...
}

$joblinks = array();
foreach ($links as $attr) {
    $href = normalize_href($attr->nodeValue, $base);
    if (is_valid_href($href)) {
        $joblinks[] = $href;
    }
}

// your result is in:
var_dump($joblinks);
I've run an example on this website, and the result is:
array(122) {
[0]=>
object(Net_URL2)#129 (8) {
["_options":"Net_URL2":private]=>
array(5) {
["strict"]=>
bool(true)
["use_brackets"]=>
bool(true)
["encode_keys"]=>
bool(true)
["input_separator"]=>
string(1) "&"
["output_separator"]=>
string(1) "&"
}
["_scheme":"Net_URL2":private]=>
string(4) "http"
["_userinfo":"Net_URL2":private]=>
bool(false)
["_host":"Net_URL2":private]=>
string(17) "stackexchange.com"
["_port":"Net_URL2":private]=>
bool(false)
["_path":"Net_URL2":private]=>
string(1) "/"
["_query":"Net_URL2":private]=>
bool(false)
["_fragment":"Net_URL2":private]=>
bool(false)
}
[1]=>
...
[121]=>
object(Net_URL2)#250 (8) {
["_options":"Net_URL2":private]=>
array(5) {
["strict"]=>
bool(true)
["use_brackets"]=>
bool(true)
["encode_keys"]=>
bool(true)
["input_separator"]=>
string(1) "&"
["output_separator"]=>
string(1) "&"
}
["_scheme":"Net_URL2":private]=>
string(4) "http"
["_userinfo":"Net_URL2":private]=>
bool(false)
["_host":"Net_URL2":private]=>
string(22) "blog.stackoverflow.com"
["_port":"Net_URL2":private]=>
bool(false)
["_path":"Net_URL2":private]=>
string(30) "/2009/06/attribution-required/"
["_query":"Net_URL2":private]=>
bool(false)
["_fragment":"Net_URL2":private]=>
bool(false)
}
}
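The earlier remark about closures and passing the XPath expression as a parameter could be sketched roughly as follows. The function name collect_hrefs, its signature, the sample markup and the two throwaway closures are all hypothetical and serve only to show the shape of such an interface.

```php
<?php
/**
 * Hypothetical helper: collect the hrefs selected by $expression,
 * normalized by $normalize and filtered by $isValid.
 *
 * @param DOMXPath $xpath
 * @param string   $expression XPath selecting attribute nodes
 * @param callable $normalize  function(string): string|null
 * @param callable $isValid    function(string): bool
 * @return array
 */
function collect_hrefs(DOMXPath $xpath, $expression, $normalize, $isValid) {
    $attrs = $xpath->query($expression);
    if ($attrs === false) {
        throw new Exception('XPath query failed.');
    }

    $result = array();
    foreach ($attrs as $attr) {
        $href = call_user_func($normalize, $attr->nodeValue);
        // Skip hrefs the normalizer rejected, then apply the validator.
        if ($href !== null && call_user_func($isValid, $href)) {
            $result[] = $href;
        }
    }
    return $result;
}

// Usage with made-up markup and throwaway closures.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<html><body><a href="/jobs/23094/job">A</a>'
             . '<a href="/jobs/intro">B</a></body></html>');
$xpath = new DOMXPath($dom);

$joblinks = collect_hrefs(
    $xpath,
    '//a/@href',
    function ($href) { return 'https://some-site.vendor.com' . $href; },
    function ($href) { return preg_match('~/jobs/\d+/job$~', $href) === 1; }
);

var_dump($joblinks);
```

Passing the normalizer and validator in as callables keeps the loop itself unchanged when the criteria change.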