Top-down or bottom-up approach to searching for elements in an HTML DOM document?
Question
Assume I am using a recursive loop for resilient discovery and location of DOM element(s), one that works across the semi-structured, semi-uniform HTML DOM documents of a website.
For example, when crawling links on a website, I come across small variations in a link's XPath location. Resilience is desired to allow flexible, uninterrupted crawling.
1) I know that I want a link located in a certain region of the page that is distinguishable from the rest (e.g. menu, footer, header).
2) It is distinguishable because it appears to be inside a table and a paragraph or container.
3) There can be an acceptable level of unexpected parents or children before the desired link mentioned in 1), but I don't know in advance what they will be. More unexpected elements would mean a departure from 1).
4) Identifying the link via an element's id, class, or any other unique attribute value is not desired.
I think the following XPath sums it up:
`//p/table/tr/td/a`
On some pages there are variations in the XPath, but the link still qualifies as the desired link from 1):
`//p/div/table/tr/td/a`
or `//p/div/span/span/table/tr/td/b/a`
I have used indentation to mimic each loop iteration.
(Should I use plural or singular: children vs. child, parents vs. parent? I think singular makes sense, since only the immediate parent or child is of concern here.)
Top-down search:
how many p's are there?
  how many of these p's have a table as a child? If none, search the next sub-level.
    how many of these tables have a tr as a child? If none, search the next sub-level.
      how many of these tr's have a td as a child? If none, search the next sub-level.
        how many of these td's have an a as a child?
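The top-down steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the page is well-formed enough for Python's standard `xml.etree.ElementTree` to parse, and the names `top_down` and `SAMPLE` are hypothetical, not from the question. At each level the sketch prefers direct children with the expected tag and drops one level deeper only when there are none.

```python
import xml.etree.ElementTree as ET


def top_down(node, chain):
    """Yield elements reached by matching `chain` downward from `node`.

    Direct children carrying the next expected tag are preferred; only if
    there are none does the search continue one level deeper.
    """
    if not chain:
        yield node
        return
    direct = [child for child in node if child.tag == chain[0]]
    if direct:
        for child in direct:
            yield from top_down(child, chain[1:])
    else:
        for child in node:
            yield from top_down(child, chain)


# Hypothetical sample page: the second link sits under an extra <div>.
SAMPLE = """
<html><body>
  <p><table><tr><td><a href="/one">one</a></td></tr></table></p>
  <p><div><table><tr><td><a href="/two">two</a></td></tr></table></div></p>
  <div><a href="/menu">menu</a></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
links = [a.get("href") for a in top_down(root, ["p", "table", "tr", "td", "a"])]
print(links)  # ['/one', '/two']
```

Note that this version puts no limit on how deep it will descend, so a branch without the link is walked to its leaves before it is given up.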
Bottom-up search:
how many a's are there?
  how many of these a's have a td as parent? If none, look up to the next super-level.
    how many of these td's have a tr as parent? If none, look up to the next super-level.
      how many of these tr's have a table as parent? If none, look up to the next super-level.
        how many of these tables have a p as parent? If none, look up to the next super-level.
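A bottom-up pass might look like the sketch below. It assumes the standard `xml.etree.ElementTree` parser, which keeps no parent pointers, so a child-to-parent map is built first; `bottom_up` and `SAMPLE` are hypothetical names for illustration. Each candidate link walks up its ancestors, ticking off the expected tags and counting every ancestor it has to skip.

```python
import xml.etree.ElementTree as ET


def bottom_up(root, chain):
    """Yield (link, unexpected) for each chain[-1] element whose ancestors
    contain the rest of `chain` in order; `unexpected` counts skipped ancestors."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for link in root.iter(chain[-1]):
        remaining = list(chain[:-1])  # tags still to find, outermost first
        unexpected = 0
        node = parent_of.get(link)
        while node is not None and remaining:
            if node.tag == remaining[-1]:
                remaining.pop()       # expected ancestor found
            else:
                unexpected += 1       # skip it and look one level higher
            node = parent_of.get(node)
        if not remaining:
            yield link, unexpected


# Hypothetical sample page: the second link sits under an extra <div>.
SAMPLE = """
<html><body>
  <p><table><tr><td><a href="/one">one</a></td></tr></table></p>
  <p><div><table><tr><td><a href="/two">two</a></td></tr></table></div></p>
  <div><a href="/menu">menu</a></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
results = [(a.get("href"), n) for a, n in bottom_up(root, ["p", "table", "tr", "td", "a"])]
print(results)  # [('/one', 0), ('/two', 1)]
```

Starting from the links means a page with no `a` elements costs a single scan, and the walk for each candidate is bounded by its depth in the tree.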
Does it matter whether the search is top-down or bottom-up? I feel that top-down is wasteful if, by the end of the loop, it turns out that the desired anchor link is not there.
I think I would also count how many unexpected parents or children were discovered in each iteration of the loop and compare that count to a preset constant I am comfortable with, say no more than 2. If there are 3 or more unexpected parents or children before my desired anchor link is discovered, I would assume it is not what I am looking for.
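The threshold idea can be tried on plain tag paths before touching a real DOM. The sketch below is one possible reading, assuming the limit applies to the total number of unexpected elements between the `p` and the link; `count_unexpected`, `is_desired` and `MAX_UNEXPECTED` are hypothetical names.

```python
MAX_UNEXPECTED = 2  # the preset constant ("no more than 2")


def count_unexpected(actual, expected):
    """Count tags in `actual` that are skipped while matching `expected`
    from the link upward; return None if `expected` cannot be matched."""
    if not actual or actual[-1] != expected[-1]:
        return None                   # the path does not end at the link tag
    i = len(expected) - 1             # index of the next expected tag
    extras = 0
    for tag in reversed(actual):
        if tag == expected[i]:
            i -= 1
            if i < 0:
                return extras         # whole chain matched
        else:
            extras += 1               # unexpected parent
    return None


def is_desired(actual, expected, limit=MAX_UNEXPECTED):
    extras = count_unexpected(actual, expected)
    return extras is not None and extras <= limit


EXPECTED = ["p", "table", "tr", "td", "a"]
for path in ("p/table/tr/td/a",
             "p/div/table/tr/td/a",
             "p/div/span/span/table/tr/td/b/a"):
    tags = path.split("/")
    print(path, count_unexpected(tags, EXPECTED), is_desired(tags, EXPECTED))
# p/table/tr/td/a 0 True
# p/div/table/tr/td/a 1 True
# p/div/span/span/table/tr/td/b/a 4 False
```

Under this reading the longest variation contains four unexpected elements (`div`, `span`, `span`, `b`), so a limit of 2 would reject it. The constant would have to be raised, or applied per gap rather than in total, for that page to qualify.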
Is this the correct approach? This is just something I came up with off the top of my head. I apologize if the problem is not clear; I have tried my best. I would love to get some input on this algorithm.
Recommended answer
It seems that you want something like:
//p//table//a
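As a quick check, the expression above can be run with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset and requires a relative path, hence the leading dot. The sample markup is hypothetical and assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page covering the three path variations from the question.
SAMPLE = """
<html><body>
  <p><table><tr><td><a href="/one">one</a></td></tr></table></p>
  <p><div><table><tr><td><a href="/two">two</a></td></tr></table></div></p>
  <p><div><span><span><table><tr><td><b><a href="/three">three</a></b></td></tr></table></span></span></div></p>
  <div><a href="/menu">menu</a></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
# ".//" is ElementTree's relative form of the leading "//".
hrefs = [a.get("href") for a in root.findall(".//p//table//a")]
print(hrefs)  # ['/one', '/two', '/three']
```

This form accepts any number of intermediate elements, so even the deepest variation matches while the menu link is excluded. ElementTree does not implement the `ancestor` axis, so a bounded expression that relies on it needs a full XPath 1.0 engine.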
If you want to limit the number of intermediate elements in the path, say to no more than 2, then the above would be modified to:
//p[not(ancestor::*[3])]
//table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]
/tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]
This selects all `a` elements whose parent or grandparent is a `td`, whose parent is a `tr`, whose parent is a `table`, whose parent or grandparent is a `p` that has fewer than 3 ancestor elements.