Top down or bottom up approach to search elements on a HTML DOM document?


Problem description

Assume I am using a recursive loop for resilient discovery and location of DOM element(s), one that will work across the semi-structured and semi-uniform HTML DOM documents of a website.

For example, when crawling links on a website I come across small variations in their XPath location. Resilience is desired to allow flexible, uninterrupted crawling.

1) I know that I want a link which is located in a certain region of the page, distinguishable from the rest (e.g. menu, footer, header, etc.).

2) It's distinguishable since it appears to be inside a table and a paragraph or container.

3) There can be an acceptable level of unexpected parents or children before the desired link mentioned in 1), but I don't know how many. More unexpected elements would mean departure from 1).

4) Identifying via the element's id, class, or any other unique attribute value is not desired.

I think the following XPath should sum it up:

//p/table/tr/td/a

On some pages there are variations in the XPath, but the link still qualifies as the desired link from 1):

//p/div/table/tr/td/a
//p/div/span/span/table/tr/td/b/a

I have used indentation to mimic each loop iteration. (Should I use plural or singular? Children vs. child, parents vs. parent. I think singular makes sense, since the immediate parent or child is the position of interest.)

TOP DOWN searching

how many p's are there ?
 how many of these p's have a table as a child ? If none, search the next sub level.
   how many of these tables have a tr as a child ? If none, search the next sub level.
     how many of these tr's have a td as a child ? If none, search the next sub level.
      how many of these td's have an a as a child ?
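The top-down iteration above could be sketched roughly as follows. This is a minimal sketch, assuming Python with lxml is available; the HTML snippet and the `top_down` function name are made up for illustration:

```python
from lxml import etree

HTML = """<html><body>
<p><table><tr><td><a href="#x">desired</a></td></tr></table></p>
<p>no table here</p>
</body></html>"""

def top_down(elem, tags):
    """Recursively descend, consuming the expected tag chain top-down.
    An unexpected child does not consume a tag: we just search the
    next sub level, which tolerates extra wrapper elements."""
    if not tags:                      # whole chain matched: elem is the target
        yield elem
        return
    for child in elem:
        if child.tag == tags[0]:      # expected child found: consume one tag
            yield from top_down(child, tags[1:])
        else:                         # unexpected child: search next sub level
            yield from top_down(child, tags)

root = etree.fromstring(HTML)
links = list(top_down(root, ["p", "table", "tr", "td", "a"]))
```

Note this sketch explores every branch of the tree before it can conclude the link is absent, which is the inefficiency the question worries about below.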

BOTTOM UP searching

how many a's are there ?
 how many of these a's have a td as a parent ? If none, look up to the next super level.
  how many of these td's have a tr as a parent ? If none, look up to the next super level.
   how many of these tr's have a table as a parent ? If none, look up to the next super level.
    how many of these tables have a p as a parent ? If none, look up to the next super level.
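The bottom-up direction can be sketched the same way (again Python with lxml, hypothetical names and HTML). Here a `slack` parameter implements the "acceptable level of unexpected parents" from 3):

```python
from lxml import etree

HTML = """<html><body>
<p><div><table><tr><td><a href="#y">variant</a></td></tr></table></div></p>
<ul><li><a href="#menu">menu link</a></li></ul>
</body></html>"""

def bottom_up(root, chain, slack=2):
    """chain is the expected ancestor list, nearest first,
    e.g. ['td', 'tr', 'table', 'p'].  Up to `slack` unexpected
    ancestors are skipped before giving up on a link."""
    for a in root.iter("a"):
        expected = list(chain)
        unexpected = 0
        node = a.getparent()
        while node is not None and expected:
            if node.tag == expected[0]:
                expected.pop(0)      # expected parent: consume it
            else:
                unexpected += 1      # unexpected parent: look up one super level
                if unexpected > slack:
                    break
            node = node.getparent()
        if not expected:             # whole chain found: accept this link
            yield a

root = etree.fromstring(HTML)
hits = list(bottom_up(root, ["td", "tr", "table", "p"]))
```

Starting from the anchors means each candidate is accepted or rejected after walking at most a handful of ancestors, rather than after exploring the whole subtree.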

Does it matter if it's top down or bottom up? I feel that top down is useless and inefficient if it turns out, by the end of the loop, that the desired anchor link is not found.

I think I would also measure how many unexpected parents or children were discovered in each iteration of the loop, and compare that to a preset constant I am comfortable with, e.g. no more than 2. If there are 3 or more unexpected parent or child iterations before the discovery of my desired anchor link, I would assume it's not what I am looking for.

Is this the correct approach? This is just something I came up with off the top of my head. I apologize if this problem is not clear; I have tried my best. I would love to get some input on this algorithm.

Recommended answer

It seems you want something like:

//p//table//a
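As a quick check (assuming Python with lxml; the snippets simply reproduce the question's three layouts), this relaxed path matches all of the variants, since `//` allows any number of intermediate elements:

```python
from lxml import etree

# The three layouts from the question, as minimal XML snippets.
variants = [
    "<p><table><tr><td><a>1</a></td></tr></table></p>",
    "<p><div><table><tr><td><a>2</a></td></tr></table></div></p>",
    "<p><div><span><span><table><tr><td><b><a>3</a>"
    "</b></td></tr></table></span></span></div></p>",
]

# Each variant should yield exactly one matching anchor.
matches = [len(etree.fromstring(v).xpath("//p//table//a")) for v in variants]
```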

If you have to limit the number of intermediate elements in the path, say to no more than 2, then the above would be modified to:

//p[not(ancestor::*[3])]
      //table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]
               /tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]
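A quick sanity check of this expression (again Python with lxml; the two test snippets are hypothetical): a layout with at most 2 intermediate elements between p and table matches, while the deeper div/span/span layout is rejected:

```python
from lxml import etree

# The answer's expression, written as one string.
XPATH = (
    "//p[not(ancestor::*[3])]"
    "//table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]"
    "/tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]"
)

shallow = etree.fromstring(
    "<html><body><p><div><table><tr><td><a>ok</a>"
    "</td></tr></table></div></p></body></html>")
deep = etree.fromstring(
    "<html><body><p><div><span><span><table><tr><td><b><a>too deep</a>"
    "</b></td></tr></table></span></span></div></p></body></html>")

shallow_hits = shallow.xpath(XPATH)  # one intermediate div: accepted
deep_hits = deep.xpath(XPATH)        # div/span/span = 3 intermediates: rejected
```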

This selects all a elements whose parent or grandparent is a td, whose parent is a tr, whose parent is a table, whose parent or grandparent is a p that has fewer than 3 ancestor elements.
