Selective screen scraping with HTMLAgilityPack and XPath


Question

[This question has a related one: Screen scraping with htmlAgilityPack and XPath]

I have some HTML to parse, which generally looks like this:

...
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
...

I am looking for a way to parse this into meaningful chunks, but I only want selected data from each row, such as the first two td cells and the last two td cells:

(1), (2), (6), {1} CRLF
(1), (2), (6), {1} CRLF
and so on

I have tried two approaches. Approach 1:

var dataList = currentDoc.DocumentNode.Descendants("tr")
            .Select
             (
              tr => tr.Descendants("td").Select(td => td.InnerText).ToList()
             ).ToList();

This fetches the inner text of the tds but fails to fetch the link {1}. It creates a list of lists, one inner list per row, which I can process with a nested foreach.

Approach 2:

var dataList = currentDoc.DocumentNode
           .SelectNodes("//tr//td//text()|//tr//td//a//@href");

This does get me the link {1} and all the data, but the result is unorganized: everything comes back in one big flat chunk. Since the data within one tr belongs together, I now lose that relationship.

So how can I get only the data I am interested in: the first two columns and the last two columns?

Recommended answer

The following code selects the data from the first two <td> nodes and the last two <td> nodes of each row:

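// For each row, take the first two and the last two <td> cells;
// use the href when the cell holds a link with a non-empty href, otherwise the cell text.
var rows = html.DocumentNode.Descendants("tr")
    .Select(tr =>
       from td in tr.SelectNodes("td[position() < 3 or position() > last() - 2]")
       let a = td.SelectSingleNode("a[@href!='']")
       select a == null ? td.InnerText : a.Attributes["href"].Value);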
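The query above yields one sequence of values per row. As a minimal sketch of turning that into the comma-separated lines the question asks for, the following program wraps the same XPath in a helper. The names RowScraper and ExtractRows, the sample cell values, and the example.com links are hypothetical and exist only for illustration. The sketch also skips rows with no matching cells, because SelectNodes in HtmlAgilityPack returns null rather than an empty collection when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class RowScraper
{
    // Hypothetical helper: returns one comma-separated line per <tr>.
    public static List<string> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();
        foreach (var tr in doc.DocumentNode.Descendants("tr"))
        {
            // First two and last two <td> children of this row.
            var cells = tr.SelectNodes("td[position() < 3 or position() > last() - 2]");
            if (cells == null) continue; // no <td> children, e.g. a header row of <th>

            var values = cells.Select(td =>
            {
                // Prefer the href of a direct child link with a non-empty href.
                var a = td.SelectSingleNode("a[@href!='']");
                return a == null ? td.InnerText.Trim() : a.GetAttributeValue("href", "");
            });
            lines.Add(string.Join(",", values));
        }
        return lines;
    }

    public static void Main()
    {
        // Sample markup shaped like the question's table (values are made up).
        const string html = @"<table>
<tr>
<td><a href='' title=''>A1</a></td><td>A2</td><td>A3</td><td>A4</td><td>A5</td><td>A6</td>
<td><a href='http://example.com/1' class='image'><img alt='' src='' /></a></td>
</tr>
<tr>
<td><a href='' title=''>B1</a></td><td>B2</td><td>B3</td><td>B4</td><td>B5</td><td>B6</td>
<td><a href='http://example.com/2' class='image'><img alt='' src='' /></a></td>
</tr>
</table>";

        foreach (var line in ExtractRows(html))
            Console.WriteLine(line);
        // A1,A2,A6,http://example.com/1
        // B1,B2,B6,http://example.com/2
    }
}
```

One thing to watch: if the link in the first column has a non-empty href in the real page, this logic returns that href instead of the link text for column one. In that case the href lookup would need to be limited to the last cell.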

This XPath filters the nodes by position, keeping the first two td children (position() < 3) and the last two (position() > last() - 2):

td[position() < 3 or position() > last() - 2]
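To see which positions the predicate keeps, here is a small sketch that evaluates the same expression with the built-in System.Xml classes instead of HtmlAgilityPack, so it runs without any package. The PositionFilterDemo class, the Pick helper, and the generated cell names c1, c2, ... are hypothetical. It also shows that a row with fewer than four cells returns each cell once, since an XPath node-set never contains duplicates.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class PositionFilterDemo
{
    // Hypothetical helper: builds a <tr> with the given number of <td> cells
    // and returns the text of the cells kept by the position predicate.
    public static string[] Pick(int cellCount)
    {
        var cells = string.Concat(
            Enumerable.Range(1, cellCount).Select(i => $"<td>c{i}</td>"));

        var doc = new XmlDocument();
        doc.LoadXml($"<tr>{cells}</tr>");

        return doc.DocumentElement
            .SelectNodes("td[position() < 3 or position() > last() - 2]")
            .Cast<XmlNode>()
            .Select(n => n.InnerText)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Pick(7))); // c1,c2,c6,c7
        Console.WriteLine(string.Join(",", Pick(3))); // c1,c2,c3
    }
}
```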
