Advanced HTML Agility Pack usage


Question

I am pretty new to the HTML Agility Pack, so I need some help with where to go next. I can do some simple things, like pull a value from an href (knowing the URL string I was looking for), and I can pull the value in a span based on a specific class that was being used. But I do not understand how to use the HTML Agility Pack in a situation where there are a ton of nested tags and there is not one real solid anchor to tie to.

Here is an actual chunk of the HTML I am scraping. I placed dummy data in the cells to demonstrate what I am looking for.

What is the best way to extract the following:

1.) Company Name?

2.) Phone Number?

3.) Email Address?

HTML....

<td>
  <!-- Company Info -->
  <table cellpadding="0" cellspacing="0" border="0">
    <tr>
      <td class="black">
        <table cellspacing="1" cellpadding="0" border="0" width="370">
          <tr>
            <th>COMPANY NAME</th>
          </tr>
          <tr>
            <td class="search">
              <table cellpadding="5" cellspacing="0" border="0" width="100%">
                <tr>
                  <td>
                    <table cellpadding="1" cellspacing="0" border="0" width="100%">
                      <tr>
                        <td colspan="2" align="center">Un-needed Links...</td>
                      </tr>
                      <tr>
                        <td align="center" colspan="2"><hr></td>
                      </tr>
                      <tr>
                        <td align="right" nowrap>
                          <b>
                            <font color="FF0000">
                              Contact Person&nbsp;
                              <img src="/images/icon_contact.gif" align="absmiddle">&nbsp;:
                            </font>
                          </b>
                        </td>
                        <td align="left" width="100%">&nbsp;Judy Smith</td>
                      </tr>
                      <tr>
                        <td align="right" nowrap>
                        <b><font color="FF0000">Phone Number&nbsp;<img src="/images/icon_phone.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;555-555-5555</td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">E-mail Address&nbsp;<img src="/images/icon_email.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;<a HREF="mailto:judy.smith@companyname.com">judy.smith@companyname.com</a></td>
                      </tr>
                      <tr>
                        <td align="center" colspan="2"><hr></td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">Home Office Location&nbsp;<img src="/images/icon_home.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;ATLANTA, GA</td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">Home Office Phone&nbsp;<img src="/images/icon_home.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;555-555-5555</td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">Home Office Fax&nbsp;<img src="/images/icon_home.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;666-666-6666</td>
                      </tr>
                      <tr>
                        <td align="center" colspan="2"><hr></td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">Broker MC Number&nbsp;<img src="/images/icon_number.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;123456</td>
                      </tr>
                      <tr>
                        <td align="right" nowrap><b><font color="FF0000">Carrier MC Number&nbsp;<img src="/images/icon_number.gif" align="absmiddle">&nbsp;:</font></b></td>
                        <td align="left" width="100%">&nbsp;654321</td>
                      </tr>
                    </table>
                  </td>
                </tr>
              </table>
            </td>
          </tr>
        </table>
      </td>
    </tr>
  </table>
  <br>

  <!-- Starting Point -->
  <table cellpadding="0" cellspacing="0" border="0">
    <tr>
      <td class="black">
        <table cellspacing="1" cellpadding="0" border="0" width="370">
          <tr>
            <th>Starting Point</th>
            <th>Available</th>
          </tr>
          <tr>
            <td class="search" width="270">&nbsp;<b>ABBEVILLE, GA&nbsp;</b></td>
            <td class="search" align="center" width="100"><span style="color: forestgreen">&nbsp;1/5/11&nbsp;</span></td>
          </tr>
        </table>
      </td>
    </tr>

  </table>
  <br>
  <!-- Destination Point -->
  <table cellpadding="0" cellspacing="0" border="0">
    <tr>
      <td class="black">
        <table cellspacing="1" cellpadding="0" border="0" width="370">
          <tr>
            <th>Destination Point</th>
            <th>Direction</th>
          </tr>
          <tr>
            <td class="search" width="270">&nbsp;<b>ATLANTA, GA&nbsp;</b></td>
            <td class="search" align="center" width="100"><span style="color: FF0000">&nbsp;&nbsp;</span></td>
          </tr>
        </table>
      </td>

    </tr>
  </table>
  <br>
  <!-- Truck Details -->
  <table cellpadding="0" cellspacing="0" border="0">
    <tr>
      <td class="black">
        <table cellspacing="1" cellpadding="0" border="0" width="370">
          <tr>
            <th>Truck Details</th>
          </tr>
          <tr>
            <td class="search">
              <table cellpadding="5" cellspacing="0" border="0">
                <tr>
                  <td>
                    <table cellpadding="0" cellspacing="0" border="0">
                      <tr>
                        <td align="right"><b>Date Posted&nbsp;:</b></td>
                        <td align="left">&nbsp;&nbsp;1/5/2011 10:34:48 AM</td>
                      </tr>
                      <tr>
                        <td align="right"><b>Quantity&nbsp;:</b></td>
                        <td align="left">&nbsp;&nbsp;1</td>
                      </tr>
                      <tr>
                        <td align="right"><b>Equipment Type&nbsp;:</b></td>
                        <td align="left">&nbsp;&nbsp;FT</td>
                      </tr>
                      <tr>
                        <td align="right"><b>Load Size&nbsp;:</b></td>
                        <td align="left">&nbsp;&nbsp;Full</td>
                      </tr>
                      <tr>
                        <td align="right" valign="top"><b>Special Information&nbsp;:</b></td>
                        <td align="left">&nbsp;&nbsp;</td>
                      </tr>
                    </table>
                  </td>
                </tr>
              </table>
            </td>
          </tr>
        </table>
      </td>
    </tr>
  </table>
  <br>
</td>

....More HTML

Recommended answer

Well, you have to understand XPath to really take advantage of the HTML Agility Pack's scraping capabilities :-) You can search for XPath examples to start with.

Focusing on the screen-scraping question, the tricky part is choosing the XPath expression you think is the most discriminating for the information you want to get. Most of the time there is more than one solution, and you must be prepared to update your code as the target site's HTML evolves.

So it is a trade-off. Very simple expressions risk matching unwanted text, while overly discriminating expressions do not tolerate changes in the scraped HTML and risk matching nothing at all.

Your specific text is a good real-world example, and here is some code that handles it:

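// requires: using System; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourText); // yourText holds the HTML chunk shown in the question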

string companyName = doc.DocumentNode.SelectSingleNode("/td/table/tr/td/table/tr/th").InnerText;
Console.WriteLine("company name=" + companyName);

// another way
companyName = doc.DocumentNode.SelectSingleNode("//td[@class='black']/table/tr/th").InnerText;
Console.WriteLine("company name=" + companyName);

// a more advanced XPath expression, which means:
// "Select a TD tag anywhere in the doc that has a preceding sibling TD with a B child, which has a FONT child whose inner text starts with 'Phone Number'"
string phoneNumber = doc.DocumentNode.SelectSingleNode("//td[starts-with(preceding-sibling::td/b/font/text(), 'Phone Number')]").InnerText;
Console.WriteLine("phone Number=" + phoneNumber);

// same idea, but then step down into the A tag inside the matched TD
string email = doc.DocumentNode.SelectSingleNode("//td[starts-with(preceding-sibling::td/b/font/text(), 'E-mail')]/a").InnerText;
Console.WriteLine("email=" + email);
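
The code above assumes that every expression matches. `SelectSingleNode` returns null when nothing matches, and `InnerText` typically keeps entities such as `&nbsp;`, so the phone number would come back with a leading `&nbsp;`. A minimal sketch of a guard for both cases follows; the class `ScrapeText` and the method `TextOrNull` are hypothetical names introduced only for illustration, not part of the library.

```csharp
using System;
using HtmlAgilityPack;

static class ScrapeText
{
    // Hypothetical helper: returns the cleaned inner text of the first match,
    // or null when the XPath expression matches nothing.
    public static string TextOrNull(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null; // nothing matched, the page layout may have changed
        }

        // Decode entities such as &nbsp;, then turn non-breaking spaces
        // into plain spaces so that Trim() removes them.
        string decoded = HtmlEntity.DeEntitize(node.InnerText);
        return decoded.Replace('\u00A0', ' ').Trim();
    }
}
```

It could be called as `ScrapeText.TextOrNull(doc, "//td[starts-with(preceding-sibling::td/b/font/text(), 'Phone Number')]")`, with the caller deciding what to do when the result is null.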

PS: please note that the HTML Agility Pack always expects the tags used in XPath expressions to be lowercase, even if they are not lowercase in the original HTML text.
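
As a small illustration of the lowercase rule, the sample writes the e-mail link with an uppercase `HREF` attribute, yet the expression refers to it in lowercase. This is only a sketch; the class `LowercaseDemo` and the method `FirstMailto` are hypothetical names, and it assumes attribute names are normalized to lowercase in the same way as tag names.

```csharp
using System;
using HtmlAgilityPack;

static class LowercaseDemo
{
    // Hypothetical helper: returns the href of the first mailto link, or null.
    public static string FirstMailto(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The source may say <A HREF="...">, but the XPath uses lowercase names.
        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a[starts-with(@href, 'mailto:')]");
        return link == null ? null : link.GetAttributeValue("href", "");
    }
}
```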

As you can see, the company name is retrieved here using two different expressions. They both work on the sample, but the first one will break if a new tag is added anywhere along the path. The second one is more future-proof, but it relies on a CSS class name that may also change. It is always a trade-off.
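
One way to see this trade-off in practice is to count how many nodes an expression matches. On the sample, `//td[@class='black']/table/tr/th` should match six header cells (the company name plus the other section headers), and `SelectSingleNode` simply returns the first one in document order. A rough sketch of such a check is below; `MatchCount.Count` is a hypothetical helper, and it relies on `SelectNodes` returning null rather than an empty collection when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

static class MatchCount
{
    // Hypothetical helper: number of nodes matched by an XPath expression.
    public static int Count(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count; // null means no match
    }
}
```

A count of zero suggests the expression is too strict for the current page, while a count above one means the result depends on document order.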

The phone number and e-mail expressions are similar to each other, and they show the power of XPath.
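
Because the phone and e-mail expressions share the same shape, the label can be turned into a parameter. The following is a sketch under a few assumptions: `LabelLookup.ValueFor` is a hypothetical helper, the label must not contain an apostrophe (it is pasted straight into the XPath string), and the row layout is the two-cell label/value layout from the sample.

```csharp
using System;
using HtmlAgilityPack;

static class LabelLookup
{
    // Hypothetical helper: returns the text of the cell that follows a label cell,
    // or null when no label starts with the given text.
    public static string ValueFor(HtmlDocument doc, string label)
    {
        if (label.Contains("'"))
        {
            throw new ArgumentException("label must not contain an apostrophe", nameof(label));
        }

        // Same pattern as the phone number expression, with the label substituted.
        string xpath = "//td[starts-with(preceding-sibling::td/b/font/text(), '" + label + "')]";
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(xpath);
        if (cell == null)
        {
            return null;
        }

        // Decode &nbsp; and similar entities, then strip surrounding spaces.
        return HtmlEntity.DeEntitize(cell.InnerText).Replace('\u00A0', ' ').Trim();
    }
}
```

With this, `LabelLookup.ValueFor(doc, "Phone Number")` and `LabelLookup.ValueFor(doc, "Home Office Fax")` use the same code path. A label whose text begins with whitespace in the source, as the Contact Person cell does in the sample as formatted here, would probably need `normalize-space()` or `contains()` instead of `starts-with()`.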
