如何使用HTMLAgilityPack提取HTML数据 [英] How to use HTMLAgilityPack to extract HTML data

查看:48
本文介绍了如何使用HTMLAgilityPack提取HTML数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在学习编写网络爬虫,并找到了一些很好的例子来使我入门,但是由于我是新手,因此我对编码方法有一些疑问.

I am learning to write web crawler and found some great examples to get me started but since I am new to this, I have a few questions in regards to the coding method.

例如,可以在此处找到搜索结果:

The search result for example can be found here: Search Result

当我查看结果的HTML源代码时,可以看到以下内容:

When I look at the HTML source for the result I can see the following:

<HR><CENTER><H3>License Information *</H3></CENTER><HR>                                                                       
<P>                                                                                                                           
       <CENTER> 06/03/2014 </CENTER> <BR>                                                                                     
<B>Name : </B> WILLIAMS AJAYA L                     <BR>                                                                      
<B>Address : </B> NEW YORK            NY                                          <BR>                                        
<B>Profession : </B> ATHLETIC TRAINER                          <BR>                                                           
<B>License No: </B> 001475 <BR>                                                                                               
<B>Date of Licensure : </B> 01/12/07      <BR>                                                                                
                                                                                                                                <B>Additional Qualification : </B>     &nbsp; Not applicable in this profession                       <BR>                    
<B> <A href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> REGISTERED                                        <BR>
<B>Registered through last day of : </B> 08/15      <BR>

如何使用HTMLAgilityPack从网站上抓取这些数据?

How can I use the HTMLAgilityPack to scrap those data from the site?

我试图实现一个如下所示的示例,但是不确定在哪里进行编辑以使其能够爬网:

I was trying to implement an example as shown below, but not sure where to make the edit to get it working to crawl the page:

private void btnCrawl_Click(object sender, EventArgs e)
        {
            foreach (SHDocVw.InternetExplorer ie in shellWindows)
            {
                filename = Path.GetFileNameWithoutExtension( ie.FullName ).ToLower();

                if ( filename.Equals( "iexplore" ) )
                txtURL.Text = "Now Crawling: " + ie.LocationURL.ToString();
            }
            string url = ie.LocationURL.ToString();
            string xmlns = "{http://www.w3.org/1999/xhtml}";
            Crawler cl = new Crawler(url);
            XDocument xdoc = cl.GetXDocument();
            var res = from item in xdoc.Descendants(xmlns + "div")
                      where item.Attribute("class") != null && item.Attribute("class").Value == "folder-news"
                      && item.Element(xmlns + "a") != null
                      //select item;
                      select new
                      {
                          Link = item.Element(xmlns + "a").Attribute("href").Value,
                          Image = item.Element(xmlns + "a").Element(xmlns + "img").Attribute("src").Value,
                          Title = item.Elements(xmlns + "p").ElementAt(0).Element(xmlns + "a").Value,
                          Desc = item.Elements(xmlns + "p").ElementAt(1).Value
                      };
            foreach (var node in res)
            {
                MessageBox.Show(node.ToString());
                tb.Text = node + "\n";
            }
            //Console.ReadKey();                   
        }

Crawler助手类:

The Crawler helper class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;

namespace CrawlerWeb
{
    public class Crawler
    {

        public string Url
        {
            get;
            set;
        }
        public Crawler() { }
        public Crawler(string Url)
        {
            this.Url = Url;
        }
        public XDocument GetXDocument()
        {
            HtmlAgilityPack.HtmlWeb doc1 = new HtmlAgilityPack.HtmlWeb();
            doc1.UserAgent = "Mozilla/4.0 (conpatible; MSIE 7.0; Windows NT 5.1)";
            HtmlAgilityPack.HtmlDocument doc2 = doc1.Load(Url);
            doc2.OptionOutputAsXml = true;
            doc2.OptionAutoCloseOnEnd = true;
            doc2.OptionDefaultStreamEncoding = System.Text.Encoding.UTF8;
            XDocument xdoc = XDocument.Parse(doc2.DocumentNode.SelectSingleNode("html").OuterHtml);
            return xdoc;
        }
    }
}

tb是一个多行文本框...所以我希望它显示以下内容:

tb is a multiline textbox... So I would like it to display the following:

Name WILLIAMS AJAYA L

Address NEW YORK NY

Profession ATHLETIC TRAINER

License No 001475

Date of Licensure 1/12/07

Additional Qualification Not applicable in this profession

Status REGISTERED

Registered through last day of 08/15

我希望将第二个参数添加到数组中,因为下一步将是写入SQL数据库...

I would like the second argument to be added to an array because next step would be to write to a SQL database...

我可以从具有搜索结果的IE中获取URL,但是如何在脚本中进行编码?

I am able to get the URL from the IE which has the search result but how can I code it in my script?

推荐答案

此小片段应该可以帮助您入门:

This little snippet should get you started:

HtmlDocument doc = new HtmlDocument();
WebClient client = new WebClient();
string html = client.DownloadString("http://www.nysed.gov/coms/op001/opsc2a?profcd=67&plicno=001475&namechk=WIL");
doc.LoadHtml(html);

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");

基本上,您可以使用WebClient类下载HTML文件,然后将该HTML加载到HtmlDocument对象中.然后,您需要使用XPath查询DOM树并搜索节点.在上面的示例中,节点"将包括文档中的所有div元素.

You basically use the WebClient class to download the HTML file and then you load that HTML into the HtmlDocument object. Then you need to use XPath to query the DOM tree and search for nodes. In the above example "nodes" will include all the div elements in the document.

以下是有关XPath语法的快速参考: http://msdn.microsoft.com/zh-CN/library/ms256086(v = vs.110).aspx

Here's a quick reference about the XPath syntax: http://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx

这篇关于如何使用HTMLAgilityPack提取HTML数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆