How to use HTMLAgilityPack to extract HTML data

Views: 48 | Published: 2020/11/24 19:53:29 | Tags: c# web-crawler html-agility-pack


Question

I am learning to write a web crawler and have found some great examples to get me started, but since I am new to this, I have a few questions about the coding approach.

For example, the search result can be found here: Search Result

When I look at the HTML source of the result, I see the following:

<HR><CENTER><H3>License Information *</H3></CENTER><HR>
<P>
<CENTER> 06/03/2014 </CENTER> <BR>
<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK            NY <BR>
<B>Profession : </B> ATHLETIC TRAINER <BR>
<B>License No: </B> 001475 <BR>
<B>Date of Licensure : </B> 01/12/07 <BR>
<B>Additional Qualification : </B>     &nbsp; Not applicable in this profession <BR>
<B> <A href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> REGISTERED <BR>
<B>Registered through last day of : </B> 08/15 <BR>

How can I use HTMLAgilityPack to scrape that data from the site?

I tried to adapt the example shown below, but I am not sure what to change to get it to crawl this page:

private void btnCrawl_Click(object sender, EventArgs e)
{
    // shellWindows, filename, txtURL and tb are members of the form, defined elsewhere.
    // Remember the URL of the open Internet Explorer window so it is still
    // available after the loop (the loop variable ie is not in scope there).
    string url = null;
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        filename = Path.GetFileNameWithoutExtension(ie.FullName).ToLower();

        if (filename.Equals("iexplore"))
        {
            url = ie.LocationURL;
            txtURL.Text = "Now Crawling: " + url;
        }
    }
    if (url == null)
        return; // no Internet Explorer window was found

    string xmlns = "{http://www.w3.org/1999/xhtml}";
    Crawler cl = new Crawler(url);
    XDocument xdoc = cl.GetXDocument();

    // Query copied from the example: selects <div class="folder-news"> blocks.
    var res = from item in xdoc.Descendants(xmlns + "div")
              where item.Attribute("class") != null && item.Attribute("class").Value == "folder-news"
              && item.Element(xmlns + "a") != null
              select new
              {
                  Link = item.Element(xmlns + "a").Attribute("href").Value,
                  Image = item.Element(xmlns + "a").Element(xmlns + "img").Attribute("src").Value,
                  Title = item.Elements(xmlns + "p").ElementAt(0).Element(xmlns + "a").Value,
                  Desc = item.Elements(xmlns + "p").ElementAt(1).Value
              };
    foreach (var node in res)
    {
        MessageBox.Show(node.ToString());
        // Append rather than assign, so earlier results are not overwritten.
        tb.AppendText(node + Environment.NewLine);
    }
}

The Crawler helper class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;

namespace CrawlerWeb
{
    public class Crawler
    {

        public string Url
        {
            get;
            set;
        }
        public Crawler() { }
        public Crawler(string Url)
        {
            this.Url = Url;
        }
        public XDocument GetXDocument()
        {
            HtmlAgilityPack.HtmlWeb doc1 = new HtmlAgilityPack.HtmlWeb();
            doc1.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
            HtmlAgilityPack.HtmlDocument doc2 = doc1.Load(Url);
            doc2.OptionOutputAsXml = true;
            doc2.OptionAutoCloseOnEnd = true;
            doc2.OptionDefaultStreamEncoding = System.Text.Encoding.UTF8;
            XDocument xdoc = XDocument.Parse(doc2.DocumentNode.SelectSingleNode("html").OuterHtml);
            return xdoc;
        }
    }
}

tb is a multiline textbox, so I would like it to display the following:

Name WILLIAMS AJAYA L

Address NEW YORK NY

Profession ATHLETIC TRAINER

License No 001475

Date of Licensure 1/12/07

Additional Qualification Not applicable in this profession

Status REGISTERED

Registered through last day of 08/15

I would like the second part of each line (the value) to be added to an array, because the next step will be to write it to a SQL database.

I am able to get the URL of the search result from IE, but how do I write the extraction in my code?
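
One possible approach, as a minimal sketch rather than a definitive answer: since HtmlAgilityPack can be queried with XPath directly, the conversion to XDocument may not be needed for this page. The sketch below assumes the page keeps the layout shown above, where each label is a <B> element and its value is the text that follows it up to the next <BR>. LicenseParser, Parse and Clean are hypothetical names introduced for illustration; on the real page there may be other <B> elements that need to be filtered out.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class LicenseParser
{
    // Hypothetical helper (not from the question): returns label/value pairs
    // in document order, for example "Name" paired with "WILLIAMS AJAYA L".
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection labels = doc.DocumentNode.SelectNodes("//b");
        if (labels == null)
            return fields;

        foreach (HtmlNode label in labels)
        {
            // "Name : " becomes "Name"; InnerText also covers the Status label,
            // whose text sits inside a nested <A> element.
            string name = Clean(label.InnerText).TrimEnd(':').Trim();

            // Assumption: the value is everything between </B> and the next <BR> or <B>.
            var value = new StringBuilder();
            for (HtmlNode n = label.NextSibling;
                 n != null && n.Name != "br" && n.Name != "b";
                 n = n.NextSibling)
            {
                value.Append(n.InnerText);
            }

            fields.Add(new KeyValuePair<string, string>(name, Clean(value.ToString())));
        }
        return fields;
    }

    // Decodes entities such as &nbsp; and collapses runs of whitespace.
    private static string Clean(string text)
    {
        return Regex.Replace(WebUtility.HtmlDecode(text), @"\s+", " ").Trim();
    }
}
```

To use it on the live page, the HTML could be taken from the document that HtmlWeb.Load(url) returns, for example via DocumentNode.OuterHtml. The array of values for the database step would then be fields.ConvertAll(f => f.Value).ToArray(), and the textbox could be filled with tb.Lines = fields.ConvertAll(f => f.Key + " " + f.Value).ToArray();.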
