Scraping a webpage with C# and HTMLAgility

Question

I have read that HTMLAgility 1.4 is a great solution for scraping a webpage. As a new programmer, I am hoping to get some input on this project. I am building it as a C# forms application. The page I am working with is fairly straightforward. The information I need is stuck between just 2 tags. My goal is to pull the data for Part-Num, Manu-Number, Description, Manu-Country, Last Modified, and Last Modified By out of the page and send it to a SQL table. One twist is that there is also a small PNG image that needs to be grabbed from the src="/partcode/number.

I do not have any completed code that works. I thought this bit of code would tell me whether I am heading in the right direction, but even when stepping through it in the debugger I can't see that it does anything. Could someone point me in the right direction? The more detail the better, since it is apparent I have a lot to learn. Thank you, I would really appreciate it.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.Xml;

namespace Stats
{
    class PartParser
    {
        static void Main(string[] args)
        {
            try
            {
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml("http://localhost");                      // my understanding is that this reads the entire page in?
                var tables = doc.DocumentNode.SelectNodes("//table");  // I assume this sets up the search for words containing table
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.WriteLine(ex.StackTrace);
                Console.ReadKey();
            }
        }
    }
}

The web code is:

<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
<title>Part Number Database: Item Record</title>

<table class="data">
<tr><td>Part-Num</td><td width="50"></td><td><img src="/partcode/number/072140" alt="072140"/></td></tr>
<tr><td>Manu-Number</td><td width="50"></td><td><img src="/partcode/manu/00721408" alt="00721408" /></td></tr>
<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
<tr><td>Manu-Country</td><td></td><td>United States</td></tr>
<tr><td>Last Modified</td><td></td><td>26 Jan 2009,  8:08 PM</td></tr>
<tr><td>Last Modified By</td><td></td><td>
Manu
</td></tr>
</table>

<p>

</body></html>

Solution

Check out this article on 4GuysFromRolla:

http://www.4guysfromrolla.com/articles/011211-1.aspx

This is the article I used as my starting point with HTML Agility Pack and it's worked great. I'm confident that you'll get all the information you need from this article to perform the tasks you're trying to complete.
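
As a rough starting point for the page in the question, here is a minimal sketch. It is not taken from the linked article. It assumes the data always sits in the table with class="data", that each row has the label in the first cell and the value in the third, and that image rows carry their value in the alt attribute. The ParseFields helper and the " (src)" key suffix are made-up names for illustration. One thing worth checking in the original code: LoadHtml parses a string of markup, so passing it a URL parses the URL text itself; HtmlWeb.Load is the call that fetches a page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

namespace Stats
{
    static class PartParser
    {
        // Hypothetical helper: turns each row of <table class="data"> into a label/value pair.
        public static Dictionary<string, string> ParseFields(HtmlDocument doc)
        {
            var fields = new Dictionary<string, string>();

            // SelectNodes returns null (not an empty collection) when nothing matches.
            var rows = doc.DocumentNode.SelectNodes("//table[@class='data']//tr");
            if (rows == null) return fields;

            foreach (HtmlNode row in rows)
            {
                var cells = row.SelectNodes("td");
                if (cells == null || cells.Count < 3) continue;

                string label = cells[0].InnerText.Trim();
                HtmlNode valueCell = cells[2];

                // Part-Num and Manu-Number are rendered as images, so read alt and src.
                HtmlNode img = valueCell.SelectSingleNode(".//img");
                if (img != null)
                {
                    fields[label] = img.GetAttributeValue("alt", "");
                    fields[label + " (src)"] = img.GetAttributeValue("src", "");
                }
                else
                {
                    // Plain text cells: decode entities and trim surrounding whitespace.
                    fields[label] = HtmlEntity.DeEntitize(valueCell.InnerText).Trim();
                }
            }
            return fields;
        }

        static void Main(string[] args)
        {
            try
            {
                // HtmlWeb.Load downloads and parses the page at the given URL.
                HtmlDocument doc = new HtmlWeb().Load("http://localhost");
                foreach (var pair in ParseFields(doc))
                    Console.WriteLine(pair.Key + ": " + pair.Value);
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
    }
}
```

The resulting dictionary can be handed to whatever code writes the SQL row. Downloading the PNG itself would be a separate request against the src path.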
