How to parse through website HTML content

Problem description

I am trying to parse the HTML of a website, say CNN.com, but every time I navigate with a WebBrowser object, I get a bunch of null values for my object. Whenever I call the Navigate method, mywebBrowser contains null and blank values. How do I get tagCollection to populate? I also tried webClient.DownloadString just to get all the content of the HTML page, but I can't rely on that alone because I need to find all the tags, and doing it manually is very messy. I can NOT use the HTML Agility Pack.

        using (WebClient webClient = new WebClient())
        {
            webClient.Encoding = Encoding.UTF8;
            HtmlString = webClient.DownloadString(textBox1.Text);
        }

        WebBrowser mywebBrowser = new WebBrowser();
        Uri address = new Uri("http://www.cnn.com/");
        mywebBrowser.Navigate(address);

        // HtmlString does contain all the HTML from the page
        mywebBrowser.DocumentText = HtmlString;
        // DocumentText only contains "<HTML></HTML>" after the assignment

        HtmlDocument doc = mywebBrowser.Document;
        HtmlElementCollection tagCollection;
        tagCollection = doc.GetElementsByTagName("<div");

Solution

The WebBrowser class lets you do many things without relying on any external library. What you are missing is the DocumentCompleted event, which is part of the basic WebBrowser API: until it fires, the page is not completely loaded, so the corresponding information is faulty (or null). Also bear in mind that GetElementsByTagName takes just the name of the tag (without "<"). Sample code to show this:

 WebBrowser mywebBrowser;

 private void Form1_Load(object sender, EventArgs e)
 {
     mywebBrowser = new WebBrowser();
     // Subscribe before navigating so the completion event is not missed
     mywebBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(mywebBrowser_DocumentCompleted);

     Uri address = new Uri("http://www.cnn.com/");
     mywebBrowser.Navigate(address);
 }

 private void mywebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
 {
     // The page is not completely loaded until this event fires
     HtmlDocument doc = mywebBrowser.Document;
     HtmlElementCollection tagCollection;
     // Pass the bare tag name, without "<"
     tagCollection = doc.GetElementsByTagName("div");
 }
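
The DocumentText assignment from the question behaves the same way: the control loads the assigned markup asynchronously, so reading Document right after the assignment shows an empty page. The following is a minimal sketch of waiting for DocumentCompleted in that case and then walking the resulting collection. It assumes a console project that references System.Windows.Forms; the DivLister class name and the inline html string are hypothetical stand-ins (in the question, the markup would come from webClient.DownloadString).

```csharp
using System;
using System.Windows.Forms;

static class DivLister
{
    [STAThread] // WebBrowser wraps an ActiveX control and needs an STA thread
    static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        const string html =
            "<html><body><div id='a'>first</div><div id='b'>second</div></body></html>";

        using (var browser = new WebBrowser())
        {
            browser.ScriptErrorsSuppressed = true;

            browser.DocumentCompleted += (sender, e) =>
            {
                // Only here is the document guaranteed to be fully loaded.
                HtmlElementCollection divs =
                    browser.Document.GetElementsByTagName("div");

                foreach (HtmlElement div in divs)
                {
                    Console.WriteLine("{0}: {1}", div.Id, div.InnerText);
                }

                // Stop the message loop started by Application.Run below.
                Application.ExitThread();
            };

            // Attach the handler first, then assign the markup; the load
            // completes later, while the message loop is running.
            browser.DocumentText = html;

            // The control needs a running message loop to raise its events.
            Application.Run();
        }
    }
}
```

Inside a regular Windows Forms application the message loop and STA thread already exist, so only the handler and the DocumentText assignment would be needed.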
