HtmlAgilityPack doc.LoadHtml can't get the whole HTML string


Problem description

Hello,

I'm trying to parse the page below:

The webpage I'm trying to parse [^]

When I download the HTML string using a web request, I don't get the whole HTML string, so I can't parse the content part of the page.

Can anybody help me?

// Requires: using System.Net; and a reference to HtmlAgilityPack.
private void get_cotents(string contents_url)
{
    string title = "";
    string contents = "";

    WebClient client = new WebClient();
    // DownloadString returns the HTML text of the page at contents_url.
    string sourceUrl = client.DownloadString(contents_url);
    HtmlAgilityPack.HtmlDocument mydoc = new HtmlAgilityPack.HtmlDocument();
    mydoc.LoadHtml(sourceUrl);

    string str = mydoc.DocumentNode.InnerHtml;

    // XPath needs a node test before the predicate: "//*[@id=...]", not "//[@id=...]".
    var titleHeadline = mydoc.DocumentNode.SelectSingleNode("//*[@id='writeContents']");
    if (titleHeadline != null)
    {
        title = titleHeadline.InnerText;
    }

    contents = "I can't find the html code that has content";
}





What I have tried:

I have tried getting the HTML string using both WebClient and HtmlWeb.

Recommended answer

I think your problem lies in getting the data stream. Here is an example adapted from a CodeProject article:
/// <summary>
/// Adapted from http://www.codeproject.com/Articles/18034/HttpWebRequest-Response-in-a-Nutshell-Part
/// Requires: using System; using System.IO; using System.Net;
/// </summary>
/// <param name="contents_url">The URL string.</param>
private static void get_cotents(string contents_url)
{
    byte[] buffer = new byte[1024];
    HttpWebRequest WebReq = (HttpWebRequest)WebRequest.Create(contents_url);
    WebReq.Method = "POST";
    WebReq.ContentType = "application/x-www-form-urlencoded";
    WebReq.ContentLength = buffer.Length;

    // Write the request body; the using block closes the stream afterwards.
    using (Stream PostData = WebReq.GetRequestStream())
    {
        PostData.Write(buffer, 0, buffer.Length);
    }

    // Get the response handle; the body has not been read yet.
    using (HttpWebResponse WebResp = (HttpWebResponse)WebReq.GetResponse())
    {
        // Show some information about the response.
        Console.WriteLine(WebResp.StatusCode);
        Console.WriteLine(WebResp.Server);

        // Read the response body as a string and output it.
        using (Stream datastream = WebResp.GetResponseStream())
        using (StreamReader answer = new StreamReader(datastream))
        {
            Console.WriteLine(answer.ReadToEnd());
        }
    }
}
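
If the page only needs to be read rather than posted to, a plain GET request may be enough. The following is a minimal sketch, assuming .NET's HttpClient is available; PageFetcher and FetchHtmlAsync are hypothetical names and are not part of the original answer.

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    // Hypothetical helper: downloads the raw HTML of a page with a GET request.
    // The caller supplies the HttpClient so one instance can be reused.
    public static async Task<string> FetchHtmlAsync(HttpClient client, string url)
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            // Throws HttpRequestException on a non-success status code.
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

The returned string can then be passed to mydoc.LoadHtml in the same way as the result of DownloadString.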





I think you can finish the rest of the code yourself.


The problem was in searching for the content div's id.
It seems the website hides the content area's id.
I solved it using XPath, as shown below:

HtmlNode node = mydoc.DocumentNode.SelectSingleNode("//@id[.='sub_wkb_layout']");

Thank you all, and thanks to CodeProject.
I love this site :)
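
For reference, a more common way to look up an element by its id in XPath is the `//*[@id='...']` form, combined with a null check because SelectSingleNode returns null when nothing matches. The sketch below assumes the HtmlAgilityPack package is referenced; ContentFinder and FindInnerTextById are hypothetical names, and the sample markup in the usage note is made up, since the real page's structure is not shown here.

```csharp
using HtmlAgilityPack;

public static class ContentFinder
{
    // Hypothetical helper: returns the inner text of the first element whose
    // id attribute equals elementId, or null when no such element exists.
    public static string FindInnerTextById(string html, string elementId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//*[@id='...']" matches any element carrying that id.
        // Assumption: elementId contains no single quote.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='" + elementId + "']");

        // SelectSingleNode returns null when nothing matches.
        return node == null ? null : node.InnerText;
    }
}
```

For example, calling `ContentFinder.FindInnerTextById("<div id='sub_wkb_layout'><p>Sample content</p></div>", "sub_wkb_layout")` would look up the div and return the text inside it, while an id that does not occur in the markup yields null instead of throwing.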

