HtmlAgilityPack doc.LoadHtml can't get the whole HTML string
Problem description
Hello,
I'm trying to parse a web page.
When I download the HTML string using WebRequest, I don't get the complete HTML,
so I can't parse the content section of the page.
Can anybody help me?
    // Requires: using System.Net; and a reference to the HtmlAgilityPack package.
    private void get_cotents(string contents_url)
    {
        string title = "";
        string contents = "";

        // Download the raw HTML of the page.
        WebClient client = new WebClient();
        string html = client.DownloadString(contents_url);

        // Parse the downloaded HTML with HtmlAgilityPack.
        HtmlAgilityPack.HtmlDocument mydoc = new HtmlAgilityPack.HtmlDocument();
        mydoc.LoadHtml(html);
        string str = mydoc.DocumentNode.InnerHtml;

        // "//*[@id='...']" matches any element with the given id.
        var titleHeadline = mydoc.DocumentNode.SelectSingleNode("//*[@id='writeContents']");
        if (titleHeadline != null)
        {
            title = titleHeadline.InnerText;
            contents = "I can't find the html code that has content";
        }
    }
What I have tried:
I have tried getting the HTML string using both WebClient and HtmlWeb.
Recommended answer
I think your problem lies in getting the data stream. Here is an example adapted from a CodeProject article:
    /// <summary>
    /// Adapted from:
    /// http://www.codeproject.com/Articles/18034/HttpWebRequest-Response-in-a-Nutshell-Part
    /// </summary>
    /// <param name="contents_url">The URL string.</param>
    private static void get_cotents(string contents_url)
    {
        // Requires: using System; using System.IO; using System.Net;
        // The body is a 1024-byte zero-filled placeholder; a plain page
        // download would normally use GET with no body.
        byte[] buffer = new byte[1024];
        HttpWebRequest WebReq = (HttpWebRequest)WebRequest.Create(contents_url);
        WebReq.Method = "POST";
        WebReq.ContentType = "application/x-www-form-urlencoded";
        WebReq.ContentLength = buffer.Length;

        // Write the request body; the using block closes the stream afterwards.
        using (Stream PostData = WebReq.GetRequestStream())
        {
            PostData.Write(buffer, 0, buffer.Length);
        }

        // Get the response handle; the body has not been read yet.
        using (HttpWebResponse WebResp = (HttpWebResponse)WebReq.GetResponse())
        using (Stream datastream = WebResp.GetResponseStream())
        using (StreamReader answer = new StreamReader(datastream))
        {
            // Show some information about the response.
            Console.WriteLine(WebResp.StatusCode);
            Console.WriteLine(WebResp.Server);

            // Read the response body as a string and output it.
            Console.WriteLine(answer.ReadToEnd());
        }
    }
I think you can finish the rest of the code yourself ...
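One possible way to finish the rest is sketched below: download the page with a plain GET request and pass the resulting string to HtmlAgilityPack. This is only a sketch under assumptions. The names `PageLoader`, `DownloadHtml` and `LoadDocument` are hypothetical, the User-Agent value is illustrative, the page is assumed to be UTF-8 encoded, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper class; not part of the original answer.
public static class PageLoader
{
    // Downloads the raw HTML of a page with a plain GET request.
    public static string DownloadHtml(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        // Assumption: a browser-like User-Agent; the value is only illustrative.
        request.UserAgent = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

        // Assumption: the page is UTF-8 encoded; adjust the encoding if the
        // site uses another charset.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Parses an HTML string into an HtmlAgilityPack document.
    public static HtmlDocument LoadDocument(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }
}
```

With these helpers, `PageLoader.LoadDocument(PageLoader.DownloadHtml(contents_url))` would give a document whose `DocumentNode` can be queried with `SelectSingleNode`.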
The problem was in how I searched for the content div's id.
It seems the website hides the content area's id.
I solved it with the following XPath:
HtmlNode node = mydoc.DocumentNode.SelectSingleNode("//@id[.='sub_wkb_layout']");
Thank you all, and thanks to CodeProject.
I love this site :)
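For comparison, a more conventional XPath for this kind of lookup targets the element itself, as in `//*[@id='sub_wkb_layout']`. The sketch below wraps that lookup with a null check, since `SelectSingleNode` returns null when nothing matches. `ContentExtractor` and `ExtractById` are hypothetical names, and the code assumes the HtmlAgilityPack package is referenced and that the id contains no single quotes.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper class; not part of the original thread.
public static class ContentExtractor
{
    // Returns the inner text of the first element whose id equals `id`,
    // or null if no such element exists.
    // Assumption: `id` contains no single quotes, because it is spliced
    // into the XPath expression as-is.
    public static string ExtractById(string html, string id)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//*[@id='...']" matches any element with the given id attribute.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='" + id + "']");
        return node == null ? null : node.InnerText;
    }
}
```

A call such as `ContentExtractor.ExtractById(html, "sub_wkb_layout")` would return the text of the content area, or null if the downloaded HTML does not contain that id.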