Downloading Websites Using HttpWebRequest


Question


I am building a precache engine: one that requests over 100 pages on a remote server so that they are cached remotely.

Can I use the HttpWebRequest and WebResponse classes for this, or must I use the MSHTML objects to really load the HTML and request all of the images on the site?

string lcUrl = "http://www.cnn.com";

// *** Establish the request
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);

// *** Set properties
loHttp.Timeout = 10000; // 10 secs
loHttp.UserAgent = "Code Sample Web Client";

// *** Retrieve request info headers
HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();

Encoding enc = Encoding.GetEncoding(1252); // Windows default code page
StreamReader loResponseStream = new StreamReader(loWebResponse.GetResponseStream(), enc);

string lcHtml = loResponseStream.ReadToEnd();

loWebResponse.Close();
loResponseStream.Close();

Recommended answer

Hi Thomas,

As for the question of requesting and caching remote pages, I think
HttpWebRequest is capable of handling this. We can use HttpWebRequest to
send a request to a given URL and get its response stream, and then
store the response (HTML or any other MIME type) in whatever persistence
medium we want, for example the file system, memory, or a database.
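
To make that concrete, here is a minimal sketch of fetching a page and storing the raw response in a file. The class name `PageCacher` and its method names are made up for illustration and are not part of the original posts. On current .NET versions `WebRequest.Create` is flagged as obsolete in favor of `HttpClient`, but it still compiles and matches the API discussed in this thread.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not from the thread: saves a response body to disk.
public static class PageCacher
{
    // Copies any readable stream to a file, byte for byte.
    public static void SaveStream(Stream source, string path)
    {
        using (FileStream file = File.Create(path))
        {
            source.CopyTo(file);
        }
    }

    // Requests the URL and writes the raw response body to the given path.
    public static void FetchToFile(string url, string path)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 10000; // 10 secs, as in the question
        request.UserAgent = "Code Sample Web Client";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            SaveStream(body, path);
        }
    }
}
```

Writing the bytes unchanged avoids having to guess the page's text encoding, and the same code works for images and other non-HTML content.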

The MSHTML components, on the other hand, are a component library that helps to
programmatically process a web page's response as a document (DOM
structure), just like what we can do in a web browser. If we just want to
get the response result (the HTML output or file stream), HttpWebRequest
is enough and MSHTML is not necessary.

In addition, here are some tech articles on using HttpWebRequest to
request web resources:

#Accessing Web Sites Using Desktop Applications
http://www.devsource.ziffdavis.com/p...=119849,00.asp

#Crawl Web Sites and Catalog Info to Any Data Store with ADO.NET and Visual
Basic .NET
http://msdn.microsoft.com/msdnmag/is...0/spiderinnet/

Hope this helps. Thanks.

Regards,

Steven Cheng
Microsoft Online Support

Get Secure! www.microsoft.com/security
(This posting is provided "AS IS", with no warranties, and confers no
rights.)

Get Preview at ASP.NET whidbey
http://msdn.microsoft.com/asp.net/whidbey/default.aspx


Thanks Steven,

I need to make sure that I am remotely caching all of the HTML, including all
pictures, so I figured a simple WebRequest won't do. I am trying to get the
GetResponseStream() output into an HTMLDocument object to ensure that the
entire site loads, but:

StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);
string tmp = readStream.ReadLine();
HTMLDocument htmlDoc = new HTMLDocumentClass();
htmlDoc = (HTMLDocument)tmp; // ??? how do I get the response stream into/as an HTMLDocument?

Any ideas?

/// --------------- Full example

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.microsoft.com");

request.MaximumAutomaticRedirections = 4;
request.MaximumResponseHeadersLength = 4;

HttpWebResponse response = (HttpWebResponse)request.GetResponse();

Console.WriteLine("Content length is {0}", response.ContentLength);
Console.WriteLine("Content type is {0}", response.ContentType);

Stream receiveStream = response.GetResponseStream();
StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);

string tmp = readStream.ReadLine();

HTMLDocument htmlDoc = new HTMLDocumentClass();
htmlDoc = (HTMLDocument)tmp;

response.Close();
readStream.Close();
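
One possible way to reach the pictures without MSHTML is to pull the image addresses out of the downloaded HTML text and then request each one with HttpWebRequest as well. The sketch below covers only the extraction step. The class name `ImageLinkExtractor` is hypothetical, and a regular expression is a rough heuristic rather than a real HTML parser: it will miss images referenced from CSS or script.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the thread: finds <img src="..."> targets.
public static class ImageLinkExtractor
{
    // Captures the quoted value of the src attribute inside an <img> tag.
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns absolute URLs for every image found, resolving relative
    // paths against the address of the page the HTML came from.
    public static List<Uri> FindImageUrls(string html, Uri pageUri)
    {
        var result = new List<Uri>();
        foreach (Match match in ImgSrc.Matches(html))
        {
            Uri absolute;
            if (Uri.TryCreate(pageUri, match.Groups[1].Value, out absolute))
            {
                result.Add(absolute);
            }
        }
        return result;
    }
}
```

Each returned Uri can then be passed to WebRequest.Create in the same way as the page itself.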



It now appears that I cannot use HttpWebRequest, because I need to be able to
specify the Host header. The Host entry in HttpWebRequest.Headers is set by the
system to the current host information, and there is no way for me to modify it.

I need to retrieve web pages so that the remote server caches them. Any ideas?
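
For readers on later framework versions: .NET Framework 4.0 added a settable `HttpWebRequest.Host` property, which was not available when this thread was written. A minimal sketch, assuming .NET 4.0 or newer, with a made-up class name and placeholder addresses:

```csharp
using System;
using System.Net;

// Hypothetical helper, not from the thread: builds a request that connects
// to one address while sending a different Host header.
public static class HostHeaderSketch
{
    public static HttpWebRequest CreateRequest(string serverAddress, string hostHeader)
    {
        var request = (HttpWebRequest)WebRequest.Create(serverAddress);
        request.Host = hostHeader; // requires .NET Framework 4.0 or later
        request.Timeout = 10000;   // 10 secs, as in the question
        return request;
    }
}
```

A call such as `HostHeaderSketch.CreateRequest("http://192.0.2.10/index.html", "www.example.com")` would connect to the given IP address but ask for the `www.example.com` site; both values here are placeholders.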
