How to cancel large file download yet still get page source in C#?


Problem description

I'm working in C# on a program to list all course resources for a MOOC (e.g. Coursera). I don't want to download the content, just get a listing of all the resources (e.g. pdf, videos, text files, sample files, etc...) which are made available to the course.

My problem lies in parsing the html source (currently using HtmlAgilityPack) without downloading all the content.

For example, if you go to this intro video for a banking course on Coursera and check the source (F12 in Chrome for Developer Tools), you can see the page source. I can stop the video download which autoplays, but still see the source.

How can I get the source in C# without downloading all the content? I've looked at the HttpWebRequest headers (problem: time out), and at DownloadDataAsync with Cancel (problem: the Completed Result object is invalid when cancelling the async request). I've also tried various Load overloads from HtmlAgilityPack, but with no success.

Time out:

        HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create(url);
        postRequest.Timeout = TIMEOUT * 1000000; //Really long
        postRequest.Referer = "https://www.coursera.org"; 

        if (headers != null)
        {
            //headers here
        }

        //Deal with cookies
        if (cookie != null)
        { cookieJar.Add(cookie); }

        postRequest.CookieContainer = cookieJar;
        postRequest.Method = "GET";
        postRequest.AllowAutoRedirect = allowRedirect;
        postRequest.ServicePoint.Expect100Continue = true;
        HttpWebResponse postResponse = (HttpWebResponse)postRequest.GetResponse();

Any tips on how to proceed?

Solution

There are at least two ways to do what you're asking. The first is to use a range GET, that is, to specify the byte range of the file you want to read. You do that by calling AddRange on the HttpWebRequest. So if you want, say, the first 10 kilobytes of the file, you'd write:

request.AddRange(0, 10239);

Read carefully what the documentation says about the meaning of the parameters. The single-argument overload with a negative value, such as AddRange(-10240), sends the header Range: bytes=-10240, which HTTP defines as a suffix range (the last 10,240 bytes), so the two-argument overload is the safer way to ask for the start of the file. There are also other overloads of AddRange that you might be interested in.
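
As a minimal sketch of the range approach, the request below uses the two-argument AddRange overload, which names both ends of the range explicitly and so avoids any ambiguity about the sign of a single argument. The helper name CreateRangedRequest is hypothetical, and the URL is the one used elsewhere in this answer. Nothing is sent over the network here; the code only builds the request and prints the Range header it would send.

```csharp
using System;
using System.Net;

// Build a GET request that asks only for the first `byteCount` bytes.
// CreateRangedRequest is a hypothetical helper name, not part of .NET.
static HttpWebRequest CreateRangedRequest(string url, int byteCount)
{
    var req = (HttpWebRequest)WebRequest.Create(url);
    req.Method = "GET";
    // Byte ranges are inclusive, so the first N bytes are 0 .. N-1.
    req.AddRange(0, byteCount - 1);
    return req;
}

HttpWebRequest ranged = CreateRangedRequest("https://www.coursera.org/course/money", 10240);

// No network traffic happens until GetResponse is called, so the header
// can be inspected safely at this point.
Console.WriteLine(ranged.Headers["Range"]);
```

Calling ranged.GetResponse() afterwards would send the request; whether the server honors the range is a separate question.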

Not all servers support range gets, though. If that doesn't work, you'll have to do it another way.
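
One way to tell whether a server honored the range is to look at the status code of the response: 206 (Partial Content) means only the requested bytes are coming, while a plain 200 means the server ignored the Range header and is sending the whole body. The sketch below shows that check in isolation; DescribeRangeResult and its return strings are hypothetical names chosen for illustration.

```csharp
using System;
using System.Net;

// Describe how a server reacted to a Range request, based on the status code.
// DescribeRangeResult is a hypothetical helper name used for illustration.
static string DescribeRangeResult(HttpStatusCode status)
{
    switch (status)
    {
        case HttpStatusCode.PartialContent:                // 206: range honored
            return "partial";
        case HttpStatusCode.OK:                            // 200: range ignored, full body follows
            return "full";
        case HttpStatusCode.RequestedRangeNotSatisfiable:  // 416: range outside the resource
            return "unsatisfiable";
        default:
            return "other";
    }
}

Console.WriteLine(DescribeRangeResult(HttpStatusCode.PartialContent));
Console.WriteLine(DescribeRangeResult(HttpStatusCode.OK));
```

With HttpWebRequest, the status code is available as postResponse.StatusCode after GetResponse returns. Error statuses such as 416 surface as a WebException instead, in which case the status can be read from the exception's Response property. When the result is "full", the manual approach described next still applies.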

What you can do is call GetResponse and then start reading data. Once you've read as much data as you want, you can stop reading and close the stream. I've modified your sample slightly to show what I mean.

string url = "https://www.coursera.org/course/money";
HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create(url);
postRequest.Method = "GET";
postRequest.AllowAutoRedirect = true; //allowRedirect;
postRequest.ServicePoint.Expect100Continue = true;
HttpWebResponse postResponse = (HttpWebResponse)postRequest.GetResponse();
int maxBytes = 1024*1024;
int totalBytesRead = 0;
var buffer = new byte[maxBytes];
using (var s = postResponse.GetResponseStream())
{
    int bytesRead;
    // Read up to `maxBytes` bytes from the response. Each read appends after
    // the bytes already in the buffer and never asks for more than remains.
    while (totalBytesRead < maxBytes &&
           (bytesRead = s.Read(buffer, totalBytesRead, maxBytes - totalBytesRead)) != 0)
    {
        // Here you can copy the new bytes to a persistent buffer,
        // or write them to a file.
        Console.WriteLine("{0:N0} bytes read", bytesRead);
        totalBytesRead += bytesRead;
    }
}
Console.WriteLine("total bytes read = {0:N0}", totalBytesRead);

That said, I ran this sample and it downloaded about 6 kilobytes and stopped. I don't know why you're having trouble with timeouts or too much data.
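
The capped read can also be pulled out into a small helper so it can be tried without a network connection. The sketch below is an illustration under stated assumptions: ReadAtMost is a hypothetical helper name, a MemoryStream stands in for the response stream, and the page is assumed to be UTF-8 encoded.

```csharp
using System;
using System.IO;
using System.Text;

// Read at most maxBytes from a stream and return exactly the bytes read.
// ReadAtMost is a hypothetical helper name, not a framework method.
static byte[] ReadAtMost(Stream source, int maxBytes)
{
    var buffer = new byte[maxBytes];
    int total = 0;
    int read;
    // Stream.Read may return fewer bytes than requested, so loop until
    // the cap is reached or the stream ends (Read returns 0).
    while (total < maxBytes &&
           (read = source.Read(buffer, total, maxBytes - total)) != 0)
    {
        total += read;
    }
    Array.Resize(ref buffer, total);
    return buffer;
}

// Stand-in for a response stream: a page of which only the start is wanted.
byte[] page = Encoding.UTF8.GetBytes(
    "<html><head><title>Money</title></head><body>...</body></html>");
using (var stream = new MemoryStream(page))
{
    byte[] prefix = ReadAtMost(stream, 32);
    // Assumes UTF-8; a real page may declare another charset.
    string html = Encoding.UTF8.GetString(prefix);
    Console.WriteLine(html);
}
```

The resulting string could then be passed to HtmlAgilityPack's HtmlDocument.LoadHtml. Because the prefix may be cut in the middle of a tag or a multi-byte character, the last element of the parsed result should be treated as possibly incomplete.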

Note that sometimes trying to close the stream before the entire response is read will cause the program to hang. I'm not sure why that happens at all, and I can't explain why it only happens sometimes. But you can solve it by calling request.Abort before closing the stream. That is:

using (var s = postResponse.GetResponseStream())
{
    // do stuff here
    // abort the request before continuing
    postRequest.Abort();
}
