Downloading files without direct address through C# program

Problem description

I am making a program in C# that needs to pull the addresses of selected files from a webpage and then download the files. The website in question is http://www.un.org/depts/dhl/resguide/r1.htm (and various similar pages). The problem is that the links do not point directly to the files. If you follow them in a browser, they redirect you first to a temporary page and then to the file itself. If I request a link's address directly from my program, rather than clicking the link on the webpage, I am not taken to the file but get an error page instead.

Any ideas on how I can reach the actual file address through my program?

Solution

Ok, so I was able to get it to work. I created a simple form that has a ListBox and a button on it. When the form loads, it goes to the http://www.un.org/depts/dhl/resguide/r1.htm page and pulls out all of the links. Then, when you select a link and click the button (assuming you selected one of the pdf links), it goes through the whole process of following the redirects, acquiring cookies, and then writing the file to disk. Here's the code:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;
using System.Net;
using System.IO;

namespace UNDocs
{
    public partial class Form1 : Form
    {
        private const string StartingPage = @"http://www.un.org/depts/dhl/resguide/r1.htm";
        private const string CookieOriginator = @"http://daccess-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234";

        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            //Load all of the links in the Listbox
            string html = GetHTML(StartingPage);

            MatchCollection matches = GetLinks(html);

            foreach (Match match in matches)
            {
                string value = match.Groups["link"].Value;

                listBox1.Items.Add(value);
            }
        }

        private void button1_Click(object sender, EventArgs e)
        {
            //Get the temporary page to redirect to
            string tempPage = GetURLBase(listBox1.SelectedItem.ToString()) +
                              GetPageToRedirectTo(listBox1.SelectedItem.ToString(), StartingPage);

            //Get the cookies to use
            CookieContainer cookies = GetCookies(CookieOriginator, tempPage);

            //This is the page with the link to the actual page
            string finalPage = GetPageToRedirectTo(tempPage);

            //Get the byte array representing the pdf file
            byte[] pdf = GetBytesFromHTTP(finalPage, cookies);

            //write the file to disk
            WriteFile(@"D:\temp.pdf", pdf);
        }

        public MatchCollection GetLinks(string s)
        {
            Regex regex = new Regex("href=\"(?<link>.*?)\"", RegexOptions.Multiline);
            return regex.Matches(s);
        }

        public string GetHTML(string url)
        {
            return GetHTML(url, "");
        }

        public string GetHTML(string url, string Referer)
        {
            return GetHTML(url, Referer, new CookieContainer());
        }
     
        public string GetHTML(string url, string Referer, CookieContainer cookies)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.Referer = Referer;
            myRequest.CookieContainer = cookies;

            string pageSource = "";

            using (HttpWebResponse response = (HttpWebResponse)myRequest.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    pageSource = reader.ReadToEnd();
                }
            }

            return pageSource;
        }

        public byte[] GetBytesFromHTTP(string url, CookieContainer cookies)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.CookieContainer = cookies;
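            // Let the framework send Accept-Encoding and decompress the response;
            // adding the header by hand could leave the saved bytes gzip-compressed.
            myRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;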

            byte[] result = null;
            byte[] buffer = new byte[4096];

            using (HttpWebResponse response = (HttpWebResponse)myRequest.GetResponse())
            {
                using (Stream responseStream = response.GetResponseStream())
                {
                    using (MemoryStream memoryStream = new MemoryStream())
                    {
                        int count = 0;

                        do
                        {
                            count = responseStream.Read(buffer, 0, buffer.Length);
                            memoryStream.Write(buffer, 0, count);
                        } while (count != 0);

                        result = memoryStream.ToArray();
                    }
                }
            }

            return result;
        }

        private string GetURLBase(string url)
        {
            Regex regex = new Regex("(?<base>http://.*?)/");
            return regex.Match(url).Groups["base"].Value;
        }

        private string GetPageToRedirectTo(string url)
        {
            return GetPageToRedirectTo(url, "");
        }

        private string GetPageToRedirectTo(string url, string Referer)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.Referer = Referer;

            string pageSource = "";

            using (HttpWebResponse response = (HttpWebResponse)myRequest.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    pageSource = reader.ReadToEnd();
                }
            }

            string urlToRedirectTo = "";

            //Get URL in Meta Tag
            Regex regex = new Regex("<META.*URL=(?<URL>.*)\"");
            urlToRedirectTo = regex.Match(pageSource).Groups["URL"].Value;

            return urlToRedirectTo;
        }

        private CookieContainer GetCookies(string url, string Referer)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.Referer = Referer;
            myRequest.CookieContainer = new CookieContainer();
            myRequest.GetResponse().Close();

            return myRequest.CookieContainer;
        }

        public void WriteFile(string FileName, byte[] fileContents)
        {
            // File.WriteAllBytes creates or overwrites the file and closes it when done.
            File.WriteAllBytes(FileName, fileContents);
        }
    }
}



(that was fun to figure out!)


That's a very strange webpage. I know that it is setting cookies, because after I go to it, my browser has cookies set. However, when I try to use InternetGetCookie(...), it doesn't return anything. And when I try using HttpWebRequest to get the cookies, the same thing happens.

And they are session cookies that are created dynamically, so you can't just reuse the values in the cookies that are passed to the browser.

It would appear that it uses a google-analytics script to write the cookies that you would need. So, I'm not sure how you could go about doing it...

[Update]
Ok, so I guess it's just using GA to do some tracking and those cookies are not necessary.

It is setting its own cookie, which I can track with HttpFox. But I can't understand why I get two different sets of HTML when following the link. If I follow the link in Firefox, the first thing that comes up is a blank page that just redirects. But in code, it comes up with a page that says "not authorised".

AHA!!!! I figured it out. You have to set the HttpWebRequest.Referer property to the URL of the referring page. That gives you the page with the redirect.
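A minimal sketch of that step, in case it helps: it only builds the request and sets the Referer, it does not send anything. The class and method names (`RefererSketch`, `CreateRequestWithReferer`) are made up for illustration and are not part of the program above.

```csharp
using System;
using System.Net;

public static class RefererSketch
{
    // Builds (but does not send) a request that claims to come from referringPage.
    // Hypothetical helper; the caller still has to call GetResponse() on the result.
    public static HttpWebRequest CreateRequestWithReferer(string url, string referringPage)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // Referer is a restricted header, so it is set through the property
        // rather than through request.Headers.Add(...).
        request.Referer = referringPage;

        return request;
    }
}
```

For the site in question, `referringPage` would be the page that contains the link, i.e. http://www.un.org/depts/dhl/resguide/r1.htm.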

The redirect page has two different links in it. One is the page to be redirected to; the other is the source of a frame that gets added to the page. That frame is what sets the cookie. Without that cookie, the page that you are being redirected to never loads.
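One way to carry that cookie from the frame request over to the final download is to hand the same CookieContainer to every request. The sketch below assumes that approach; `CookieSession` and its members are hypothetical names, and nothing here touches the network until you call GetResponse() on a request.

```csharp
using System;
using System.Net;

public sealed class CookieSession
{
    // One container shared by all requests, so a cookie set by one response
    // is sent automatically with the following requests to the same site.
    private readonly CookieContainer _cookies = new CookieContainer();

    public CookieContainer Cookies
    {
        get { return _cookies; }
    }

    // Builds (but does not send) a request bound to the shared container.
    public HttpWebRequest CreateRequest(string url, string referer)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Referer = referer;
        request.CookieContainer = _cookies;
        return request;
    }
}
```

The idea would be to request the frame's source first and then request the redirect target through the same `CookieSession`, so the second request already carries the cookie.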

I'm curious now if I can write something that will actually go all the way through...


You could download the temporary HTML page and parse it to get the redirect URL.
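A rough sketch of the parsing part, assuming the temporary page redirects with a `<meta http-equiv="refresh">` tag (which is what the regex in the code above looks for). `MetaRefreshParser` and `GetRedirectTarget` are hypothetical names, and the pattern is only a best-effort match for common markup, not a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaRefreshParser
{
    // Matches e.g. <META HTTP-EQUIV="refresh" CONTENT="1; URL=/tmp/123.html">
    private static readonly Regex MetaRefresh = new Regex(
        "<meta[^>]*http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*content\\s*=\\s*[\"'][^\"'>]*?url\\s*=\\s*'?(?<url>[^\"'>]+)",
        RegexOptions.IgnoreCase);

    // Returns the absolute redirect target, or null if no meta refresh is found.
    // pageUrl is the address the HTML was downloaded from; it is used to
    // resolve relative targets.
    public static string GetRedirectTarget(string html, string pageUrl)
    {
        Match match = MetaRefresh.Match(html);
        if (!match.Success)
        {
            return null;
        }

        string target = match.Groups["url"].Value.Trim();
        return new Uri(new Uri(pageUrl), target).AbsoluteUri;
    }
}
```

Resolving the target against the page's own address with `new Uri(baseUri, relative)` handles both relative and absolute targets, so it could stand in for gluing the host name onto the path by hand.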

