Using C#, how do I "catch" a document from an aspx.net page with an obscured, secure source?


Question


I have written a program to automate the collection of documents downloaded from a public website. My program assembles the parameters the site requires, then submits the POST. The site responds to the POST request with HTML, which I capture with GetResponseStream. What I am missing is the piece that also captures the document the ASPX.Net page transmits for download. The document location and URI on the source website will always be obscured for security reasons. What do I need to do to "catch" a downloaded XML document from a website?
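For reference, the POST-and-capture step described above looks roughly like this. This is a minimal sketch, not the asker's actual code: the URL is a placeholder, and `BuildFormData` is a hypothetical helper for assembling the form body.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PostExample
{
    // URL-encode name/value pairs into an application/x-www-form-urlencoded body.
    public static string BuildFormData(params string[] pairs)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < pairs.Length; i += 2)
        {
            if (sb.Length > 0) sb.Append('&');
            sb.Append(Uri.EscapeDataString(pairs[i]))
              .Append('=')
              .Append(Uri.EscapeDataString(pairs[i + 1]));
        }
        return sb.ToString();
    }

    // Submit the POST and return the HTML response body.
    public static string PostForm(string url, string formData, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;   // keeps the ASP.NET session cookie across requests

        byte[] body = Encoding.UTF8.GetBytes(formData);
        request.ContentLength = body.Length;
        using (var stream = request.GetRequestStream())
            stream.Write(body, 0, body.Length);

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }
}
```

The shared `CookieContainer` matters: each call made with a fresh request and no cookie container starts a new ASP.NET session, which is one common reason a programmatic POST behaves differently from the browser's.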

Here is some additional information to help clarify the issue:
1) The ASP.NET website serves the requested document from a secured location, so I cannot simply "pull" the document from the website.
2) Using a browser and driving the download request manually, I receive two responses to the POST: one with the refreshed web page and one with the document to be downloaded.
3) Using my program (with HTML Agility Pack), I POST the same information to the website, but only receive the response with the refreshed web page.

I suspect this website is not sending the download response because it expects some sort of validation information that my program is not sending. Has anyone run into this? What did you do to resolve it?
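If the missing "validation info" is ASP.NET's own hidden state, the usual suspects are the `__VIEWSTATE`, `__VIEWSTATEGENERATOR`, and `__EVENTVALIDATION` hidden inputs: an ASP.NET page commonly rejects a POST that does not echo them back from the page it just served. A hedged sketch of scraping them with HTML Agility Pack (which the question already uses) so they can be included in the next POST:

```csharp
using System;
using HtmlAgilityPack;

static class HiddenFields
{
    // Pull the value of a named hidden <input> out of the previously
    // fetched page, so it can be echoed back in the next POST.
    public static string GetHiddenValue(string html, string name)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var node = doc.DocumentNode.SelectSingleNode(
            "//input[@name='" + name + "']");
        return node == null ? "" : node.GetAttributeValue("value", "");
    }
}

// Usage (these are the standard ASP.NET field names; a given page may add its own):
//   string vs = HiddenFields.GetHiddenValue(pageHtml, "__VIEWSTATE");
//   string ev = HiddenFields.GetHiddenValue(pageHtml, "__EVENTVALIDATION");
```

Whether these fields are actually what the site checks is an assumption; comparing the browser's POST body against the program's (see the answer below on HTTP spying) is the way to confirm it.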

Answer

It's likely that what you suspect is right. Or it could be something similar.

If you can pull a document into your Web browser, you can always do it in your code. The only problem is that probably no universal approach can guarantee the result in all cases. In particular, you are right that you cannot always do it directly. For example, the ASP.NET application that normally fetches the document can have a special account with the document site, and the details of the request (say, authentication) can be completely hidden in the ASP.NET code-behind, to which you have no access. But the browser ultimately gets the document indirectly, by sending the appropriate requests to the ASP.NET (or some other) site. All you need to do is mimic 100% of the HTTP requests that go between the browser and the rest of the Internet.
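In practice, mimicking the browser for this kind of two-response download means reusing one session (cookies) across the whole exchange and recognizing which response carries the file rather than a refreshed page. A sketch under that assumption; the `Content-Disposition` and XML content-type checks are heuristics, not something the original site is known to use:

```csharp
using System;
using System.IO;
using System.Net;

static class DownloadCatcher
{
    // Heuristic: a response carries a file (not a refreshed page) when the
    // server marks it as an attachment or serves an XML content type.
    public static bool IsFileResponse(string contentDisposition, string contentType)
    {
        if (!string.IsNullOrEmpty(contentDisposition) &&
            contentDisposition.IndexOf("attachment", StringComparison.OrdinalIgnoreCase) >= 0)
            return true;
        return contentType != null &&
               contentType.StartsWith("text/xml", StringComparison.OrdinalIgnoreCase);
    }

    // Fetch a URL with the existing session cookies and save the body to disk.
    public static void SaveResponse(string url, CookieContainer cookies, string path)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;          // same session as the earlier POST
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var input = response.GetResponseStream())
        using (var output = File.Create(path))
            input.CopyTo(output);
    }
}
```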

So, theoretically speaking, you can study each and every script involved on the client side and mimic all of them. The problem is that some scripts appear only after you perform certain intermediate HTTP requests, so you won't see them if you simply do "View Page Source" on any of the static web pages. In other words, the investigation is always possible, but it can be quite difficult.

I happened to solve just a couple of such problems successfully (and failed to solve some others). Here is what helped me a lot: I installed an HTTP spy program. The major Web browsers usually offer plug-ins that do this work; I, for example, used HttpFox, a plug-in available for Mozilla. You then operate the Web browser manually, collect all the HTTP request/response data the plug-in reports, and try to trace how the pieces connect. I cannot guarantee your success, but this approach has proven very helpful.
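Once the spy log shows which headers the browser sent and the program did not, the usual fix is simply to copy them onto the request. `User-Agent` and `Referer` are common culprits; the values below are illustrative only, to be replaced with whatever the capture actually shows:

```csharp
using System.Net;

static class BrowserHeaders
{
    // Copy onto the request the headers the HTTP spy showed the browser sending.
    // The values here are examples; use the ones from your own capture.
    public static void ApplyTo(HttpWebRequest request, string referer)
    {
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0";
        request.Referer = referer;
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.Headers["Accept-Language"] = "en-US,en;q=0.5";
    }
}
```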

And please make sure you use the content you obtain by this Web scraping fairly and legitimately.

—SA

