Scraping data from a secure website or automating mundane task


Problem description

I have a website where I need to log in with a username and password and a captcha.

Once logged in, I have a control panel that lists bookings. For each booking there is a link to a details page that has the email address of the person who made the booking.

Each day I need a list of all these email addresses so that I can send them an email.

I know how to scrape sites in .NET to get these kinds of details, but not for websites where I need to be logged in.

I've seen an article saying I can pass the cookie as a header and that should do the trick, but that would require me to view the cookie in Firebug and copy and paste it over.

This would be used by a non-technical person, so that's not really the best option.

The other thing I was thinking of is a script they can run that automates this in the browser. Any tips on how to do this?

Recommended answer

No matter whether you query the web through HtmlAgilityPack or use the HttpWebRequest class directly (HtmlAgilityPack uses it under the hood), there is one thing you should know: how to handle cookies.

Here are the basic steps you should follow (sketched in code right after the list):

  • Load the page where you want to log in.
  • Submit the information required to log in via POST (username, password, or whatever else the page requests).
  • Save the cookies from the response and use them from then on.
  • Request the target page with those cookies and parse it with HtmlAgilityPack.
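
A minimal sketch of the first three steps in C#, assuming a plain form-based login; the URL and the form field names (username, password) are placeholders, and the captcha mentioned in the question is not handled here:

using System;
using System.IO;
using System.Net;
using System.Text;

class LoginSketch
{
    static void Main()
    {
        // One container shared by every request, so the session cookie
        // set by the login response travels with all later requests.
        var cookies = new CookieContainer();

        // POST the login form; URL and field names are hypothetical.
        var loginRequest = (HttpWebRequest)WebRequest.Create("https://example.com/login");
        loginRequest.Method = "POST";
        loginRequest.ContentType = "application/x-www-form-urlencoded";
        loginRequest.CookieContainer = cookies;

        byte[] body = Encoding.UTF8.GetBytes("username=me&password=secret");
        using (Stream requestStream = loginRequest.GetRequestStream())
        {
            requestStream.Write(body, 0, body.Length);
        }

        // Getting the response makes the framework store the returned
        // cookies in the container automatically.
        using (var loginResponse = (HttpWebResponse)loginRequest.GetResponse())
        {
            Console.WriteLine("Login response: {0}", loginResponse.StatusCode);
        }

        // From here on, attach 'cookies' to every further HttpWebRequest.
    }
}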

Here's something I always do when using HtmlAgilityPack: send the request to the website with HttpWebRequest instead of using the Load(..) method of the HtmlWeb class.

Take into account that one of the overloads of the Load method in the HtmlDocument class receives a Stream. All you have to do is pass the response stream (obtained from request.GetResponseStream()) and you will have the HtmlDocument object you need.
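
For example, a small helper (the class and method names below are just illustrative) that fetches a page with the session cookies captured at login and parses it:

using System.IO;
using System.Net;
using HtmlAgilityPack;

static class PageLoader
{
    // Fetches 'url' sending the cookies captured at login and parses
    // the response body with HtmlAgilityPack.
    public static HtmlDocument Load(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;

        var doc = new HtmlDocument();
        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            doc.Load(stream);   // the Stream overload mentioned above
        }
        return doc;
    }
}

Calling PageLoader.Load("https://example.com/bookings", cookies) would then give you an HtmlDocument you can query with SelectNodes as usual.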

I suggest you install Fiddler. It is a really great tool for inspecting HTTP requests and responses, either from your browser or from your application.

Run Fiddler, try to log on to the site through your browser, and look at what the browser sends to the page and what the page returns; that is exactly what you need to emulate with the HttpWebRequest class.

The idea isn't just to pass a static cookie in the header. It must be the cookie returned by the page after you have logged in.

To handle cookies, take a look at the HttpWebRequest.CookieContainer property. It's easier than you think. All you need to do is declare an empty CookieContainer and assign it to that property before sending any request to the website. When the website responds, the cookies are added to that container automatically, so you can use them the next time you request the website.
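
As a quick sanity check you can look inside the container after the login response to confirm the session cookie was actually captured (the URI below is a placeholder):

// 'cookies' is the CookieContainer from the login sketch above.
foreach (System.Net.Cookie cookie in cookies.GetCookies(new System.Uri("https://example.com/")))
{
    System.Console.WriteLine("{0} = {1}", cookie.Name, cookie.Value);
}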

Edit 2:

If all you need is a script that automates this through the browser, take a look at the WatiN library. I'm sure you will be able to run it by yourself after you see one or two examples of how to use it ;-)
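
A rough WatiN sketch of that route, assuming the same placeholder URL and field names as above; because WatiN drives a real Internet Explorer window, the non-technical user could even type the captcha by hand before the script continues:

using System;
using WatiN.Core;

class BookingBot
{
    [STAThread]   // WatiN needs a single-threaded apartment
    static void Main()
    {
        using (var browser = new IE("https://example.com/login"))
        {
            browser.TextField(Find.ByName("username")).TypeText("me");
            browser.TextField(Find.ByName("password")).TypeText("secret");

            // Let the person running the script solve the captcha in the
            // open IE window, then continue.
            Console.WriteLine("Solve the captcha in the browser, then press Enter...");
            Console.ReadLine();

            browser.Button(Find.ByName("login")).Click();
            browser.GoTo("https://example.com/bookings");

            Console.WriteLine(browser.Html);   // raw HTML of the bookings page
        }
    }
}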
