How do you Screen Scrape?

Question

When there is no web service API available, your only option might be to screen scrape, but how do you do it in C#?

How do you think it should be done?

Recommended answer

Matt and Paul's answers are correct. "Screen scraping" by parsing the HTML from a website is usually a bad idea because:

  1. Parsing HTML can be difficult, especially if it's malformed. If you're scraping a very, very simple page, regular expressions might work. Otherwise, use a parsing framework like the HTML Agility Pack.

  2. Websites are a moving target. You'll need to update your code each time the source website changes its markup structure.

  3. Screen scraping doesn't play well with JavaScript. If the target website uses any sort of dynamic script to manipulate the page, you're going to have a very hard time scraping it. It's easy to grab the HTTP response; it's a lot harder to scrape what the browser displays after running the client-side script contained in that response.

If screen scraping is the only option, here are some keys to success:

  1. Make it as easy as possible to change the patterns you look for. If possible, store the patterns in text files or in a resource file. Make it very easy for other developers (or yourself in 3 months) to understand what markup you expect to find.

  2. Validate input and throw meaningful exceptions. In your parsing code, take care to make your exceptions very helpful. The target site will change on you, and when that happens you want your error messages to tell you not only what part of the code failed, but why it failed. Mention both the pattern you're looking for and the text you're comparing against.

  3. Write lots of automated tests. You want it to be very easy to run your scraper in a non-destructive fashion, because you will be doing a lot of iterative development to get the patterns right. Automate as much testing as you can; it will pay off in the long run.

  4. Consider a browser automation tool like Watin. If you require complex interactions with the target website, it might be easier to write your scraper from the point of view of the browser itself, rather than mucking with the HTTP requests and responses by hand.
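
As a rough illustration of the first two points, here is a minimal C# sketch. Everything in it is hypothetical and not from the answer: the `ScrapeException` and `PriceScraper` names, the price markup, and the pattern itself. The pattern is passed in as a string so it could just as well be loaded from a text or resource file, and the exception message reports both the pattern and the text it was applied to.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical exception type: thrown when the page no longer matches the expected markup.
public class ScrapeException : Exception
{
    public ScrapeException(string message) : base(message) { }
}

public static class PriceScraper
{
    // Hypothetical pattern for illustration. In practice, load it from a text or
    // resource file so it can be changed without touching the parsing code.
    public const string DefaultPattern =
        @"<span class=""price"">\s*\$(?<amount>\d+\.\d{2})\s*</span>";

    public static decimal ExtractPrice(string html, string pattern)
    {
        // Validate input before attempting to parse it.
        if (string.IsNullOrWhiteSpace(html))
            throw new ArgumentException("No HTML was supplied to ExtractPrice.", nameof(html));

        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        if (!match.Success)
        {
            // Say why it failed: include the pattern and a prefix of the text searched.
            string sample = html.Length > 200 ? html.Substring(0, 200) + "..." : html;
            throw new ScrapeException(
                $"Price not found. Pattern: '{pattern}'. Text searched: '{sample}'");
        }

        return decimal.Parse(match.Groups["amount"].Value, CultureInfo.InvariantCulture);
    }
}

public static class Program
{
    public static void Main()
    {
        string html = "<div><span class=\"price\"> $19.99 </span></div>";
        Console.WriteLine(PriceScraper.ExtractPrice(html, PriceScraper.DefaultPattern));
    }
}
```

Because `ExtractPrice` takes the HTML as a plain string, it can be exercised in automated tests against saved copies of the page without ever touching the live site.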

As for how to screen scrape in C#, you can either use Watin (see above) and scrape the resulting document using its DOM, or you can use the WebClient class [see MSDN or Google] to get at the raw HTTP response, including the HTML content, and then use some sort of text-based analysis to extract the data you want.
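
The WebClient route might look something like the sketch below. The `TitleScraper` class, the choice of the `<title>` element as the target, and the `https://example.com/` URL are all assumptions made for illustration. Note also that newer .NET versions mark `WebClient` as obsolete in favour of `HttpClient`, so treat this as a starting point rather than a recommendation.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleScraper
{
    // Downloads the raw HTML of a page as a string using WebClient.
    public static string FetchHtml(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Text-based extraction: returns the decoded contents of the first
    // <title> element, or null if the page has none.
    public static string ExtractTitle(string html)
    {
        Match match = Regex.Match(
            html,
            @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return match.Success
            ? WebUtility.HtmlDecode(match.Groups[1].Value.Trim())
            : null;
    }
}

public static class Program
{
    public static void Main()
    {
        // Placeholder URL; substitute the page you actually need to scrape.
        string html = TitleScraper.FetchHtml("https://example.com/");
        Console.WriteLine(TitleScraper.ExtractTitle(html) ?? "(no title found)");
    }
}
```

Keeping the download and the extraction in separate methods means the extraction can be tested against fixed HTML strings, with the network call made only in real runs.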