How do you Screen Scrape?


Question

When there is no web service API available, your only option might be to screen scrape, but how do you do it in C#?

How would you go about doing it?

Recommended answer

Matt and Paul's answers are correct. "Screen scraping" by parsing the HTML from a website is usually a bad idea because:

  1. Parsing HTML can be difficult, especially if it's malformed. If you're scraping a very, very simple page then regular expressions might work. Otherwise, use a parsing framework like the HTML Agility Pack.

  2. Websites are a moving target. You'll need to update your code each time the source website changes its markup structure.

  3. Screen scraping doesn't play well with JavaScript. If the target website uses any sort of dynamic script to manipulate the page, you're going to have a very hard time scraping it. It's easy to grab the HTTP response; it's a lot harder to scrape what the browser displays after running the client-side script contained in that response.
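For the "very, very simple page" case in point 1, a regular expression can be enough. The following is a minimal sketch, not a general HTML parser: it assumes a hypothetical page whose items are flat `<li class="item">...</li>` elements with no nested tags, and the names `SimplePageScraper` and `ExtractItems` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: only suitable for flat, predictable markup.
public static class SimplePageScraper
{
    // Assumed markup: <li class="item">text</li> with no nested tags inside.
    private static readonly Regex ItemPattern = new Regex(
        @"<li\s+class=""item""\s*>(?<text>[^<]*)</li>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> ExtractItems(string html)
    {
        var items = new List<string>();
        foreach (Match m in ItemPattern.Matches(html))
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            items.Add(WebUtility.HtmlDecode(m.Groups["text"].Value).Trim());
        }
        return items;
    }
}
```

As soon as the markup gains nested tags, reordered attributes, or unclosed elements, a pattern like this breaks, which is the point at which a real parser such as the HTML Agility Pack is the better choice.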

If screen scraping is the only option, here are some keys to success:

  1. Make it as easy as possible to change the patterns you look for. If possible, store the patterns as text files or in a resource file somewhere. Make it very easy for other developers (or yourself in 3 months) to understand what markup you expect to find.

  2. Validate input and throw meaningful exceptions. In your parsing code, take care to make your exceptions very helpful. The target site will change on you, and when that happens you want your error messages to tell you not only what part of the code failed, but why it failed. Mention both the pattern you're looking for AND the text you're comparing against.

  3. Write lots of automated tests. You want it to be very easy to run your scraper in a non-destructive fashion because you will be doing a lot of iterative development to get the patterns right. Automate as much testing as you can; it will pay off in the long run.

  4. Consider a browser automation tool like WatiN. If you require complex interactions with the target website, it might be easier to write your scraper from the point of view of the browser itself, rather than mucking with the HTTP requests and responses by hand.
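Points 1 and 2 can be sketched together. The example below is an illustration under stated assumptions rather than a recommended design: `ScrapeException`, `PriceScraper`, and the `<span id="price">` markup are all hypothetical, and the pattern is a constant here only to keep the sample self-contained (in practice it would be read from a text or resource file).

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical exception that reports both the pattern and the text it was applied to.
public class ScrapeException : Exception
{
    public string PatternName { get; }

    public ScrapeException(string patternName, string pattern, string input)
        : base($"Pattern '{patternName}' ({pattern}) did not match. Input began with: {Truncate(input, 200)}")
    {
        PatternName = patternName;
    }

    // Keeps error messages readable when the page is large.
    private static string Truncate(string s, int max) =>
        s.Length <= max ? s : s.Substring(0, max) + "...";
}

public static class PriceScraper
{
    // Assumed markup; in a real project this string would live in a text or resource file.
    public const string PricePattern = @"<span id=""price"">\$(?<amount>\d+\.\d{2})</span>";

    public static decimal ExtractPrice(string html, string pattern = PricePattern)
    {
        // Validate input before attempting to parse it.
        if (html == null) throw new ArgumentNullException(nameof(html));

        Match m = Regex.Match(html, pattern);
        if (!m.Success) throw new ScrapeException("price", pattern, html);

        return decimal.Parse(m.Groups["amount"].Value, CultureInfo.InvariantCulture);
    }
}
```

Because `ExtractPrice` takes the HTML as a string, it can be exercised in automated tests against saved copies of the page without touching the live site, which covers the non-destructive iteration described in point 3.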

As for how to screen scrape in C#, you can either use WatiN (see above) and scrape the resulting document using its DOM, or you can use the WebClient class [see MSDN or Google] to get at the raw HTTP response, including the HTML content, and then use some sort of text-based analysis to extract the data you want.
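The WebClient route might look roughly like the sketch below. It assumes the data of interest is the page's `<title>` element, chosen purely for illustration; `TitleScraper` and its method names are hypothetical. Note that recent .NET versions mark `WebClient` as obsolete in favor of `HttpClient`, so treat this as a sketch of the approach the answer describes rather than a recommendation for new code.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleScraper
{
    // Fetches the raw response body as a string using WebClient.
    public static string Download(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Text-based analysis, kept separate from the download so it can be tested offline.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(?<title>.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success)
            throw new FormatException("No <title> element found in the response.");

        // Decode entities such as &amp; and trim surrounding whitespace.
        return WebUtility.HtmlDecode(m.Groups["title"].Value).Trim();
    }
}
```

A call such as `TitleScraper.ExtractTitle(TitleScraper.Download(url))` ties the two steps together, where `url` is whatever page you are targeting.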
