C# WebClient - View source question


Question

I'm using a C# WebClient to post login details to a page and read all the results.

The page I am trying to load includes flash (which, in the browser, translates into HTML). I'm guessing it's flash to avoid being picked up by search engines???

The flash I am interested in is just text (not an image/video etc.), and when I "View Selection Source" in Firefox I do actually see the text, within HTML, that I want to see.

(Interestingly when I view the source for the whole page I do not see the text, within HTML, that I want to see. Could this be related?)

Currently, after I have posted my login details and loaded the HTML back, I see the page, which does NOT show the flash HTML (as if I had viewed the source for the whole page).

Thanks in advance,

Jim

PS: I should point out that the POST is actually working; my login is successful.

Answer

Fiddler (or a similar tool) is invaluable for tracking down screen-scraping problems like this. Using a normal browser with Fiddler active, look at all the requests being made as you go through the login and navigation process to get to the data you want. Somewhere in there you will likely see one or more things that your code is doing differently, which the server responds to by showing you different HTML than it shows a real client.

The list of stuff below (think of it as "scraping 101") is what you want to look for. Most of the stuff below is probably stuff you're already doing, but I included everything for completeness.

In order to scrape effectively, you may need to deal with one or more of the following:

  1. Cookies and/or hidden fields. When you show up at any page on a site, you'll typically get a session cookie and/or hidden form field which (in a normal browser) would be propagated back to the server on all subsequent requests. You will likely also get a persistent cookie. On many sites, if a request shows up without the proper cookie (or form field, for sites using "cookieless sessions"), the site will redirect the user to a "no cookies" UI, a login page, or another undesirable location (from the scraper app's perspective). Always make sure you capture the cookies set on the initial request and faithfully send them back to the server on subsequent requests, except if one of those subsequent requests changes a cookie (in which case propagate the new cookie instead). A sketch tying items 1-5 together follows this list.
  2. Authentication tokens. A special case of the above is forms-authentication cookies or hidden fields. Make sure you're capturing the login token (usually a cookie) and sending it back.
  3. POST vs. GET. This is obvious, but make sure you're using the same HTTP method that a real browser does.
  4. Form fields (especially hidden ones!). I'm sure you're doing this already, but make sure to send all the form fields that a real browser does, not just the visible fields. Make sure the fields are encoded properly (URL-encoded for a standard form post).
  5. HTTP headers. You already checked this, but it may make sense to check again just to make sure the (non-cookie) headers are identical. I always start with the exact same headers and then pull headers out one by one, keeping only the ones whose removal causes the request to fail or return bogus data. This approach simplifies your scraping code.
  6. Redirects. These can come either from the server or from client script (e.g. "if the user doesn't have the flash plug-in loaded, redirect to a non-flash page"). See WebRequest: How to find a postal code using a WebRequest against this ContentType="application/xhtml+xml, text/xml, text/html; charset=utf-8"? for a crazy example of how redirection can trip up a screen-scraper. Note that if you're using .NET for scraping, you'll need to use HttpWebRequest (not WebClient) for redirect-dependent scraping, because by default WebClient doesn't provide a way for your code to attach cookies and headers to the second (post-redirect) request. See the thread above for more details, and the second sketch after this list.
  7. Sub-requests (frames, AJAX, flash, etc.). Often, page elements (not the main HTTP request) will end up fetching the data you want to scrape. You'll be able to figure this out by looking at which HTTP response contains the text you want, and then working backwards until you find what on the page is actually making the request for that content. A few sites do really crazy things in sub-requests, like requesting compressed or encrypted text via AJAX and then using client-side script to decrypt it. If this is the case, you'll need to do a bit more work, like reverse-engineering what the client script is doing.
  8. Ordering. This one is obvious: make HTTP requests in the same order that a browser client does. That doesn't mean you need to make every request (e.g. images); typically you only need to make the requests which return a text/html content type, unless the data you want is not in the HTML and comes from an ajax/flash/etc. request.
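
To make items 1-5 concrete, here is a minimal sketch of the login POST using HttpWebRequest with a shared CookieContainer. The URL, form-field names (username, password, __VIEWSTATE) and header values are placeholders, not the real site's; substitute whatever Fiddler shows for the actual login request.

    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    class LoginSketch
    {
        static void Main()
        {
            // Placeholder URL -- use the real one captured in Fiddler.
            const string loginUrl = "https://example.com/login";

            // One CookieContainer shared by every request, so the session and
            // persistent cookies set by the server are sent back automatically (items 1-2).
            var cookies = new CookieContainer();

            // Send every field a real browser sends, hidden fields included,
            // URL-encoded as in a normal form post (item 4). Field names are placeholders.
            string body =
                "username=" + Uri.EscapeDataString("jim") +
                "&password=" + Uri.EscapeDataString("secret") +
                "&__VIEWSTATE=" + Uri.EscapeDataString("value copied from the login page");

            var request = (HttpWebRequest)WebRequest.Create(loginUrl);
            request.Method = "POST";                      // same verb as the browser (item 3)
            request.CookieContainer = cookies;            // round-trip cookies (item 1)
            request.ContentType = "application/x-www-form-urlencoded";
            request.UserAgent = "Mozilla/5.0";            // mirror the browser's headers (item 5)
            request.Accept = "text/html,application/xhtml+xml";
            request.Referer = loginUrl;                   // some sites check this

            byte[] payload = Encoding.UTF8.GetBytes(body);
            using (Stream s = request.GetRequestStream())
                s.Write(payload, 0, payload.Length);

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                Console.WriteLine(html.Length + " bytes received; cookies captured: "
                                  + cookies.GetCookieHeader(new Uri(loginUrl)));
            }
        }
    }

Reuse the same CookieContainer instance for every subsequent request so the logged-in session survives across requests.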
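
And a sketch of items 6 and 7, again with placeholder URLs: automatic redirects are turned off so cookies and headers stay attached on the post-redirect hop, and the sub-request URL that actually returns the flash text (found by watching Fiddler) is fetched directly with the same cookies. The member-area and data URLs below are hypothetical.

    using System;
    using System.IO;
    using System.Net;

    class RedirectSketch
    {
        // Fetch a URL, following any redirect manually so the CookieContainer and
        // headers are re-attached on the post-redirect request (item 6).
        static string Fetch(string url, CookieContainer cookies)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.CookieContainer = cookies;
            request.AllowAutoRedirect = false;   // handle 3xx responses ourselves
            request.UserAgent = "Mozilla/5.0";

            using (var response = (HttpWebResponse)request.GetResponse())
            {
                int status = (int)response.StatusCode;
                if (status >= 300 && status < 400)
                {
                    // Resolve a possibly relative Location header and follow it.
                    Uri next = new Uri(new Uri(url), response.Headers["Location"]);
                    return Fetch(next.ToString(), cookies);
                }

                using (var reader = new StreamReader(response.GetResponseStream()))
                    return reader.ReadToEnd();
            }
        }

        static void Main()
        {
            // Normally this is the container already populated by the login POST.
            var cookies = new CookieContainer();

            // The main page may not contain the text at all; the flash movie often
            // loads it from a separate URL (item 7). Request that URL directly.
            string mainPage = Fetch("https://example.com/members/home", cookies);
            string flashData = Fetch("https://example.com/members/textdata", cookies);
            Console.WriteLine(mainPage.Length + " bytes of page HTML; flash text:");
            Console.WriteLine(flashData);
        }
    }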
