View Generated Source (After AJAX/JavaScript) in C#

Question

Is there a way to view the generated source of a web page (the code after all AJAX calls and JavaScript DOM manipulations have taken place) from a C# application without opening up a browser from the code?

Viewing the initial page using a WebRequest or WebClient object works ok, but if the page makes extensive use of JavaScript to alter the DOM on page load, then these don't provide an accurate picture of the page.
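To make the limitation concrete, here is a minimal sketch of the WebClient approach. The class name RawHtmlFetcher is made up for illustration. WebClient.DownloadString returns the markup as the server sent it and never runs scripts, so an element that JavaScript would fill in on page load comes back empty.

```csharp
using System;
using System.Net;

// Hypothetical helper for illustration; not part of the original question.
public static class RawHtmlFetcher
{
    // Returns the markup exactly as served. No JavaScript is executed,
    // so DOM changes made by scripts on page load are not reflected.
    public static string Fetch(string url)
    {
        using (WebClient client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }
}
```

For example, if a page contains an empty div plus a script that populates it, RawHtmlFetcher.Fetch(url) still returns the empty div alongside the unexecuted script text.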

I have tried the Selenium and WatiN UI testing frameworks, and they work perfectly, supplying the generated source as it appears after all JavaScript manipulations have completed. Unfortunately, they do this by opening an actual web browser, which is very slow. I've set up a Selenium server that offloads this work to another machine, but there is still a substantial delay.

Is there a .NET library that will load and parse a page (like a browser) and spit out the generated code? Clearly, Google and Yahoo aren't opening up browsers for every page they want to spider (of course, they may have more resources than I do...).

Is there such a library or am I out of luck unless I'm willing to dissect the source code of an open source browser?

SOLUTION

Well, thank you everyone for your help. I have a working solution that is about 10X faster than Selenium. Woo!

Thanks to this old article from beansoftware, I was able to use the System.Windows.Forms.WebBrowser control to download the page and parse it, then give me the generated source. Even though the control is in Windows.Forms, you can still run it from ASP.NET (which is what I'm doing); just remember to add System.Windows.Forms to your project references.

There are two notable things about the code. First, the WebBrowser control is created and used on a new thread. This is because it must run in a single-threaded apartment (STA).

Second, the GeneratedSource variable is set in two places. This is not due to an intelligent design decision :) I'm still working on it and will update this answer when I'm done. wb_DocumentCompleted() is called multiple times: first when the initial HTML is downloaded, then again when the first round of JavaScript completes. Unfortunately, the site I'm scraping has 3 different loading stages: 1) load the initial HTML, 2) do a first round of JavaScript DOM manipulation, 3) pause for half a second, then do a second round of JS DOM manipulation.

For some reason, the second round isn't caught by the wb_DocumentCompleted() function, but it is always caught when wb.ReadyState == Complete. So why not remove the assignment from wb_DocumentCompleted()? I'm still not sure why the second round isn't caught there, and that's where the beansoftware article recommended putting it. I'm going to keep looking into it. I just wanted to publish this code so anyone who's interested can use it. Enjoy!

using System.Threading;
using System.Windows.Forms;

public class WebProcessor
{
    private string GeneratedSource { get; set; }
    private string URL { get; set; }

    public string GetGeneratedHTML(string url)
    {
        URL = url;

        // The WebBrowser control requires a single-threaded apartment (STA),
        // so run it on a dedicated thread and block until that thread finishes.
        Thread t = new Thread(new ThreadStart(WebBrowserThread));
        t.SetApartmentState(ApartmentState.STA);
        t.Start();
        t.Join();

        return GeneratedSource;
    }

    private void WebBrowserThread()
    {
        WebBrowser wb = new WebBrowser();

        // Subscribe before navigating so no DocumentCompleted event is missed.
        wb.DocumentCompleted +=
            new WebBrowserDocumentCompletedEventHandler(
                wb_DocumentCompleted);
        wb.Navigate(URL);

        // Pump the message loop until the control reports the page as complete.
        while (wb.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();

        //Added this line, because the final HTML takes a while to show up
        GeneratedSource = wb.Document.Body.InnerHtml;

        wb.Dispose();
    }

    // Raised once per completed document load, so it can run several times per page.
    private void wb_DocumentCompleted(object sender,
        WebBrowserDocumentCompletedEventArgs e)
    {
        WebBrowser wb = (WebBrowser)sender;
        GeneratedSource = wb.Document.Body.InnerHtml;
    }
}
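
One possible way to cope with the delayed second round of JavaScript, offered as a sketch and not as the fix the post was still looking for: keep pumping messages after the page reports Complete and return only once the body HTML has stopped changing for a quiet period. QuietPeriodDetector and SettledPageReader are hypothetical names, and the quiet period and timeout are assumptions you would tune per site.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper (not from the original post): reports when a series
// of HTML snapshots has stopped changing for a given quiet period.
public class QuietPeriodDetector
{
    private readonly TimeSpan quietPeriod;
    private string last;
    private TimeSpan lastChange;

    public QuietPeriodDetector(TimeSpan quietPeriod)
    {
        this.quietPeriod = quietPeriod;
    }

    // Most recent snapshot observed, or null if there has been none yet.
    public string Last { get { return last; } }

    // snapshot must be non-null. Returns true once the same snapshot
    // has been observed for at least quietPeriod.
    public bool Observe(string snapshot, TimeSpan now)
    {
        if (snapshot != last)
        {
            last = snapshot;
            lastChange = now;
            return false;
        }
        return now - lastChange >= quietPeriod;
    }
}

public static class SettledPageReader
{
    // Must run on an STA thread, like WebBrowserThread() above.
    // Returns the latest snapshot seen, or null if the page never
    // reached the Complete state before the timeout.
    public static string Read(string url, TimeSpan quietPeriod, TimeSpan timeout)
    {
        QuietPeriodDetector detector = new QuietPeriodDetector(quietPeriod);
        Stopwatch clock = Stopwatch.StartNew();

        using (WebBrowser wb = new WebBrowser())
        {
            wb.Navigate(url);
            while (clock.Elapsed < timeout)
            {
                Application.DoEvents(); // let the control load and run scripts
                Thread.Sleep(10);       // avoid spinning at 100% CPU

                if (wb.ReadyState != WebBrowserReadyState.Complete
                    || wb.Document == null || wb.Document.Body == null)
                    continue;

                string html = wb.Document.Body.InnerHtml ?? string.Empty;
                if (detector.Observe(html, clock.Elapsed))
                    break;
            }
        }
        return detector.Last;
    }
}
```

Inside WebBrowserThread() this could be used as GeneratedSource = SettledPageReader.Read(URL, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30)); the one-second quiet period is chosen to be longer than the half-second pause described above.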

Answer

It is possible using an instance of a browser (in your case, the IE control). You can easily use it in your app and open a page. The control will then load the page and process any JavaScript. Once this is done, you can access the control's DOM object and get the "interpreted" code.
