C# .NET: Scraping dynamic (JS) websites

Question

After hours of failed attempts, I am coming here. I need to scrape a dynamically generated web page (built with Vue.js, but I would prefer not to share the link).

I have tried multiple approaches (1, 2, 3). None of them works on this webpage.

The most promising solution was using Selenium and PhantomJS. I tried it like this and I'm not sure why it's not even working for Google:

private void button1_Click(object sender, EventArgs e) {
        PhantomJSDriverService service = PhantomJSDriverService.CreateDefaultService();
        service.IgnoreSslErrors = true;
        service.LoadImages = false;
        service.ProxyType = "none";

        var driver = new PhantomJSDriver(service); // I also tried: new PhantomJSDriver();
        driver.Manage().Timeouts().PageLoad = TimeSpan.FromSeconds(10);
        driver.Url = "https://google.com";
        driver.Navigate();

        var source = driver.PageSource;
        textBox1.AppendText(source);
}

It does not work.

I also tried a WebBrowser control, but the page never fully loads.

(I found out that WebBrowser just instantiates IE. When I open the target website in a standalone IE browser, the page also never loads completely, so it makes sense to see the same behaviour inside the control. Because of this, I think I am bound to Selenium and PhantomJS.)

Surely this shouldn't be so complicated. How do I do it properly?

Recommended answer

If you need to scrape a website, you can use the ScrapySharp scraping framework. You can add it to a project as a NuGet package: https://www.nuget.org/packages/ScrapySharp/

Install-Package ScrapySharp -Version 2.6.2

It has many useful properties for accessing different elements on the page. For example, to access the entire HTML of the page you can use the following:

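        using System;
        using HtmlAgilityPack;
        using ScrapySharp.Network;

        // Fetch the page and print the HTML inside its root node.
        ScrapingBrowser browser = new ScrapingBrowser();
        WebPage pageResult = browser.NavigateToPage(new Uri("http://www.example-site.com"));
        HtmlNode rawHtml = pageResult.Html;
        Console.WriteLine(rawHtml.InnerHtml);
        Console.ReadLine(); // keep the console window open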
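
To pick out individual elements rather than the whole document, ScrapySharp also provides a CssSelect extension method on HtmlNode (in the ScrapySharp.Extensions namespace). The sketch below is only an illustration: the LinkExtractor class, the ExtractLinks helper and the inline HTML string are made up for this example, and the HTML is parsed from a string so the snippet runs without network access. One caveat: as far as I know, ScrapySharp does not execute JavaScript, so it only sees the HTML the server returns, and content rendered client-side by Vue.js may be missing from it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

public static class LinkExtractor
{
    // Hypothetical helper: returns the non-empty href of every <a> under root.
    public static List<string> ExtractLinks(HtmlNode root)
    {
        return root.CssSelect("a")
                   .Select(a => a.GetAttributeValue("href", string.Empty))
                   .Where(href => href.Length > 0)
                   .ToList();
    }

    public static void Main()
    {
        // Inline HTML stands in for PageResult.Html from the snippet above.
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>");

        foreach (string href in ExtractLinks(doc.DocumentNode))
        {
            Console.WriteLine(href);
        }
    }
}
```

With a live page, the same helper could be called as LinkExtractor.ExtractLinks(PageResult.Html).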
