How to make asynchronous calls using HtmlAgilityPack?


Problem description

I'm trying to get the table with id table-matches, available here. The problem is that the table is loaded using ajax, so I don't get the full HTML when I download the page:

string url = "http://www.oddsportal.com/matches/soccer/20180701/";

using (HttpClient client = new HttpClient())
{
    using (HttpResponseMessage response = client.GetAsync(url).Result)
    {
        using (HttpContent content = response.Content)
        {
            string result = content.ReadAsStringAsync().Result;
        }
    }
}

The returned HTML does not contain any table. To check whether the library was at fault, I turned JavaScript off in Chrome (from the developer tools, F12) and got the same result in the browser.

To fix this problem I thought of using a WebBrowser control, in particular:

webBrowser.Navigate("oddsportal.com/matches/soccer/20140221/"); 
HtmlElementCollection elements = webBrowser.Document.GetElementsByTagName("table");

But I want to ask whether I can also load the full HTML using asynchronous calls. Has anyone encountered a similar problem?

Could you please share a solution? Thanks.

Recommended answer

The main issue with this page is that the content inside table-matches is loaded via ajax, and neither HttpClient nor HtmlAgilityPack can wait for ajax to be executed. Therefore, you need a different approach.

Approach #1 - Use any headless browser, such as PuppeteerSharp

using PuppeteerSharp;
using System;
using System.Threading.Tasks;

namespace PuppeteerSharpDemo
{
    class Program
    {
        private static String url = "http://www.oddsportal.com/matches/soccer/20180701/";

        static void Main(string[] args)
        {
            var htmlAsTask = LoadAndWaitForSelector(url, "#table-matches .table-main");
            htmlAsTask.Wait();
            Console.WriteLine(htmlAsTask.Result);

            Console.ReadKey();
        }

        public static async Task<string> LoadAndWaitForSelector(String url, String selector)
        {
            // Dispose the browser as well as the page so the Chrome process is closed.
            using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions
            {
                Headless = true,
                ExecutablePath = @"c:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
            }))
            using (var page = await browser.NewPageAsync())
            {
                await page.GoToAsync(url);
                // Wait until the ajax-loaded element exists before reading the HTML.
                await page.WaitForSelectorAsync(selector);
                return await page.GetContentAsync();
            }
        }
    }
}

To keep this answer clean, I've posted the output here. Once you have the HTML content, you can parse it with HtmlAgilityPack.
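The parsing step is not shown in the answer, so here is a minimal sketch of what it could look like. The class name TableParser and the method ExtractRows are hypothetical helpers invented for illustration, and the sketch assumes the matches are laid out as ordinary tr/td rows inside the element with the given id; the real markup may differ.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper (not part of the original answer).
public static class TableParser
{
    // Returns the trimmed text of every td/th cell, one list per <tr>,
    // found under the element whose id attribute equals tableId.
    public static List<List<string>> ExtractRows(string html, string tableId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var container = doc.DocumentNode.SelectSingleNode($"//*[@id='{tableId}']");
        if (container == null)
        {
            return new List<List<string>>();
        }

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var rows = container.SelectNodes(".//tr");
        if (rows == null)
        {
            return new List<List<string>>();
        }

        return rows
            .Select(tr => tr.ChildNodes
                .Where(n => n.Name == "td" || n.Name == "th")
                .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
                .ToList())
            .ToList();
    }
}
```

Passing the string returned by LoadAndWaitForSelector together with "table-matches" should give one list of cell texts per table row, which can then be filtered or mapped as needed.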

Approach #2 - Use plain Selenium WebDriver. It can be launched in headless mode.

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;
using System;

namespace SeleniumDemo
{
    class Program
    {
        private static IWebDriver webDriver;
        private static TimeSpan defaultWait = TimeSpan.FromSeconds(10);
        private static String targetUrl = "http://www.oddsportal.com/matches/soccer/20180701/";
        private static String driversDir = @"../../Drivers/";

        static void Main(string[] args)
        {
            webDriver = new ChromeDriver(driversDir);
            try
            {
                webDriver.Navigate().GoToUrl(targetUrl);
                IWebElement table = webDriver.FindElement(By.Id("table-matches"));
                var innerHtml = table.GetAttribute("innerHTML");
            }
            finally
            {
                // Close the browser and stop the chromedriver process.
                webDriver.Quit();
            }
        }

        #region (!) I didn't even use this, but it can be useful (!)
        public static IWebElement FindElement(By by)
        {
            try
            {
                WaitForAjax();
                var wait = new WebDriverWait(webDriver, defaultWait);
                return wait.Until(driver => driver.FindElement(by));
            }
            catch
            {
                return null;
            }
        }

        public static void WaitForAjax()
        {
            var wait = new WebDriverWait(webDriver, defaultWait);
            wait.Until(d => (bool)(d as IJavaScriptExecutor).ExecuteScript("return jQuery.active == 0"));
        }
        #endregion
    }
}
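
The listing above starts a visible Chrome window. As a rough sketch of the headless variant mentioned in the heading, the following passes the --headless argument through ChromeOptions and waits explicitly for the ajax-loaded element. The class and method names are hypothetical, the CSS selector is borrowed from Approach #1, and the sketch assumes a chromedriver that the ChromeDriver constructor can locate on its own (for example on the PATH).

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

// Hypothetical helper (not part of the original answer).
public static class HeadlessSeleniumSketch
{
    // Builds Chrome options that start the browser without a visible window.
    public static ChromeOptions BuildHeadlessOptions()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");
        return options;
    }

    // Loads the page, waits until the element matching cssSelector exists,
    // and returns that element's inner HTML.
    public static string GetInnerHtml(string url, string cssSelector, TimeSpan timeout)
    {
        using (IWebDriver driver = new ChromeDriver(BuildHeadlessOptions()))
        {
            driver.Navigate().GoToUrl(url);

            // WebDriverWait retries FindElement until it succeeds or the timeout expires.
            var wait = new WebDriverWait(driver, timeout);
            IWebElement element = wait.Until(d => d.FindElement(By.CssSelector(cssSelector)));
            return element.GetAttribute("innerHTML");
        }
    }
}
```

A call such as GetInnerHtml(targetUrl, "#table-matches .table-main", TimeSpan.FromSeconds(10)) would then replace the body of Main; disposing the driver in the using block also shuts the browser down.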

Approach #3 - Simulate the ajax requests

If you analyse the page load with Fiddler or the browser's network profiler (F12), you can see that all the data arrives in these two requests:

So you can try to execute them directly using HttpClient. In that case, however, you may need to track authorization headers, and possibly other values, for each HTTP request.
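
As a hedged illustration of that last approach, the sketch below sends a GET request with a few headers that a browser typically attaches to ajax calls. Everything specific here is an assumption: the endpoint URL must be copied from the network trace, and the header set (Referer, X-Requested-With, User-Agent) is only a common starting point, not what this particular site is known to require. The class and method names are hypothetical.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper (not part of the original answer).
public static class AjaxRequestSketch
{
    // Builds a GET request that resembles a browser-issued ajax call.
    // The headers are assumptions; compare them with the real trace.
    public static HttpRequestMessage BuildAjaxRequest(string ajaxUrl, string refererUrl)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, ajaxUrl);
        request.Headers.Referrer = new Uri(refererUrl);
        request.Headers.TryAddWithoutValidation("X-Requested-With", "XMLHttpRequest");
        request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0");
        return request;
    }

    // Sends the request asynchronously and returns the response body as a string.
    public static async Task<string> FetchAsync(HttpClient client, string ajaxUrl, string refererUrl)
    {
        using (var request = BuildAjaxRequest(ajaxUrl, refererUrl))
        using (var response = await client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

Because FetchAsync is awaited rather than blocked on with .Result, the calling thread stays free while the request is in flight. If the response turns out to be an HTML fragment, it can be handed to HtmlAgilityPack; if it is JSON or a script payload, it needs a different parser.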
