How to make asynchronous calls using HtmlAgilityPack?
Question
I'm trying to get the table with id table-matches from the page requested below. The problem is that the table is loaded using ajax, so I don't get the full HTML when I download the page:
string url = "http://www.oddsportal.com/matches/soccer/20180701/";
using (HttpClient client = new HttpClient())
{
    using (HttpResponseMessage response = client.GetAsync(url).Result)
    {
        using (HttpContent content = response.Content)
        {
            string result = content.ReadAsStringAsync().Result;
        }
    }
}
The HTML returned does not contain any table. To check whether the library was at fault, I turned JavaScript off in Chrome (from the Dev console, F12) and got the same result in the browser.
To fix this problem I thought of using a WebBrowser, in particular:
webBrowser.Navigate("oddsportal.com/matches/soccer/20140221/");
HtmlElementCollection elements = webBrowser.Document.GetElementsByTagName("table");
But I would like to ask whether I can also load the full HTML using asynchronous calls. Has anyone encountered a similar problem?
Could you please share a solution? Thanks.
Recommended answer
The main issue with this page is that the content inside table-matches is loaded via ajax, and neither HttpClient nor HtmlAgilityPack can wait for ajax to be executed. Therefore, you need a different approach.
Approach #1 - Use any headless browser like PuppeteerSharp
using PuppeteerSharp;
using System;
using System.Threading.Tasks;

namespace PuppeteerSharpDemo
{
    class Program
    {
        private static String url = "http://www.oddsportal.com/matches/soccer/20180701/";

        static void Main(string[] args)
        {
            // Main is synchronous here, so block until the async load completes.
            var htmlAsTask = LoadAndWaitForSelector(url, "#table-matches .table-main");
            htmlAsTask.Wait();
            Console.WriteLine(htmlAsTask.Result);
            Console.ReadKey();
        }

        public static async Task<string> LoadAndWaitForSelector(String url, String selector)
        {
            // Dispose the browser as well as the page so Chrome does not keep running.
            using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions
            {
                Headless = true,
                // Adjust this path to the local Chrome installation.
                ExecutablePath = @"c:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
            }))
            using (var page = await browser.NewPageAsync())
            {
                await page.GoToAsync(url);
                // Wait until the ajax-loaded table is present in the DOM.
                await page.WaitForSelectorAsync(selector);
                return await page.GetContentAsync();
            }
        }
    }
}
For the sake of brevity, I've posted the output separately. Once you have the HTML content, you can parse it with HtmlAgilityPack.
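The parsing step could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and that the rendered HTML has already been obtained as a string; the TableParser class and ExtractRows method are hypothetical names introduced for illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, introduced only for this example.
public static class TableParser
{
    // Returns one string array per <tr> found inside the element with the given id.
    public static List<string[]> ExtractRows(string html, string containerId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string[]>();
        HtmlNode container = doc.GetElementbyId(containerId);
        if (container == null)
            return rows; // element missing, e.g. the ajax content was never rendered

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection trNodes = container.SelectNodes(".//tr");
        if (trNodes == null)
            return rows;

        foreach (HtmlNode tr in trNodes)
        {
            HtmlNodeCollection cells = tr.SelectNodes("./th|./td");
            if (cells == null)
                continue;

            // Decode entities such as &amp; and trim surrounding whitespace.
            rows.Add(cells.Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim()).ToArray());
        }
        return rows;
    }
}
```

With the string returned by LoadAndWaitForSelector, a call such as TableParser.ExtractRows(html, "table-matches") would yield the cell texts row by row. Which columns carry which data depends on the page markup and is not assumed here.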
Approach #2 - Use pure Selenium WebDriver. It can be launched in headless mode.
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;
using System;

namespace SeleniumDemo
{
    class Program
    {
        private static IWebDriver webDriver;
        private static TimeSpan defaultWait = TimeSpan.FromSeconds(10);
        private static String targetUrl = "http://www.oddsportal.com/matches/soccer/20180701/";
        // Directory that contains the ChromeDriver executable.
        private static String driversDir = @"../../Drivers/";

        static void Main(string[] args)
        {
            webDriver = new ChromeDriver(driversDir);
            try
            {
                webDriver.Navigate().GoToUrl(targetUrl);
                IWebElement table = webDriver.FindElement(By.Id("table-matches"));
                var innerHtml = table.GetAttribute("innerHTML");
            }
            finally
            {
                // Close the browser and stop the driver process.
                webDriver.Quit();
            }
        }

        #region (!) I didn't even use this, but it can be useful (!)
        // Waits for pending ajax calls, then waits up to defaultWait for the element.
        public static IWebElement FindElement(By by)
        {
            try
            {
                WaitForAjax();
                var wait = new WebDriverWait(webDriver, defaultWait);
                return wait.Until(driver => driver.FindElement(by));
            }
            catch
            {
                return null;
            }
        }

        // Only works on pages that use jQuery: waits until no jQuery ajax request is active.
        public static void WaitForAjax()
        {
            var wait = new WebDriverWait(webDriver, defaultWait);
            wait.Until(d => (bool)(d as IJavaScriptExecutor).ExecuteScript("return jQuery.active == 0"));
        }
        #endregion
    }
}
Approach #3 - Emulate ajax requests
If you analyse the page loading using Fiddler or the browser's profiler (F12), you can see that all the data comes from these two requests:
So you can try to execute them directly using HttpClient. In this case, however, you may need to track authorization headers, and possibly other details, with each HTTP request.
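A minimal sketch of this approach with async/await is given below. The request URLs are not reproduced in this answer, so the endpoint must be taken from Fiddler or the browser's network tab. The AjaxEmulation class, its method names, and the Referer and X-Requested-With headers are assumptions made for illustration; the real requests may need different or additional headers.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, introduced only for this example.
public static class AjaxEmulation
{
    // A single HttpClient instance is reused for all requests.
    private static readonly HttpClient Client = new HttpClient();

    // Builds a GET request that resembles a browser ajax call.
    // The headers set here are assumptions; copy whatever the captured request sends.
    public static HttpRequestMessage BuildRequest(string ajaxUrl, string refererUrl)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, ajaxUrl);
        request.Headers.Referrer = new Uri(refererUrl);
        request.Headers.TryAddWithoutValidation("X-Requested-With", "XMLHttpRequest");
        return request;
    }

    // Sends the request without blocking the calling thread and returns the response body.
    public static async Task<string> FetchAsync(string ajaxUrl, string refererUrl)
    {
        using (HttpRequestMessage request = BuildRequest(ajaxUrl, refererUrl))
        using (HttpResponseMessage response = await Client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

From an async method, this would be called as await AjaxEmulation.FetchAsync(capturedUrl, pageUrl), where capturedUrl is one of the request URLs observed in the profiler and pageUrl is the address of the matches page. Unlike the .Result calls in the question, awaiting does not block the thread while the response is in flight.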