Scraping html list data from a dynamic server

Problem description

    Hello guys!

    Sorry for the dumb question, this is my last resort. I swear I tried countless other Stack Overflow questions, different frameworks, etc., but those didn't seem to help.

    I have the following problem: a website displays a list of data (there are a TON of div, li, span etc. tags in front of it; it's a big HTML).

    I'm writing a tool that fetches data from a specific list inside a ton of other div tags, downloads it and outputs an Excel file.

    The website I'm trying to access is dynamic. So you open the website, it loads a little bit, and then the list appears (probably some JS and stuff). When I try to download the website via a WebRequest in C#, the HTML I get is almost empty, with a ton of white space, lots of non-HTML stuff, and some garbage data as well.

    Now: I'm pretty used to C#, HtmlAgilityPack, and countless other libraries, but not so much to web-related stuff. I tried CefSharp, Chromium, etc., all of that stuff, but unfortunately couldn't get them to work properly.

    I want to have HTML in my program to work with that looks exactly like the HTML you see when you open the dev console in Chrome when visiting the website mentioned above. The HTML parser works flawlessly there.

    This is how I imagine the code could look, simplified.

    Extreme C# pseudocode:

    WebBrowserEngine web = new WebBrowserEngine();
    web.LoadURLuntilFinished(url); // with all the JS executed and stuff
    String html = web.getHTML();
    web.close();
    

    My goal would be that the string html in the pseudocode looks exactly like the one in the Chrome dev tab. Maybe there is a solution posted somewhere else, but I swear I couldn't find it; I've been looking for days.

    Any help is greatly appreciated.

    Solution

    @SpencerBench is spot on in saying

    It could be that the page is using some combination of scroll state, element visibility, or element positions to trigger content loading. If that's the case, then you'll need to figure out what it is and trigger it programmatically.

    To answer the question for your specific use case, we need to understand the behaviour of the page you want to scrape data from, or as I asked in the comments, how do you know the page is "finished"?

    However, it's possible to give a fairly generic answer to the question which should act as a starting point for you.

    This answer uses Selenium, a package which is commonly used for automating testing of web UIs, but as they say on their home page, that's not the only thing it can be used for.

    Primarily it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should) also be automated as well.

    The web site I'm scraping

    So first we need a web site. I've created one using ASP.NET Core MVC with .NET Core 3.1, although the web site's technology stack isn't important; it's the behaviour of the page you want to scrape that matters. This site has 2 pages, unimaginatively called Page1 and Page2.

    Page controllers

    There's nothing special in these controllers:

    namespace StackOverflow68925623Website.Controllers
    {
        using Microsoft.AspNetCore.Mvc;
    
        public class Page1Controller : Controller
        {
            public IActionResult Index()
            {
                return View("Page1");
            }
        }
    }
    

    namespace StackOverflow68925623Website.Controllers
    {
        using Microsoft.AspNetCore.Mvc;
    
        public class Page2Controller : Controller
        {
            public IActionResult Index()
            {
                return View("Page2");
            }
        }
    }
    

    API controller

    There's also an API controller (i.e. it returns data rather than a view) which the views can call asynchronously to get some data to display. This one just creates an array of the requested number of random strings.

    namespace StackOverflow68925623Website.Controllers
    {
        using Microsoft.AspNetCore.Mvc;
        using System;
        using System.Collections.Generic;
        using System.Text;
    
        [Route("api/[controller]")]
        [ApiController]
        public class DataController : ControllerBase
        {
            [HttpGet("Create")]
            public IActionResult Create(int numberOfElements)
            {
                var response = new List<string>();
                for (var i = 0; i < numberOfElements; i++)
                {
                    response.Add(RandomString(10));
                }
    
                return Ok(response);
            }
    
            private string RandomString(int length)
            {
                var sb = new StringBuilder();
                var random = new Random();
                for (var i = 0; i < length; i++)
                {
                    var characterCode = random.Next(65, 91); // A-Z (Next's upper bound is exclusive)
                    sb.Append((char)characterCode);
                }
    
                return sb.ToString();
            }
        }
    }
    

    Views

    Page1's view looks like this:

    @{
        ViewData["Title"] = "Page 1";
    }
    
    <div class="text-center">
        <div id="list" />
    
        <script src="~/lib/jquery/dist/jquery.min.js"></script>
        <script>
            var apiUrl = 'https://localhost:44394/api/Data/Create';
    
            $(document).ready(function () {
                $('#list').append('<li id="loading">Loading...</li>');
                $.ajax({
                    url: apiUrl + '?numberOfElements=20000',
                    dataType: 'json',
                    success: function (data) {
                        $('#loading').remove();
                        var insert = ''
                        for (var item of data) {
                            insert += '<li>' + item + '</li>';
                        }
                        insert = '<ul id="results">' + insert + '</ul>';
                        $('#list').html(insert);
                    },
                    error: function (xht, status) {
                        alert('Error: ' + status);
                    }
                });
            });
        </script>
    </div>
    

    So when the page first loads, it just contains an empty div called list; however, the page load triggers the function passed to jQuery's $(document).ready function, which makes an asynchronous call to the API controller, requesting an array of 20,000 elements. While the call is in progress, "Loading..." is displayed on the screen, and when the call returns, this is replaced by an unordered list containing the received data. This is written in a way intended to be friendly to developers of automated UI tests, or of screen scrapers, because we can tell whether all the data has loaded by testing whether or not the page contains an element with the ID results.
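
    As an aside, jumping ahead to the Selenium scraper described below, here is a minimal sketch (not part of the original site or scraper code, and Page1Probe is just a throwaway name) of what that "is it finished?" check can look like. FindElements (plural) returns an empty collection rather than throwing when nothing matches, so it works as a simple existence probe:

    using OpenQA.Selenium;

    public static class Page1Probe
    {
        // True once the unordered list with the ID "results" has been rendered
        public static bool HasFinishedLoading(IWebDriver driver) =>
            driver.FindElements(By.Id("results")).Count > 0;
    }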

    Page2's view looks like this:

    @{
        ViewData["Title"] = "Page 2";
    }
    
    <div class="text-center">
        <div id="list">
            <ul id="results" />
        </div>
    
        <script src="~/lib/jquery/dist/jquery.min.js"></script>
        <script>
            var apiUrl = 'https://localhost:44394/api/Data/Create';
            var requestCount = 0;
            var maxRequests = 20;
    
            $(document).ready(function () {
                getData();
            });
    
            function getDataIfAtBottomOfPage() {
                console.log("scroll - " + requestCount + " requests");
                if (requestCount < maxRequests) {
                    console.log("scrollTop " + document.documentElement.scrollTop + " scrollHeight " + document.documentElement.scrollHeight);
                    if (document.documentElement.scrollTop > (document.documentElement.scrollHeight - window.innerHeight - 100)) {
                        getData();
                    }
                }
            }
    
            function getData() {
                window.onscroll = undefined;
                requestCount++;
                $('#results').append('<li id="loading">Loading...</li>');
                $.ajax({
                    url: apiUrl + '?numberOfElements=50',
                    dataType: 'json',
                    success: function (data) {
                        var insert = ''
                        for (var item of data) {
                            insert += '<li>' + item + '</li>';
                        }
                        $('#loading').remove();
                        $('#results').append(insert);
                        if (requestCount < maxRequests) {
                            window.setTimeout(function () { window.onscroll = getDataIfAtBottomOfPage }, 1000);
                        } else {
                            $('#results').append('<li>That\'s all folks');
                        }
                    },
                    error: function (xht, status) {
                        alert('Error: ' + status);
                    }
                });
            }
        </script>
    </div>
    

    This gives a nicer user experience because it requests data from the API controller in multiple smaller chunks, so the first chunk of data appears fairly quickly, and once the user has scrolled down to somewhere near the bottom of the page, the next chunk of data is requested, until 20 chunks have been requested and displayed, at which point the text "That's all folks" is added to the end of the unordered list. However this is more difficult to interact with programmatically because you need to scroll the page down to make the new data appear.

    (Yes, this implementation is a bit buggy - if the user gets to the bottom of the page too quickly then requesting the next chunk of data doesn't happen until they scroll up a bit. But the question isn't about how to implement this behaviour in a web page, but about how to scrape the displayed data, so please forgive my bugs.)

    The scraper

    I've implemented the scraper as an xUnit unit test project, just because I'm not doing anything with the data I've scraped from the web site other than Asserting that it is of the correct length, and therefore proving that I haven't prematurely assumed that the web page I'm scraping from is "finished". You can put most of this code (other than the Asserts) into any type of project.

    Having created your scraper project, you need to add the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver nuget packages.

    Page Object Model

    I'm using the Page Object Model pattern to provide a layer of abstraction between functional interaction with the page and the implementation detail of how to code that interaction. Each of the pages in the web site has a corresponding page model class for interacting with that page.

    First, a base class with some code which is common to more than one page model class.

    namespace StackOverflow68925623Scraper
    {
        using System;
        using OpenQA.Selenium;
        using OpenQA.Selenium.Support.UI;
    
        public class PageModel
        {
            protected PageModel(IWebDriver driver)
            {
                this.Driver = driver;
            }
    
            protected IWebDriver Driver { get; }
    
            public void ScrollToTop()
            {
                var js = (IJavaScriptExecutor)this.Driver;
                js.ExecuteScript("window.scrollTo(0, 0)");
            }
    
            public void ScrollToBottom()
            {
                var js = (IJavaScriptExecutor)this.Driver;
                js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight)");
            }
    
            protected IWebElement GetById(string id)
            {
                try
                {
                    return this.Driver.FindElement(By.Id(id));
                }
                catch (NoSuchElementException)
                {
                    return null;
                }
            }
    
            protected IWebElement AwaitGetById(string id)
            {
                var wait = new WebDriverWait(Driver, TimeSpan.FromSeconds(10));
                return wait.Until(e => e.FindElement(By.Id(id)));
            }
        }
    }
    

    This base class gives us 4 convenience methods:

    • Scroll to the top of the page
    • Scroll to the bottom of the page
    • Get the element with the supplied ID, or return null if it doesn't exist
    • Get the element with the supplied ID, or wait for up to 10 seconds for it to appear if it doesn't exist yet

    And each page in the web site has its own model class, derived from that base class.

    namespace StackOverflow68925623Scraper
    {
        using OpenQA.Selenium;
    
        public class Page1Model : PageModel
        {
            public Page1Model(IWebDriver driver) : base(driver)
            {
            }
    
            public IWebElement AwaitResults => this.AwaitGetById("results");
    
            public void Navigate()
            {
                this.Driver.Navigate().GoToUrl("https://localhost:44394/Page1");
            }
        }
    }
    

    namespace StackOverflow68925623Scraper
    {
        using OpenQA.Selenium;
    
        public class Page2Model : PageModel
        {
            public Page2Model(IWebDriver driver) : base(driver)
            {
            }
    
            public IWebElement Results => this.GetById("results");
    
            public void Navigate()
            {
                this.Driver.Navigate().GoToUrl("https://localhost:44394/Page2");
            }
        }
    }
    

    And the Scraper class:

    namespace StackOverflow68925623Scraper
    {
        using OpenQA.Selenium.Chrome;
        using System;
        using System.Threading;
        using Xunit;
    
        public class Scraper
        {
            [Fact]
            public void TestPage1()
            {
                // Arrange
                var driver = new ChromeDriver();
                var page = new Page1Model(driver);
                page.Navigate();
                try
                {
                    // Act
                    var actualResults = page.AwaitResults.Text.Split(Environment.NewLine);
    
                    // Assert
                    Assert.Equal(20000, actualResults.Length);
                }
                finally
                {
                    // Ensure the browser window closes even if things go pear-shaped
                    driver.Quit();
                }
            }
    
            [Fact]
            public void TestPage2()
            {
                // Arrange
                var driver = new ChromeDriver();
                var page = new Page2Model(driver);
                page.Navigate();
                try
                {
                    // Act
                    while (!page.Results.Text.Contains("That's all folks"))
                    {
                        Thread.Sleep(1000);
                        page.ScrollToBottom();
                        page.ScrollToTop();
                    }
    
                    var actualResults = page.Results.Text.Split(Environment.NewLine);
    
                    // Assert - we expect 1001 because of the extra "that's all folks"
                    Assert.Equal(1001, actualResults.Length);
                }
                finally
                {
                    // Ensure the browser window closes even if things go pear-shaped
                    driver.Quit();
                }
            }
        }
    }
    

    So, what's happening here?

    // Arrange
    var driver = new ChromeDriver();
    var page = new Page1Model(driver);
    page.Navigate();
    

    ChromeDriver is in the Selenium.WebDriver.ChromeDriver package and implements the IWebDriver interface from the Selenium.WebDriver package with the code to interact with the Chrome browser. Other packages are available containing implementations for all popular browsers. Instantiating the driver object opens a browser window, and calling its Navigate method directs the browser to the page we want to test/scrape.
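
    Incidentally, this opens a visible browser window by default. If you'd rather scrape without one, ChromeDriver also accepts a ChromeOptions instance; here's a minimal sketch (the "--headless" switch is a standard Chrome argument, but treat this as an optional variation, not something the tests above rely on):

    using OpenQA.Selenium.Chrome;

    var options = new ChromeOptions();
    options.AddArgument("--headless");             // run Chrome without a visible window
    using var driver = new ChromeDriver(options);  // otherwise used exactly as before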

    // Act
    var actualResults = page.AwaitResults.Text.Split(Environment.NewLine);
    

    Because on Page1, the results element doesn't exist until all the data has been displayed, and no user interaction is required in order for it to be displayed, we use the page model's AwaitResults property to just wait for that element to appear and return it once it has appeared.

    AwaitResults returns an IWebElement instance representing the element, which in turn has various methods and properties we can use to interact with the element. In this case we use its Text property, which returns the element's contents as a string, without any markup. Because the data is displayed as an unordered list, each element in the list is delimited by a line break, so we can use String's Split method to convert it to a string array.
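
    If what you actually want is the full rendered HTML (as in the question's pseudocode) rather than the text of one element, IWebDriver also exposes the current DOM as a string via its PageSource property, which you could then feed to HtmlAgilityPack or any other parser. A rough sketch, reusing the driver and page objects from TestPage1 above:

    // Wait for the data to appear, then grab the whole rendered document
    _ = page.AwaitResults;
    string renderedHtml = driver.PageSource;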

    Page2 needs a different approach - we can't use the presence of the results element to determine whether the data has all been displayed, because that element is on the page right from the start; instead, we need to check for the string "That's all folks", which is written right at the end of the last chunk of data. Also, the data isn't loaded all in one go, and we need to keep scrolling down in order to trigger the loading of the next chunk of data.

    // Act
    while (!page.Results.Text.Contains("That's all folks"))
    {
        Thread.Sleep(1000);
        page.ScrollToBottom();
        page.ScrollToTop();
    }
    
    var actualResults = page.Results.Text.Split(Environment.NewLine);
    

    Because of the bug in the UI that I mentioned earlier, if we get to the bottom of the page too quickly, the fetch of the next chunk of data isn't triggered, and attempting to scroll down when already at the bottom of the page doesn't raise another scroll event. That's why I'm scrolling to the bottom of the page and then back to the top - that way I can guarantee that a scroll event is raised. You never know, the web site you're trying to scrape data from may itself be buggy.
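
    If you'd rather not hand-roll that polling loop with Thread.Sleep, the same idea can be expressed with the WebDriverWait class already used in the base page model. A sketch, assuming a two-minute overall timeout is acceptable (tune it to your site) and reusing the driver and page objects from TestPage2:

    using OpenQA.Selenium.Support.UI;
    using System;

    var wait = new WebDriverWait(driver, TimeSpan.FromMinutes(2));
    wait.Until(_ =>
    {
        // Each poll forces a scroll event, then checks for the end-of-data marker
        page.ScrollToBottom();
        page.ScrollToTop();
        return page.Results.Text.Contains("That's all folks");
    });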

    Once the "That's all folks" text has appeared, we can go ahead and get the results element's Text property and convert it to a string array as before.

    // Assert - we expect 1001 because of the extra "that's all folks"
    Assert.Equal(1001, actualResults.Length);
    

    This is the bit that won't be in your code. Because I'm scraping a web site which is under my control, I know exactly how much data it should be displaying so I can check that I've got all the data, and therefore that my scraping code is working correctly.

    Further reading

    Absolute beginner's introduction to Selenium: https://www.guru99.com/selenium-csharp-tutorial.html

    (A curiosity in that article is the way that it starts by creating a console application project and later changes its output type to class library and manually adds the unit test packages, when the project could have been created using one of Visual Studio's unit test project templates. It gets to the right place in the end, albeit via a rather odd route.)

    Selenium documentation: https://www.selenium.dev/documentation/

    Happy scraping!
