How to make the Apify Crawler scroll the full page when a web page has infinite scrolling?

Problem description

I am unable to get all the product data because the website lazy-loads its product catalog page, meaning the page has to be scrolled until everything has loaded.

I am getting only the first page of product data.

Recommended answer

First, keep in mind that there are infinitely many ways to implement infinite scroll. Sometimes you have to click buttons along the way or trigger some kind of transition. I will cover only the simplest use case here: scrolling down at a fixed interval and finishing when no new products are loaded.

  1. If you build your own actor using the Apify SDK, you can use the infiniteScroll helper utility function. If it doesn't cover your use case, please give us feedback on GitHub.

  2. If you are using one of the generic scrapers (Web Scraper or Puppeteer Scraper), infinite scroll is not currently built in (though it may be by the time you read this). On the other hand, it is not that complicated to implement yourself. Here is a simple solution for Web Scraper's pageFunction.

async function pageFunction(context) {
    // A few utilities taken from the context object
    const { request, log, jQuery } = context;
    const $ = jQuery;

    // Here we define the infinite scroll function, it has to be defined inside pageFunction
    const infiniteScroll = async (maxTime) => {
        const startedAt = Date.now();
        let itemCount = $('.my-class').length; // Update the selector
        while (true) {
            log.info(`INFINITE SCROLL --- ${itemCount} items loaded --- ${request.url}`);
            // timeout to prevent infinite loop
            if (Date.now() - startedAt > maxTime) {
                return;
            }
            scrollBy(0, 9999);
            await context.waitFor(5000); // This can be any number that works for your website
            const currentItemCount = $('.my-class').length; // Update the selector

            // We check if the number of items changed after the scroll, if not we finish
            if (itemCount === currentItemCount) {
                return;
            }
            itemCount = currentItemCount;
        }
    }

    // Generally, you want to do the scrolling only on the category type page
    if (request.userData.label === 'CATEGORY') {
        await infiniteScroll(60000); // Let's try 60 seconds max

        // ... Add your logic for categories
    } else {
        // Any logic for other types of pages
    }
}
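
The stopping rule in the loop above (give up after a time limit, or finish as soon as a scroll brings no new items) can also be written independently of the browser, which makes it easy to try out in plain Node.js. The following is only a sketch of that idea: `scrollUntilStable` and its `countItems`, `scrollDown`, `wait`, and `now` callbacks are hypothetical names introduced for illustration and are not part of Web Scraper or the Apify SDK.

```javascript
// Sketch of the stopping rule only. All callbacks are hypothetical stand-ins
// for whatever your scraper provides (e.g. a jQuery count, scrollBy, waitFor).
async function scrollUntilStable({ countItems, scrollDown, wait, maxTimeMs, now = Date.now }) {
    const startedAt = now();
    let itemCount = await countItems();

    // Time limit to prevent an endless loop on pages that never stop loading
    while (now() - startedAt <= maxTimeMs) {
        await scrollDown();
        await wait();
        const currentItemCount = await countItems();

        // No new items appeared after the scroll, so assume the list is complete
        if (currentItemCount === itemCount) {
            return { itemCount, reason: 'no-new-items' };
        }
        itemCount = currentItemCount;
    }
    return { itemCount, reason: 'timeout' };
}
```

Inside a pageFunction the callbacks could be wired to the same calls used above, for example `countItems: () => $('.my-class').length`, `scrollDown: () => scrollBy(0, 9999)`, and `wait: () => context.waitFor(5000)`. Returning the reason makes it possible to log whether the page was fully loaded or the time limit cut the scroll short.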

Of course, this is a really trivial example. Sometimes it can get much more complicated. I once even used Puppeteer to control the mouse directly and drag a scroll bar that could not be scrolled programmatically.
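
The answer does not show how that mouse-driven scrolling was done, so the following is only a rough sketch of what such an approach might look like with Puppeteer's mouse API. It assumes you have a Puppeteer `page` object (as in Puppeteer Scraper, not Web Scraper), and the helper name `dragScrollbar`, the handle selector, and the drag distance are all hypothetical.

```javascript
// Rough sketch: drag a custom scroll bar handle downwards with the mouse.
// `page` is a Puppeteer Page; the selector and distance are hypothetical.
async function dragScrollbar(page, handleSelector, distancePx, steps = 20) {
    const handle = await page.$(handleSelector);
    if (!handle) throw new Error(`Scroll handle not found: ${handleSelector}`);

    // boundingBox() resolves to null when the element is not visible
    const box = await handle.boundingBox();
    if (!box) throw new Error(`Scroll handle is not visible: ${handleSelector}`);

    // Grab the handle at its center
    const x = box.x + box.width / 2;
    const y = box.y + box.height / 2;

    await page.mouse.move(x, y);
    await page.mouse.down();
    // Move in several steps so the page sees a drag rather than a jump
    await page.mouse.move(x, y + distancePx, { steps });
    await page.mouse.up();
}
```

A call such as `await dragScrollbar(page, '.scrollbar-handle', 400)` would then take the place of `scrollBy` in the scroll loop, with the item-count check staying the same.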
