Scrape websites with infinite scrolling

Question

I have written many scrapers, but I am not sure how to handle infinite scrolling. These days most websites, such as Facebook and Pinterest, use infinite scrolling.

Recommended answer

You can use Selenium to scrape websites with infinite scrolling, such as Twitter or Facebook.

Step 1: Install Selenium using pip

pip install selenium 

Step 2: Use the code below to automate the infinite scroll and extract the page source

import time
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By


class Sel(unittest.TestCase):
    def setUp(self):
        # Start a Firefox session (needs Firefox and its driver installed).
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"

    def tearDown(self):
        # Close the browser even if the test fails.
        self.driver.quit()

    def test_sel(self):
        driver = self.driver
        driver.get(self.base_url + "/search?q=stckoverflow&src=typd")
        # Newer Selenium releases removed find_element_by_link_text;
        # find_element(By.LINK_TEXT, ...) is the current equivalent.
        driver.find_element(By.LINK_TEXT, "All").click()
        # Scroll to the bottom repeatedly so the page keeps loading results.
        for _ in range(1, 100):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)  # give newly requested content time to load
        html_source = driver.page_source
        data = html_source.encode('utf-8')  # page source as UTF-8 bytes


if __name__ == "__main__":
    unittest.main()
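
The loop above always scrolls 99 times, even when the page has already stopped loading new content. As a rough sketch of an alternative, the helper below keeps scrolling only while the page height keeps growing. The function name `scroll_until_stable` and the `pause` and `max_scrolls` parameters are assumptions made for illustration and are not part of the original answer. It only relies on the driver's `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_scrolls=100):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `pause` and `max_scrolls`
    are illustrative defaults, not values from the original answer.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # wait for lazily loaded content to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded; assume the end was reached
        last_height = new_height
    return scrolls
```

Inside `test_sel`, a call such as `scroll_until_stable(driver, pause=4)` could stand in for the fixed loop. A slow network can make the height look stable too early, so the pause may need tuning for a given site.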

Step 3: Print the data if required.
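
Printing the whole page source is rarely useful on its own, so one possible next step is to pull out only the parts of interest. The sketch below uses the standard library's `html.parser` to list link targets found in the UTF-8 bytes from Step 2. `LinkCollector`, `extract_links`, and the sample markup are made up for illustration. A real page would need selectors matched to its actual structure.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(data):
    """Decode UTF-8 page source bytes and return the links found in it."""
    parser = LinkCollector()
    parser.feed(data.decode("utf-8"))
    parser.close()
    return parser.links


# Stand-in for the `data` bytes produced in Step 2 (made-up markup).
sample = '<div><a href="/user/1">one</a><a href="/user/2">two</a></div>'.encode("utf-8")
for link in extract_links(sample):
    print(link)
```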
