Handling an infinite scroll UI in BeautifulSoup


Question

I'm looking at how to scrape the LinkedIn connections page (https://www.linkedin.com/mynetwork/invite-connect/connections/), but it seems impossible with infinite scroll. How do I deal with it? I don't want to use Selenium (I want to implement this as a web service later on).

import requests
from bs4 import BeautifulSoup

def scraping(webpage):
    # Send a browser-like User-Agent with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(str(webpage), headers=headers, timeout=10)
    # Only the HTML served for this single request is parsed;
    # anything the page loads later while scrolling is not included.
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup)

scraping('https://www.linkedin.com/mynetwork/invite-connect/connections')

Recommended answer

BeautifulSoup can only help with the HTML you give it; you'll need to get LinkedIn to return more HTML. The content isn't in the HTML you have, so you must fetch it. The browser is probably running LinkedIn's JavaScript, which notices that you're scrolling, fetches more content, and injects more HTML into the page. You need to replicate this content fetch somehow.

Bad news: BeautifulSoup isn't aware of APIs or JavaScript. You'll need another tool.

Good news: there are tools for this! You could certainly use Selenium. That would probably be the simplest way to solve this, since it replicates the browser environment pretty well for these purposes.
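As a rough sketch of what the Selenium route could look like: the helper below keeps scrolling to the bottom until the page height stops growing. It relies only on the driver's `execute_script` method. The function name and the `pause` and `max_rounds` parameters are assumptions made up for illustration, and a real run against LinkedIn would also need a logged-in session.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls.

    `driver` is expected to be a Selenium WebDriver (anything exposing
    `execute_script` works). `pause` and `max_rounds` are illustrative defaults.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_rounds):
        # Jump to the bottom so the page's own JavaScript loads the next batch.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # crude wait for the new content to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was added, so assume the end was reached
        last_height = new_height
    return scrolls
```

Once it returns, `driver.page_source` should hold the expanded HTML, which can be handed to `BeautifulSoup(driver.page_source, "html.parser")` just like `response.text` in the question.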

If you are absolutely committed to not using Selenium, I recommend you deep-dive on the LinkedIn site, figure out which bits of JavaScript are responsible for fetching more data, replicate the network requests they make, and then parse that data yourself.
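A minimal sketch of that approach, assuming the page loads its data from a paginated JSON endpoint. Everything specific here is hypothetical: the `API_URL`, the `start` and `count` query parameters, and the `elements` key are placeholders, not LinkedIn's actual API. The real values would have to be read off the browser's Network tab while scrolling, and a real request would most likely also need the cookies of a logged-in session.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with whatever request shows up in the
# browser's Network tab while you scroll.
API_URL = "https://example.com/api/connections"


def fetch_page(start, count):
    """Fetch one batch of results as parsed JSON (assumed response format)."""
    query = urllib.parse.urlencode({"start": start, "count": count})
    request = urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def fetch_all(fetch=fetch_page, count=40, max_pages=100):
    """Keep requesting batches until an empty one comes back."""
    items = []
    for page in range(max_pages):
        data = fetch(page * count, count)
        batch = data.get("elements", [])  # "elements" is an assumed key
        if not batch:
            break
        items.extend(batch)
    return items
```

Passing the fetch function in as a parameter keeps the paging loop separate from the HTTP details, so the loop can be tried out with a stub before pointing it at a real endpoint. If the endpoint returns HTML fragments rather than JSON, each batch could be parsed with BeautifulSoup instead.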

For most people, though, Selenium will be the right answer.
