Wait for Webpage to fully load before scraping with python requests

Problem description

I'm currently attempting to scrape data from a specific page on LinkedIn. I have a script that is able to log into LinkedIn, but I run into a snag when I try to access the page containing the data. When I call requests.get(data_url), I end up with the HTML for the LinkedIn loading screen that is displayed before LinkedIn loads the actual page content. Is there a way to make requests wait for LinkedIn to display the site data before actually scraping the HTML? I essentially need to let the page fully render before I can 'get' the contents. My current script is below.

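import requests
from bs4 import BeautifulSoup

# A Session keeps cookies between requests, so the login carries over.
client = requests.Session()

HOMEPAGE_URL = 'https://www.linkedin.com'
LOGIN_URL = 'https://www.linkedin.com/uas/login-submit'

# Fetch the homepage and read the CSRF token from the login form.
html = client.get(HOMEPAGE_URL).content
soup = BeautifulSoup(html, 'html.parser')
csrf = soup.find(id="loginCsrfParam-login")['value']

login_information = {
    'session_key': 'EMAIL',
    'session_password': 'PASSWORD',
    'loginCsrfParam': csrf,
}

client.post(LOGIN_URL, data=login_information)

# data_url is the page that holds the data; it is not defined in this snippet.
r = client.get(data_url)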

Recommended answer

If any part of the web page is rendered dynamically, for example using JavaScript, BeautifulSoup might not be able to work with it: requests only downloads the HTML the server sends and does not run scripts, so content added afterwards never reaches the parser.
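To make that concrete, here is a minimal sketch of what a static parser sees. The LOADING_PAGE string is a made-up stand-in for a loading-screen response, not LinkedIn's real markup, and the standard-library html.parser is used in place of BeautifulSoup so the example has no dependencies. The element with id "data" exists only inside the script text, so the parser never reports it.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a "loading screen" response: the real content
# only exists inside a script that a browser would run.
LOADING_PAGE = """
<html><body>
<div id="app">Loading...</div>
<script>
document.getElementById("app").innerHTML = '<p id="data">42 connections</p>';
</script>
</body></html>
"""


class IdCollector(HTMLParser):
    """Record the id attribute of every element the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def collect_ids(html):
    """Return the set of element ids present in the static HTML."""
    parser = IdCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


# The script body is treated as plain text, so "data" is never found.
print(collect_ids(LOADING_PAGE))
```

A browser would run the script and create the "data" element; a plain HTTP client plus a parser never will, which matches the loading-screen HTML described in the question.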

I use Selenium + PhantomJS. I load the page, wait for it to fully load, and then enter the login details. Selenium has a nice API that lets you programmatically check for specific HTML elements and wait for them to appear, which is very useful in cases like this.
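The waiting part might look roughly like the sketch below. It is an illustration under assumptions, not the answerer's actual script: it swaps PhantomJS for headless Chrome (as far as I know, PhantomJS support has been removed from current Selenium releases), the function name fetch_rendered_html and its parameters are made up for this example, and the login step is left out. It relies on Selenium's WebDriverWait together with expected_conditions.presence_of_element_located.

```python
def fetch_rendered_html(url, ready_selector, timeout=15):
    """Load url in headless Chrome and return the HTML once ready_selector exists.

    Hypothetical helper for illustration. Selenium is imported inside the
    function so that defining it does not require Selenium to be installed.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until an element matching ready_selector is in the DOM.
        # Raises selenium.common.exceptions.TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ready_selector))
        )
        # page_source reflects the DOM after scripts have run.
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

A call such as fetch_rendered_html(data_url, "#content") would then return the rendered HTML, which can be passed to BeautifulSoup as before. The selector "#content" is only a placeholder; pick an element that appears only after the real page content has loaded. Filling in the login form would go between driver.get and the wait.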
