Unable to scrape via selenium in python because of infinite page load


Problem description


I am trying to extract the contents of some of the news articles. Some of the urls required logging in in order to access the full content. I decided to use selenium to automate logging in. However, I am not able to extract contents because the first url takes forever to load and never reaches the point where actual text extraction is done. It ends up throwing timeout exception.

Here is my code

for url in url_list:
    chrome_options = Options()
    ua = UserAgent()
    userAgent = ua.random
    chrome_options.add_argument(f'user-agent={userAgent}')  # was options.add_argument, but `options` is undefined
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    driver.get(url)
    time.sleep(5)
    frame = driver.find_elements_by_xpath('//iframe[@id="wallIframe"]')
    # Some articles require going through a paywall and some don't
    if len(frame) == 0:
        text_element = driver.find_elements_by_xpath('//section[@id="main-content"]//article//p')
        text = " ".join(x.text for x in text_element)  # was `element`, which is undefined here
    else:
        text = log_in(frame)
    driver.quit()

Although the code never reaches it, here is my log_in method

def log_in(frame):
    driver.switch_to.frame(frame[0])
    driver.find_element_by_id("PAYWALL_V2_SIGN_IN").click()
    time.sleep(2)
    driver.find_elements_by_id("username")[0].send_keys(username)
    time.sleep(2)
    driver.find_elements_by_xpath('//button[text()="Continue"]')[0].click()
    time.sleep(1)
    driver.find_elements_by_id("password")[0].send_keys(password)
    time.sleep(1)
    element = driver.find_elements_by_xpath('//button[@type="submit"]')[0]
    element.click()  # click() returns None, so keep the element reference separately
    time.sleep(1)
    return parse_text(element)  # was missing a return, so the caller always got None


How can I get around this?

Answer


Instead of manually setting the timeout with time.sleep, you should use WebDriverWait along with expected_conditions; this way the action to be done on your element will be performed only when a certain condition is satisfied (for example if the element is visible or if the element is clickable).

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

try:
    frame = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.XPATH, '//iframe[@id="wallIframe"]'))
    )
except TimeoutException:
    print("Element not found.")
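Under the hood, `WebDriverWait.until` simply polls the supplied condition until it returns a truthy value or the timeout expires (Selenium's default poll interval is 0.5 s), which is why it is both faster and more reliable than a fixed `time.sleep`. As a rough, framework-free sketch of that loop (the names `wait_until`, `poll`, and `fake_element_located` are illustrative, not Selenium API):

```python
import time

def wait_until(condition, timeout=30, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Mirrors the basic behaviour of WebDriverWait.until: returns the
    condition's result on success, raises TimeoutError otherwise.
    """
    end = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() > end:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Example: a stand-in condition that "finds the element" on the third poll.
state = {"calls": 0}

def fake_element_located():
    state["calls"] += 1
    return "element" if state["calls"] >= 3 else None

found = wait_until(fake_element_located, timeout=5, poll=0.2)
print(found)
```

The key property is that the loop returns as soon as the condition holds instead of always paying the full sleep, and it fails loudly with a timeout error instead of silently proceeding against a page that never finished loading.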
