Unable to scrape via selenium in python because of infinite page load
Question
I am trying to extract the contents of some news articles. Some of the URLs require logging in to access the full content, so I decided to use selenium to automate the login. However, I am not able to extract the content because the first URL takes forever to load and never reaches the point where the actual text extraction is done; it ends up throwing a timeout exception.
Here is my code:
for url in url_list:
    chrome_options = Options()
    ua = UserAgent()
    userAgent = ua.random
    chrome_options.add_argument(f'user-agent={userAgent}')
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    driver.get(url)
    time.sleep(5)
    frame = driver.find_elements_by_xpath('//iframe[@id="wallIframe"]')
    # Some articles require going through a paywall and some don't
    if len(frame) == 0:
        text_element = driver.find_elements_by_xpath('//section[@id="main-content"]//article//p')
        text = " ".join(x.text for x in text_element)
    else:
        text = log_in(frame)
    driver.quit()
Although the code never reaches it, here is my log_in method:
def log_in(frame):
    driver.switch_to.frame(frame[0])
    driver.find_element_by_id("PAYWALL_V2_SIGN_IN").click()
    time.sleep(2)
    driver.find_elements_by_id("username")[0].send_keys(username)
    time.sleep(2)
    driver.find_elements_by_xpath('//button[text()="Continue"]')[0].click()
    time.sleep(1)
    driver.find_elements_by_id("password")[0].send_keys(password)
    time.sleep(1)
    element = driver.find_elements_by_xpath('//button[@type="submit"]')[0]
    element.click()
    time.sleep(1)
    text = parse_text(element)
    return text
How can I get around this?
Answer
Instead of manually setting the timeout with time.sleep, you should use WebDriverWait along with expected_conditions; this way, the action to be performed on your element will only run once a certain condition is satisfied (for example, when the element is visible or clickable).
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

try:
    frame = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '//iframe[@id="wallIframe"]')))
except TimeoutException:
    print("Element not found.")
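To see why this is more robust than a fixed sleep: WebDriverWait.until simply polls the given condition at a short interval until it returns a truthy value or the timeout expires, so the wait ends as soon as the element appears instead of always burning the full delay. A minimal, browser-free sketch of that polling loop (the names wait_until and fake_element_located are illustrative, not part of selenium):

```python
import time

class TimeoutException(Exception):
    """Raised when the condition never becomes truthy within the timeout."""

def wait_until(condition, timeout=30, poll_frequency=0.5):
    # Poll `condition` until it returns a truthy value, mirroring what
    # WebDriverWait(driver, timeout).until(...) does with an
    # expected_conditions callable.
    end_time = time.time() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.time() > end_time:
            raise TimeoutException(f"condition not met within {timeout} s")
        time.sleep(poll_frequency)

# Example: the "element" only becomes available on the third poll.
state = {"calls": 0}
def fake_element_located():
    state["calls"] += 1
    return "element" if state["calls"] >= 3 else None

print(wait_until(fake_element_located, timeout=5, poll_frequency=0.01))  # prints: element
```

The same pattern could also replace the time.sleep calls inside log_in, e.g. waiting on EC.element_to_be_clickable before each click, so the login flow proceeds as fast as the page allows.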