StaleElementReferenceException even after adding the wait while collecting the data from Wikipedia using web scraping
Question
I am a newbie to web scraping. Pardon my silly mistakes if there are any.
I have been working on a project in which I need a list of movies as my data. I am trying to collect the data from Wikipedia using web scraping.
Below is my code:
def MoviesList(years, driver):
    for year in years:
        driver.implicitly_wait(150)
        year.click()
        table = driver.find_element_by_xpath('/html/body/div[3]/div[3]/div[5]/div[1]/table[2]/tbody')
        movies = table.find_elements_by_xpath('tr/td[1]/i/a')
        for movie in movies:
            print(movie.text)
        driver.back()

years = driver.find_elements_by_partial_link_text('List of Bollywood films of')
del years[:2]
MoviesList(years, driver)
I am trying to get the list of years from this page and store it in the years variable. Then I loop through all the years and try to extract the top-10 movies of each year. See this for reference.
Output:
Tanhaji
Baaghi 3
...
...
Panga
# Top movies of the year 2020
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document (from line year.click())
Expected Output:
Tanhaji
...
...
War # First movie of the year 2019
Saaho
...
...
Vikram Urvashi # Last movie of the year 1920
# Top movies of the year from 2020 to 1920
I have already referred to this and this questions, but in vain. I have tried an Explicit Wait too, but it didn't work.
I am aware of when the error occurs, but I don't know how to handle it other than by adding an implicit or explicit wait.
What am I doing wrong? How can I improve this code to get the desired output?
Any help will be appreciated.
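For background, the usual way to cope with a StaleElementReferenceException, beyond simply waiting, is to stop holding on to a WebElement across a page change and re-locate it on each attempt. A minimal sketch of that retry pattern follows; the helper name retry_on_stale is illustrative (not from the original post), and a generic exception type stands in for Selenium's selenium.common.exceptions.StaleElementReferenceException so the sketch is self-contained:

```python
def retry_on_stale(action, locate, stale_exc, attempts=3):
    """Run action(locate()); if the element has gone stale (stale_exc is
    raised), re-locate it and try again, up to `attempts` times."""
    last_err = None
    for _ in range(attempts):
        try:
            # Re-locate on every attempt so we never reuse a stale reference.
            return action(locate())
        except stale_exc as err:
            last_err = err  # the DOM changed underneath us; look it up again
    raise last_err

# In real Selenium code you would pass StaleElementReferenceException as
# stale_exc, a fresh find_element call as locate, and e.g. a click as action.
```

The point is that the lookup happens inside the loop: a wait alone cannot rescue a reference that was captured before the page was re-rendered.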
Answer
To collect the data from the Wikipedia Lists of Bollywood films using Selenium and Python, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategy:
Note: As a demonstration, this program is restricted to collecting the movies from the Highest worldwide gross section for the previous three (3) years only.
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://en.wikipedia.org/wiki/Lists_of_Bollywood_films")
parent_window = driver.current_window_handle
years = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.PARTIAL_LINK_TEXT, "List of Bollywood films of")))[2:5]]
print(years)
for year in years:
    driver.execute_script("window.open('" + year + "')")
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
    windows_after = driver.window_handles
    new_window = [x for x in windows_after if x != parent_window][0]
    driver.switch_to_window(new_window)
    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table/caption//following::tbody[1]//td/i/a")))])
    driver.close()
    driver.switch_to_window(parent_window)
driver.quit()
Console Output:
['Tanhaji', 'Baaghi 3', 'Street Dancer 3D', 'Shubh Mangal Zyada Saavdhan', 'Malang', 'Chhapaak', 'Love Aaj Kal', 'Jawaani Jaaneman', 'Thappad', 'Panga']
['War', 'Saaho', 'Kabir Singh', 'Uri: The Surgical Strike', 'Bharat', 'Good Newwz', 'Mission Mangal', 'Housefull 4', 'Gully Boy', 'Dabangg 3']
['Sanju', 'Padmaavat', 'Andhadhun', 'Simmba', 'Thugs of Hindostan', 'Race 3', 'Baaghi 2', 'Hichki', 'Badhaai Ho', 'Pad Man']
You can find a couple of relevant detailed discussions in:
- How to open multiple hrefs within a webtable to scrape through selenium
- Web scraping JavaScript-rendered content using Selenium in Python
- Unable to access the remaining elements by xpaths in a loop after accessing the first element - Webscraping Selenium Python
- How to open each product within a website in a new tab for scraping using Selenium through Python