Unable to access the remaining elements by XPath in a loop after accessing the first element - Webscraping Selenium Python
Problem description
I'm trying to scrape data from the sciencedirect website. To automate the scraping, I build a list of XPaths and loop over them to open the journal issues one after the other. When I run the loop, I can't access the remaining elements after opening the first issue. The same approach worked for me on another website, but not on this one.
I would also like to know whether there is a better way to access these elements than this approach.
# Importing libraries
import requests
import os
import json
from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
import time
from time import sleep
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initializing the Chrome webdriver
driver = webdriver.Chrome(executable_path=r"C:/selenium/chromedriver.exe")

# Website to be accessed
driver.get("https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues")

# Generating the list of xpaths to be accessed one after the other
issues = []
for i in range(0, 20):
    docs = str(i)
    for j in range(1, 7):
        sets = str(j)
        con = ("//*[@id=") + ('"') + ("0-accordion-panel-") + (docs) + ('"') + ("]/section/div[") + (sets) + ("]/a")
        issues.append(con)

# Looping to access one issue after the other
for i in issues:
    try:
        hat = driver.find_element_by_xpath(i)
        hat.click()
        sleep(4)
        driver.back()
    except:
        print("no more issues", i)
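For reference, the XPath strings that the nested loops above assemble by concatenation can be produced in a single expression. This is only a sketch of the same string construction with an f-string; `build_issue_xpaths` and its parameter names are hypothetical, and the defaults mirror the `range(0, 20)` and `range(1, 7)` bounds used above. It does not change which elements are actually present on the page.

```python
def build_issue_xpaths(panels=20, links_per_panel=6):
    """Build the same XPath strings as the nested loops, in the same order."""
    return [
        f'//*[@id="0-accordion-panel-{panel}"]/section/div[{link}]/a'
        for panel in range(panels)                  # accordion panels 0..19
        for link in range(1, links_per_panel + 1)   # links 1..6 in each panel
    ]


issues = build_issue_xpaths()
```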
Recommended answer
To scrape data from the sciencedirect website https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues you can perform the following steps:
- First, open all the accordions.
- Then open each issue in an adjacent tab using Ctrl + click().
- Next, switch_to() the newly opened tab and scrape the required contents.
Code block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# executable_path is accepted by Selenium 3 and early Selenium 4 releases;
# later releases expect a Service object instead.
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues')

# Step 1: expand every accordion so that the issue links become visible
accordions = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.accordion-panel.js-accordion-panel>button.accordion-panel-title>span")))
for accordion in accordions:
    ActionChains(driver).move_to_element(accordion).click(accordion).perform()

# Step 2: collect the issue links, then open each one in a new tab with Ctrl + click
issues = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.anchor.js-issue-item-link.text-m span.anchor-text")))
windows_before = driver.current_window_handle
for issue in issues:
    ActionChains(driver).key_down(Keys.CONTROL).click(issue).key_up(Keys.CONTROL).perform()
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
    windows_after = driver.window_handles
    new_window = [x for x in windows_after if x != windows_before][0]
    # Step 3: switch to the new tab, scrape it, close it, and return to the main tab
    driver.switch_to.window(new_window)
    WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a#journal-title>span")))
    print(WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, "//h2"))).get_attribute("innerHTML"))
    driver.close()
    driver.switch_to.window(windows_before)
driver.quit()
Console Output:
Institutions, Governance and Finance in a Globally Connected Environment
Volume 58
Corporate Governance in Multinational Enterprises
.
.
.
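As a possible alternative to clicking and switching tabs, the issue URLs can be read once as plain strings and then loaded directly. Stored strings cannot go stale the way element references do after a navigation. The following is a minimal sketch under stated assumptions, not a tested solution for this site: it assumes the accordions have already been expanded, and that every issue link is an `a` element with the `js-issue-item-link` class and an `href` attribute. `collect_issue_links`, `visit_each`, and `scrape` are hypothetical names. The locator strategy is passed as the string "css selector", which is the value of `By.CSS_SELECTOR`.

```python
def collect_issue_links(driver, selector="a.js-issue-item-link"):
    """Return the unique, non-empty hrefs of the matched links, in page order."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", selector)
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href and href not in hrefs:
            hrefs.append(href)
    return hrefs


def visit_each(driver, hrefs, scrape):
    """Load every URL in turn and collect whatever scrape(driver) returns."""
    results = []
    for href in hrefs:
        driver.get(href)  # plain navigation: no extra tabs, no driver.back()
        results.append(scrape(driver))
    return results
```

With a live driver this might be used as `titles = visit_each(driver, collect_issue_links(driver), lambda d: d.title)`, where the lambda stands in for whatever scraping each issue page needs.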
You can find a couple of relevant detailed discussions in:
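- How to open a link embedded in a webelement of the main tab in a new tab of the same window using Control + Click of Selenium Webdriver
- How to open multiple hrefs within a webtable to scrape through selenium
- WebScraping JavaScript-rendered content using Selenium in Python
- StaleElementReferenceException even after adding the wait while collecting data from Wikipedia using web scraping
- How to open each product within a website in a new tab for scraping using Selenium through Python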