Unable to access the remaining elements by xpaths in a loop after accessing the first element - Webscraping Selenium Python


Problem Description

I am trying to scrape data from the ScienceDirect website. To automate the process, I build a list of XPaths for the journal issues and loop over them, accessing one issue after the other. When I run the loop, I can access the first issue but none of the remaining elements. This approach worked for me on another website, but it does not work on this one.

I would also like to know whether there is a better way to access these elements than this approach.

 #Importing libraries
 import requests
 import os
 import json
 import time
 from time import sleep

 import pandas as pd
 from bs4 import BeautifulSoup
 from selenium import webdriver
 from selenium.webdriver.common.by import By
 from selenium.webdriver.support.ui import WebDriverWait
 from selenium.webdriver.support import expected_conditions as EC

 #initializing the Chrome webdriver
 driver = webdriver.Chrome(executable_path=r"C:/selenium/chromedriver.exe")

 #website to be accessed
 driver.get("https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues")

 #generating the list of xpaths to be accessed one after the other
 issues = []
 for i in range(0, 20):
     for j in range(1, 7):
         con = '//*[@id="0-accordion-panel-' + str(i) + '"]/section/div[' + str(j) + ']/a'
         issues.append(con)

 #looping to access one issue after the other
 for xpath in issues:
     try:
         hat = driver.find_element_by_xpath(xpath)
         hat.click()
         sleep(4)
         driver.back()
     except Exception:
         print("no more issues", xpath)
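As a side note, the XPath list construction above can be written more compactly with an f-string list comprehension; a minimal sketch of the same logic:

```python
# Build the same 120 XPaths (20 accordion panels x 6 issue links each)
# using f-strings instead of manual string concatenation.
issues = [
    f'//*[@id="0-accordion-panel-{i}"]/section/div[{j}]/a'
    for i in range(0, 20)
    for j in range(1, 7)
]
```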

Answer

To scrape data from the ScienceDirect page https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues you can perform the following steps:

  • First, open all the accordions.

  • Then open each issue in an adjacent tab using Ctrl + click().

  • Next, switch_to() the newly opened tab and scrape the required contents.

Code Block:

  from selenium import webdriver
  from selenium.webdriver.common.by import By
  from selenium.webdriver.support.ui import WebDriverWait
  from selenium.webdriver.support import expected_conditions as EC
  from selenium.webdriver.common.action_chains import ActionChains
  from selenium.webdriver.common.keys import Keys

  options = webdriver.ChromeOptions()
  options.add_argument("start-maximized")
  options.add_experimental_option("excludeSwitches", ["enable-automation"])
  options.add_experimental_option('useAutomationExtension', False)
  driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
  driver.get('https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues')
  # open every accordion panel so all issue links are present in the DOM
  accordions = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.accordion-panel.js-accordion-panel>button.accordion-panel-title>span")))
  for accordion in accordions:
      ActionChains(driver).move_to_element(accordion).click(accordion).perform()
  issues = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.anchor.js-issue-item-link.text-m span.anchor-text")))
  windows_before = driver.current_window_handle
  for issue in issues:
      # Ctrl + click opens the issue in a new tab without navigating away
      ActionChains(driver).key_down(Keys.CONTROL).click(issue).key_up(Keys.CONTROL).perform()
      WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
      windows_after = driver.window_handles
      new_window = [x for x in windows_after if x != windows_before][0]
      driver.switch_to.window(new_window)
      WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a#journal-title>span")))
      print(WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, "//h2"))).get_attribute("innerHTML"))
      # close the tab and return to the issues list
      driver.close()
      driver.switch_to.window(windows_before)
  driver.quit()
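The window-handle bookkeeping inside the loop (finding the one handle that was not open before the Ctrl + click) can be isolated into a small helper; a sketch, where `newly_opened` is a hypothetical name:

```python
def newly_opened(before, after):
    """Return the single window handle present in `after` but not in `before`."""
    new = [h for h in after if h not in before]
    if len(new) != 1:
        raise RuntimeError(f"expected exactly one new window, found {len(new)}")
    return new[0]

# Usage inside the loop above would be:
# new_window = newly_opened([windows_before], driver.window_handles)
```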

  • Console Output:

      Institutions, Governance and Finance in a Globally Connected Environment
      Volume 58
      Corporate Governance in Multinational Enterprises
      .
      .
      .
    

  • You can find a couple of relevant detailed discussions in:
