Web scraping with a double loop with selenium and using By.CSS_SELECTOR


Problem description


I am trying to get the pdf files from this website. I am trying to create a double loop so I can scroll over the years (Season) to get all the main pdf located in each year.

The line that is not working is this one; I cannot make it loop over all the years (Season):

for year in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#season a aria-valuetext"))):
 year.click() 
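Editor's note: the reason this line fails is the CSS selector itself. In CSS, a space means "descendant element", so `#season a aria-valuetext` looks for a (non-existent) `<aria-valuetext>` element inside the anchors; matching anchors *by attribute* would be written `#season a[aria-valuetext]`. A minimal stand-alone sketch of the difference, using the standard library's ElementTree (whose XPath-style predicate `[@attr]` plays the same role as CSS `[attr]`) on a made-up fragment -- the markup below is hypothetical, not the real page:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the season picker markup.
html = """
<div id="season">
  <a aria-valuetext="2020">2020</a>
  <a aria-valuetext="2019">2019</a>
</div>
"""
root = ET.fromstring(html)

# Descendant lookup (what "a aria-valuetext" asks for in CSS):
# an <aria-valuetext> element inside an <a> -- nothing matches.
as_tag = root.findall(".//a/aria-valuetext")
print(len(as_tag))  # 0

# Attribute lookup (what "a[aria-valuetext]" would ask for in CSS):
# an <a> carrying the attribute -- both anchors match.
as_attr = root.findall(".//a[@aria-valuetext]")
print([a.get("aria-valuetext") for a in as_attr])  # ['2020', '2019']
```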

This is the full code:

    os.chdir("C:..")
    driver = webdriver.Chrome("chromedriver.exe")
    wait = WebDriverWait(driver, 10)
    driver.get("http://www.motogp.com/en/Results+Statistics/")
    links = []

    for year in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#season a aria-valuetext"))):
        year.click()
        for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#event option"))):
            item.click()
            elem = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "padleft5")))
            print(elem.get_attribute("href"))
            links.append(elem.get_attribute("href"))
            wait.until(EC.staleness_of(elem))

    driver.quit()

This is a previous post where I got help with the code above:

Scraping pdfs from this web

Solution

The solution below should work for you. First, we iterate over the number of years in the CSS slider; then we work through the event list using your code example. I added a sleep command because I kept getting a timeout.

CODE

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome("chromedriver.exe")
wait = WebDriverWait(driver, 10)
driver.get("http://www.motogp.com/en/Results+Statistics/")

# The season picker is a slider widget; grab its drag handle once up front.
slider = driver.find_element_by_xpath('//*[@id="handle_season"]')

# Step through every season position on the slider (68 at the time of writing).
for year in range(68):
    # Wait for the current season's event dropdown to be present.
    wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="event"]')))
    for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#event option"))):
        item.click()
        elem = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "padleft5")))
        print(elem.get_attribute("href"))
        wait.until(EC.staleness_of(elem))

    # Move the slider one season back and give the page time to refresh.
    slider.send_keys(Keys.ARROW_LEFT)
    time.sleep(1)

driver.quit()

Result:
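As a side note on collecting the links: the loop above only prints each href as it goes. To end up with the list of PDF files the question asks for, you would append the hrefs to a list and then filter out non-PDF entries. A minimal sketch of that filtering step -- the helper name `pdf_links` and the example URLs are mine, not from the original code:

```python
def pdf_links(hrefs):
    """Keep only the entries that actually point at a PDF file."""
    return [h for h in hrefs if h and h.lower().endswith(".pdf")]

# Pretend this was collected during the loop via links.append(elem.get_attribute("href")).
# Note: get_attribute() returns None when the attribute is missing, so guard against it.
collected = [
    "http://resources.motogp.com/files/results/2019/classification.pdf",
    None,
    "http://www.motogp.com/en/Results+Statistics/",
]
print(pdf_links(collected))  # only the first entry survives
```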

