WebScraping JavaScript-Rendered Content using Selenium in Python


Question

    I am very new to web scraping and have been trying to use Selenium's functions to simulate a browser accessing the Texas public contracting webpage and then download embedded PDFs. The website is this: http://www.txsmartbuy.com/sp.

    So far, I've successfully used Selenium to select an option in one of the dropdown menus ("Agency Name") and to click the search button. My Python code is listed below.

    import os
    os.chdir("/Users/fsouza/Desktop") #Setting up directory
    
    from bs4 import BeautifulSoup #Downloading pertinent Python packages
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.by import By
    
    chromedriver = "/Users/fsouza/Desktop/chromedriver" #Setting up Chrome driver
    driver = webdriver.Chrome(executable_path=chromedriver)
    driver.get("http://www.txsmartbuy.com/sp")
    delay = 3 #Seconds
    
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, "//select[@id='agency-name-filter']/option[69]")))    
    health = driver.find_element_by_xpath("//select[@id='agency-name-filter']/option[68]")
    health.click()
    search = driver.find_element_by_id("spBtnSearch")
    search.click()
    

    Once I get to the results page, I get stuck.

    First, I can't access any of the resulting links using the html page source. But if I manually inspect individual links in Chrome, I do find the pertinent tags (<a href...) relating to individual results. I'm guessing this is because of JavaScript-rendered content.

    Second, even if Selenium were able to see these individual tags, they have no class or id. The best way to call them, I think, would be to call the <a> tags in the order they appear on the page (see code below), but this didn't work either. Instead, the call hits some other 'visible' tag (something in the footer, which I don't need).

    Third, assuming these things did work, how can I figure out the number of <a> tags showing on the page (in order to loop this code over and over, once for every single result)?

    driver.execute_script("document.getElementsByTagName('a')[27].click()")
    

    I would appreciate your attention to this, and please excuse any stupidity on my part, considering that I'm just starting out.
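
    On the third question, the count does not need to be guessed: with a live driver it is simply `len(driver.find_elements_by_tag_name("a"))` once the page has rendered. The same counting idea can be sketched offline using only the standard library (the `AnchorCounter` class and sample HTML below are invented for illustration, not part of the original question):

    ```python
    from html.parser import HTMLParser

    class AnchorCounter(HTMLParser):
        """Counts <a> start tags in an HTML fragment."""
        def __init__(self):
            super().__init__()
            self.count = 0

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.count += 1

    # Offline demonstration on a sample fragment:
    sample = '<table><tr><td><a href="/doc1">one</a></td><td><a href="/doc2">two</a></td></tr></table>'
    parser = AnchorCounter()
    parser.feed(sample)
    print(parser.count)  # 2

    # With a live driver, the equivalent would be:
    #     links = driver.find_elements_by_tag_name("a")
    #     print(len(links))
    ```

    The count then gives the loop bound for visiting each result in turn.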

    Solution

    To scrape the JavaScript-rendered content using Selenium you need to:

    • Induce WebDriverWait until the desired element is clickable.

    • Induce WebDriverWait for the visibility of all the result elements.

    • Open each link in a new tab with Ctrl + click() through ActionChains.

    • Induce WebDriverWait for the new window, then switch to the new tab to scrape it.

    • Switch back to the main page.

    • Code Block:

        from selenium import webdriver
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.common.action_chains import ActionChains
        from selenium.webdriver.common.keys import Keys
        import time
      
        options = webdriver.ChromeOptions() 
        options.add_argument("start-maximized")
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)
        driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
        driver.get("http://www.txsmartbuy.com/sp")
        windows_before = driver.current_window_handle
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='agency-name-filter' and @name='agency-name']"))).click()
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='agency-name-filter' and @name='agency-name']//option[contains(., 'Health & Human Services Commission - 529')]"))).click()
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@id='spBtnSearch']/i[@class='icon-search']"))).click()
        for link in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table/tbody//tr/td/strong/a"))):
            ActionChains(driver).key_down(Keys.CONTROL).click(link).key_up(Keys.CONTROL).perform()
            WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
            windows_after = driver.window_handles
            new_window = [x for x in windows_after if x != windows_before][0]
            driver.switch_to.window(new_window)
            time.sleep(3)
            print("Focus on the newly opened tab and here you can scrape the page")
            driver.close()
            driver.switch_to.window(windows_before)
        driver.quit()
      

    • Console Output:

        Focus on the newly opened tab and here you can scrape the page
        Focus on the newly opened tab and here you can scrape the page
        Focus on the newly opened tab and here you can scrape the page
        .
        .
      

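    The print() inside the loop marks the point where the actual scraping would go. Since the original goal was downloading embedded PDFs, one approach is to parse driver.page_source in the new tab for hrefs ending in .pdf. A minimal standard-library sketch (the PdfLinkCollector helper and the sample HTML are assumptions for illustration, not part of the answer above):

    ```python
    from html.parser import HTMLParser

    class PdfLinkCollector(HTMLParser):
        """Collects href values ending in .pdf from an HTML document."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value and value.lower().endswith(".pdf"):
                        self.links.append(value)

    # In the loop above, after switching to the new tab, one could do:
    #     collector = PdfLinkCollector()
    #     collector.feed(driver.page_source)
    #     for url in collector.links:
    #         ...  # download each PDF

    # Offline demonstration with an invented fragment:
    sample = '<a href="/files/contract.pdf">PDF</a><a href="/about">About</a>'
    collector = PdfLinkCollector()
    collector.feed(sample)
    print(collector.links)  # ['/files/contract.pdf']
    ```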


