BeautifulSoup select all href in some element with specific class

Question

I'm trying to scrape images from this website. I tried with Scrapy (using Docker) and with Scrapy/Selenium. Scrapy doesn't seem to work on Windows 10 Home, so I'm now trying Selenium/BeautifulSoup. I'm using Python 3.6 with Spyder in an Anaconda env.

This is what the href elements I need look like:

<a class="emblem" href="detail/emblem/av1615001">

I have two major problems:
- How should I select the href with BeautifulSoup? Below, in my code, you can see what I tried (but it didn't work).
- As you can see, the href is only a partial path to the URL... how should I deal with this?
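
A note on both points: `select()` takes a single CSS selector string, so an attribute dict as the second argument is ignored (that calling form belongs to `find_all`); the selector `a.emblem[href]` matches anchors with class `emblem` that carry an `href`. For the partial path, `urllib.parse.urljoin` resolves a relative href against a base URL. A minimal sketch on a static snippet, assuming the site root is the right base (which matches the links in the answer below):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = '<a class="emblem" href="detail/emblem/av1615001">emblem</a>'
base = "http://emblematica.grainger.illinois.edu/"

soup = BeautifulSoup(html, "html.parser")
# CSS selector: <a> tags with class "emblem" that also have an href attribute
links = [urljoin(base, a["href"]) for a in soup.select("a.emblem[href]")]
print(links)
# ['http://emblematica.grainger.illinois.edu/detail/emblem/av1615001']
```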

Here is my code so far:

from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import urllib 
import requests
from os.path  import basename


def start_requests(self):
        self.driver = webdriver.Firefox("C:/Anaconda3/envs/scrapy/selenium/webdriver")
        #programPause = input("Press the <ENTER> key to continue...")
        self.driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
        html = self.driver.page_source

        #html = requests.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
        soup = BeautifulSoup(html, "html.parser")        
        emblemshref = soup.select("a", {"class" : "emblem", "href" : True})

        for href in emblemshref:
            link = href["href"]
            with open(basename(link)," wb") as f:
                f.write(requests.get(link).content)

        #click on "next>>"         
        while True:
            try:
                next_page = self.driver.find_element_by_xpath("//a[@id='next']")
                sleep(3)
                self.logger.info('Sleeping for 3 seconds')
                next_page.click()

                #here again the same emblemshref loop 

            except NoSuchElementException:
                #execute next on the last page
                self.logger.info('No more pages to load') 
                self.driver.quit()
                break 

Answer

Try this. It will give you all the URLs, traversing all the pages on that site. I've used an explicit wait to make it faster and more reliable.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = "http://emblematica.grainger.illinois.edu/"
wait = WebDriverWait(driver, 10)
driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".emblem")))

while True:
    # parse the current page and collect the emblem links
    soup = BeautifulSoup(driver.page_source, "lxml")
    for item in soup.select('.emblem'):
        links = url + item['href']
        print(links)

    try:
        # click "next" and wait until the old link goes stale (page has reloaded)
        link = driver.find_element_by_id("next")
        link.click()
        wait.until(EC.staleness_of(link))
    except Exception:
        # no "next" link on the last page
        break
driver.quit()

Partial output:

http://emblematica.grainger.illinois.edu/detail/emblem/av1615001
http://emblematica.grainger.illinois.edu/detail/emblem/av1615002
http://emblematica.grainger.illinois.edu/detail/emblem/av1615003
