How to handle InvalidSchema exception


Problem description

I've written a script in Python with two functions. The first function get_links() fetches some links from a webpage and passes them to another function get_info(). At that point get_info() should produce the different shop names from the different links, but instead it throws an error: raise InvalidSchema("No connection adapters were found for '%s'" % url).

This is my attempt:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    elem = soup.select(".info h2 a[data-analytics]")
    return get_info(elem)  # elem is a list of <a> tags, not a URL -> raises InvalidSchema

def get_info(url):
    response = requests.get(url)
    print(response.url)
    soup = BeautifulSoup(response.text,"lxml")
    return soup.select_one("#main-header .sales-info h1").get_text(strip=True)

if __name__ == '__main__':
    link = 'https://www.yellowpages.com/search?search_terms=%20Injury%20Law%20Attorneys&geo_location_terms=California&page=2'    
    for review in get_links(link):
        print(urljoin(link,review.get("href")))

The key thing I'm trying to learn here is return get_info(elem).

I created another thread concerning this return get_info(elem). Link to that thread.

When I try it like the following, I get the results as expected:

def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    elem = soup.select(".info h2 a[data-analytics]")
    return elem

def get_info(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    return soup.select_one("#main-header .sales-info h1").get_text(strip=True)

if __name__ == '__main__':
    link = 'https://www.yellowpages.com/search?search_terms=%20Injury%20Law%20Attorneys&geo_location_terms=California&page=2'    
    for review in get_links(link):
        print(get_info(urljoin(link,review.get("href"))))

My question: how can I get the results the way I tried in my first script, making use of return get_info(elem)?

Accepted answer

Inspect what is returned by each function. In this case, the first script will never work: get_info() takes a URL, not anything else. So you will inevitably hit an error when you call get_info(elem), because elem is a list of elements selected by soup.select().
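To see why the exception fires, here is a minimal sketch (the ftp:// URL is only an illustrative value): requests raises InvalidSchema as soon as it cannot find a connection adapter for the URL string, before any network traffic happens. Per the asker's traceback, the str() of a list of tags trips the same check.

```python
import requests

# requests only ships connection adapters for http:// and https://.
# Any other prefix fails the adapter lookup before a single byte
# goes over the wire, raising InvalidSchema.
try:
    requests.get("ftp://example.com/file.txt")
except requests.exceptions.InvalidSchema as exc:
    print(exc)
```

Because the check happens during request preparation, this snippet needs no network access at all.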

You should already know this, though, because in your second script you iterate over the returned list to get the href of each element. So if you want to use get_info in your first script, apply it to the individual items rather than the list; a list comprehension works well in this case.

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    elem = soup.select(".info h2 a[data-analytics]")
    # Resolve each relative href against the page URL, then fetch the shop name
    return [get_info(urljoin(url, e.get("href"))) for e in elem]

def get_info(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    return soup.select_one("#main-header .sales-info h1").get_text(strip=True)

link = 'https://www.yellowpages.com/search?search_terms=%20Injury%20Law%20Attorneys&geo_location_terms=California&page=2'

for review in get_links(link): 
    print(review) 

Now you know the first function still returns a list, but with get_info applied to each of its elements, which is how it should work, right? get_info accepts a URL, not a list. And since urljoin and get_info are already applied inside get_links, you can simply loop over the result to print the names.
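If some scraped hrefs turn out not to be absolute http(s) URLs, you can filter them out before calling get_info instead of letting the exception crash the loop. A minimal sketch; the helper name is_http_url is my own, not from the original answer:

```python
from urllib.parse import urlparse

def is_http_url(url):
    """Return True only for URLs requests has a connection adapter for."""
    scheme = urlparse(str(url)).scheme
    return scheme in ("http", "https")

# Skip anything that would raise InvalidSchema / MissingSchema.
links = ["https://www.yellowpages.com/", "ftp://example.com", "not-a-url", None]
print([u for u in links if is_http_url(u)])  # ['https://www.yellowpages.com/']
```

The same guard could sit inside get_links as a condition in the list comprehension, so malformed hrefs are silently skipped.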
