requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied

Problem Description

I am working on a web scraping project and have run into the following error.

requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

Below is my code. I retrieve all of the links from the html table and they print out as expected. But when I try to loop through them (links) with requests.get I get the error above.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = []
        # Find all the divs we need in one go.
        divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
        for div in divs:
            # find all the enclosing a tags.
            anchors = div.find_all('a')
            for anchor in anchors:
                # Now we have groups of 3 list items (li) tags
                lis = anchor.find_all('li')
                # we clean up the text from the group of 3 li tags and add them as a list to our table list.
                table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
        # We have all the data so we add it to a DataFrame.
        headers = ['Number', 'Tenant', 'Square Footage']
        df = DataFrame(table, columns=headers)
        print (df)

Recommended Answer

Your mistake is the second for loop in your code:

for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:

ref['href'] gives you a single URL, but you use it as a list in the next for loop.

So you have

for link in ref['href']:

and it gives you the first character of the URL http://properties.kimcore..., which is h.
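To see why the error complains about 'h', note that iterating over a string in Python yields its individual characters. A minimal sketch of what happens (using the URL from the question):

import requests

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

# Iterating over a string yields single characters, not URLs.
for link in url:
    print(link)   # the first iteration prints 'h'
    break

# Passing that single character to requests.get() reproduces the error,
# because 'h' has no scheme such as http://
try:
    requests.get("h")
except requests.exceptions.MissingSchema as error:
    print(error)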

Full working code:

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = []
    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list items (li) tags
            lis = anchor.find_all('li')
            # we clean up the text from the group of 3 li tags and add them as a list to our table list.
            table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
    # We have all the data so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(table, columns=headers)
    print (df)

BTW: if you add a comma, as in (ref['href'], ), then you get a tuple, and the second for loop works correctly.
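A small self-contained sketch of that difference (the variable names are just for illustration):

href = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

# Without the trailing comma the parentheses do nothing - this is still a string,
# so looping over it yields single characters like 'h'.
links = (href)
print(type(links))    # <class 'str'>

# With the trailing comma it becomes a one-element tuple,
# so looping over it yields the whole URL once.
links = (href,)
print(type(links))    # <class 'tuple'>
for link in links:
    print(link)       # the full URL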

It creates the list table_data at the start and adds all of the data into this list, then converts it into a DataFrame at the end.

But now I see that it reads the same page several times, because in every row the same URL appears in every column. You only need to take the URL from one column.

Now it doesn't read the same URL many times.

Now it gets the text and href from the first link and adds them to every element appended to the list.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# all rows in table except first ([1:]) - headers
rows = soup.select('table tr')[1:]
for row in rows: 

    # link in the first column (td[0])
    #link = row.select('td')[0].find('a')
    link = row.find('a')

    link_href = link['href']
    link_text = link.text

    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

    print('table_data size:', len(table_data))            

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)
