How to get full web address with BeautifulSoup


Problem description


I cannot find out how to get the full address of a web page: I get, for example, "/wiki/Main_Page" instead of "https://en.wikipedia.org/wiki/Main_Page". I cannot simply prepend the page URL to the link, as that would give "https://en.wikipedia.org/wiki/WKIK/wiki/Main_Page", which is incorrect. My goal is to make it work for any website, so I am looking for a general solution.

Here is the code:

from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/WKIK"
r = requests.get(url)
data = r.text
# Pass an explicit parser to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(data, "html.parser")

for link in soup.find_all('a', href=True):
    print("Found the URL:", link['href'])


Here is a part of what it returns:

>Found the URL: /wiki/WKIK_(AM)
>Found the URL: /wiki/WKIK-FM
>Found the URL: /wiki/File:Disambig_gray.svg
>Found the URL: /wiki/Help:Disambiguation
>Found the URL: //en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/WKIK&namespace=0

Recommended answer

When you take links from an element's href attribute, you will almost always get a relative link like /wiki/Main_Page.

That is because, for any one site, the base URL is always the same: 'https://en.wikipedia.org'. So what you need to do is:

from bs4 import BeautifulSoup
import requests

base_url = 'https://en.wikipedia.org'
search_url = "https://en.wikipedia.org/wiki/WKIK"
r = requests.get(search_url)
data = r.content
soup = BeautifulSoup(data, "html.parser")

for link in soup.find_all('a', href=True):
    print("Found the URL:", link['href'])
    # Skip in-page anchors and empty hrefs before joining
    if link['href'] != '#' and link['href'].strip() != '':
        final_url = base_url + link['href']
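Since the question asks for a solution that works on any website, it is worth noting that plain string concatenation mishandles protocol-relative links like the //en.wikipedia.org/w/index.php?... entry in the output above (it would produce https://en.wikipedia.org//en.wikipedia.org/...). The standard library's urllib.parse.urljoin resolves relative, root-relative, and protocol-relative hrefs against the page URL the way a browser would; a minimal sketch:

```python
from urllib.parse import urljoin

# The page the links were scraped from
page_url = "https://en.wikipedia.org/wiki/WKIK"

# Root-relative href: joined onto the site root
print(urljoin(page_url, "/wiki/Main_Page"))
# https://en.wikipedia.org/wiki/Main_Page

# Protocol-relative href: inherits the page's scheme
print(urljoin(page_url, "//en.wikipedia.org/w/index.php"))
# https://en.wikipedia.org/w/index.php

# Absolute hrefs pass through unchanged
print(urljoin(page_url, "https://example.com/x"))
# https://example.com/x
```

In the scraping loop above, this amounts to replacing the concatenation with `final_url = urljoin(search_url, link['href'])`, with no need for a hand-maintained base_url.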
