How to get the full web address with BeautifulSoup
Question
I cannot find how to get the full address of a web page: I get, for example, "/wiki/Main_Page" instead of "https://en.wikipedia.org/wiki/Main_Page". I cannot simply concatenate the page URL with the link, as that would give "https://en.wikipedia.org/wiki/WKIK/wiki/Main_Page", which is incorrect. My goal is to make this work for any website, so I am looking for a general solution.
Here is the code:
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/WKIK"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")  # name a parser explicitly to avoid a warning
for link in soup.find_all('a', href=True):
    print("Found the URL:", link['href'])
Here is a part of what it returns:
>Found the URL: /wiki/WKIK_(AM)
>Found the URL: /wiki/WKIK-FM
>Found the URL: /wiki/File:Disambig_gray.svg
>Found the URL: /wiki/Help:Disambiguation
>Found the URL: //en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/WKIK&namespace=0
Recommended answer
When you take links from an element's href attribute, you will almost always get a relative link like /wiki/Main_Page. That is because the base URL is always the same: 'https://en.wikipedia.org'. So what you need to do is prepend it:
from bs4 import BeautifulSoup
import requests

base_url = 'https://en.wikipedia.org'
search_url = "https://en.wikipedia.org/wiki/WKIK"
r = requests.get(search_url)
data = r.content
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a', href=True):
    print("Found the URL:", link['href'])
    # skip empty hrefs and bare in-page anchors before joining
    if link['href'] != '#' and link['href'].strip() != '':
        final_url = base_url + link['href']
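Since the question asks for a solution that works on any website, note that plain string concatenation still breaks on protocol-relative links such as the //en.wikipedia.org/w/index.php?... entry in the output above, and on relative paths like WKIK-FM. A more general sketch (not part of the original answer) uses urljoin from the standard library, which resolves root-relative, protocol-relative, and plain relative hrefs against the page URL:

```python
from urllib.parse import urljoin  # standard library, Python 3

search_url = "https://en.wikipedia.org/wiki/WKIK"

# urljoin resolves each kind of href against the page it was found on:
hrefs = [
    "/wiki/Main_Page",                  # root-relative path
    "//en.wikipedia.org/w/index.php",   # protocol-relative link
    "WKIK-FM",                          # relative to the current directory
    "https://example.org/other",        # already absolute, left unchanged
]
for href in hrefs:
    print(urljoin(search_url, href))
```

In the scraping loop this becomes final_url = urljoin(search_url, link['href']), with no hard-coded base_url needed.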