Scrape the absolute URL instead of a relative path in Python

Views: 742 Published: 2020/5/8 1:02:15 Tags: python beautifulsoup mechanize


Question

I'm trying to get all the hrefs from an HTML page and store them in a list for later processing. For example:

Example URL: www.example-page-xl.com

 <body>
    <section>
    <a href="/helloworld/index.php"> Hello World </a>
    </section>
 </body>

I'm using the following code to list the hrefs:

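import bs4 as bs
import urllib.request

# Fetch the page and parse it with the lxml parser
sauce = urllib.request.urlopen('https://www.example-page-xl.com').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

section = soup.section

# Print the raw href of every link inside the first <section>
for url in section.find_all('a'):
    print(url.get('href'))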

However, I would like to store the URL as www.example-page-xl.com/helloworld/index.php and not just the relative path, which is /helloworld/index.php.

I'd rather not append the relative path to the URL myself, since dynamic links may vary when I join the URL and the relative path.

In a nutshell, I would like to scrape the absolute URL rather than the relative path alone (and without joining them by hand).

Recommended answer

In this case urljoin helps you. In Python 3 it lives in urllib.parse (in Python 2 it was urlparse.urljoin). You should modify your code like this:

import bs4 as bs
import urllib.request
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

web_url = 'https://www.example-page-xl.com'
sauce = urllib.request.urlopen(web_url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

section = soup.section

# Resolve each href against the page URL before printing it
for url in section.find_all('a'):
    print(urljoin(web_url, url.get('href')))

Here urljoin handles both cases: an href that is already absolute is returned unchanged, and a relative href is resolved against web_url.
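Since the question asks for the links to be stored in a list, here is a minimal sketch of how urljoin treats different kinds of href. The helper name absolutize, the base URL with a /section/page.html path, and the sample hrefs are all made up for illustration; only urljoin itself comes from the answer.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Return absolute URLs for every non-empty href (hypothetical helper)."""
    # Skip None or empty values, e.g. <a> tags that have no href attribute.
    return [urljoin(base_url, href) for href in hrefs if href]


# Assumed page URL; the path part matters for hrefs without a leading slash.
base = "https://www.example-page-xl.com/section/page.html"

hrefs = [
    "/helloworld/index.php",        # root-relative: replaces the whole path
    "other.html",                   # document-relative: replaces the last segment
    "https://other.example.org/x",  # already absolute: returned unchanged
    None,                           # what url.get('href') gives for a missing href
]

result = absolutize(base, hrefs)
for link in result:
    print(link)
```

In the scraper above, the hrefs list would come from [a.get('href') for a in section.find_all('a')]. Passing the URL of the page that was actually fetched as the base is what lets document-relative links resolve correctly.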
