Python BeautifulSoup: Extract specific URLs

Question

Is it possible to get only specific URLs?

For example, given:

<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>

The output should contain only the URLs from http://www.iwashere.com/:

http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html

I did it with string logic. Is there a more direct way to do this with BeautifulSoup?

Recommended answer

You can match on multiple aspects of a tag, including using a regular expression for an attribute value:

import re
# `soup` is an existing BeautifulSoup object; ^ anchors the match at the start of the href value
soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/'))

which matches (for example):

[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]

So this matches any <a> tag that has an href attribute whose value starts with the string http://www.iwashere.com/.

You can loop over the results and pick out just the href attribute:

>>> for elem in soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/')):
...     print(elem['href'])
...
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html
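
For reference, the pieces above can be assembled into one self-contained sketch. It assumes the bs4 package is installed and uses the standard-library html.parser backend; the names html, pattern and urls are illustrative and do not come from the original answer.

```python
import re
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

soup = BeautifulSoup(html, "html.parser")

# ^ anchors the pattern so only href values that start with the prefix match.
pattern = re.compile(r"^http://www\.iwashere\.com/")

# Collect the href value of every matching <a> tag.
urls = [a["href"] for a in soup.find_all("a", href=pattern)]

for url in urls:
    print(url)
```

Collecting the values into a list first keeps the filtering step separate from whatever is done with the URLs afterwards.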

To match all relative paths instead, use a negative look-ahead assertion that tests whether the value does not start with a scheme (e.g. http: or mailto:) or a double slash (//hostname/path); any value that passes this test must be a relative path:

soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
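
To see what this pattern accepts, it can be tried against plain strings using only the standard-library re module. The sample href values below are made up for illustration and are not from the original answer.

```python
import re

# Pattern from the answer: reject values that start with a scheme
# (letters followed by ':', such as "http:" or "mailto:") or with "//".
relative = re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')

# Hypothetical href values covering each case.
samples = [
    "/about.html",
    "images/logo.png",
    "../up.html",
    "http://www.iwashere.com/washere.html",
    "mailto:someone@example.com",
    "//cdn.example.com/lib.js",
]

# Keep only the values the pattern treats as relative paths.
relative_only = [s for s in samples if relative.search(s)]
print(relative_only)
```

Here the first three samples are kept, while the absolute URL, the mailto: link and the protocol-relative //cdn.example.com value are rejected.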
