Python BeautifulSoup: Extract specific URLs

Problem description

Is it possible to get only specific URLs?

For example, given this HTML:

<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>

The output should contain only the URLs from http://www.iwashere.com/, like this:

http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html

I did it with string logic. Is there a direct way to do this with BeautifulSoup?

Solution

You can match multiple aspects, including using a regular expression for the attribute value:

import re
# soup is a BeautifulSoup object built from the HTML in the question
soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/'))

which matches (for your example):

[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]

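That is, any <a> tag whose href attribute matches the pattern. BeautifulSoup applies the regular expression with its search() method, so an unanchored pattern matches the string anywhere in the value; start the pattern with ^ to match only values that begin with http://www.iwashere.com/.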
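
The effect of anchoring the pattern can be checked with the re module alone. In this minimal sketch the two href values are invented for illustration, and search() is called directly, which is how BeautifulSoup applies a compiled pattern to an attribute value:

```python
import re

# Hypothetical href values, used only for illustration.
hrefs = [
    "http://www.iwashere.com/washere.html",
    "http://www.heelo.com/redirect?to=http://www.iwashere.com/",
]

unanchored = re.compile("http://www.iwashere.com/")
anchored = re.compile(r"^http://www\.iwashere\.com/")

# search() finds the pattern anywhere in the string, so both values match.
unanchored_hits = [h for h in hrefs if unanchored.search(h)]

# With a leading ^, only values that begin with the prefix match.
anchored_hits = [h for h in hrefs if anchored.search(h)]

print(unanchored_hits)
print(anchored_hits)
```

Escaping the dots (\.) is optional for this example, but it keeps the pattern from matching other characters in those positions.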

You can loop over the results and pick out just the href attribute:

>>> for elem in soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/')):
...     print(elem['href'])
...
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html
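
For completeness, here is a self-contained sketch that builds the soup from the HTML in the question and collects the matching URLs. The choice of the html.parser backend and the variable names are assumptions, not part of the original answer. The last lines show a different technique, a CSS attribute-prefix selector passed to select(), which should give the same result without a regular expression:

```python
import re
from bs4 import BeautifulSoup

html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

# html.parser ships with Python; any other installed parser should work too.
soup = BeautifulSoup(html, "html.parser")

# Keep only <a> tags whose href starts with the wanted prefix.
pattern = re.compile(r"^http://www\.iwashere\.com/")
urls = [a["href"] for a in soup.find_all("a", href=pattern)]
print(urls)

# Alternative: CSS selector matching href values by prefix (^=).
css_urls = [a["href"] for a in soup.select('a[href^="http://www.iwashere.com/"]')]
print(css_urls)
```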

To match all relative paths instead, use a negative look-ahead assertion that checks that the value does not start with a scheme (e.g. http: or mailto:) or a double slash (//hostname/path); any value that passes this check must be a relative path:

soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
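
The look-ahead pattern can be tried on its own before passing it to find_all(). This is a rough sketch using only the re module; the candidate href values are hypothetical and chosen to cover each case the pattern distinguishes:

```python
import re

# Same pattern as above: reject values that start with "scheme:" or "//".
relative = re.compile(r"^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))")

# Hypothetical href values, used only for illustration.
candidates = [
    "http://www.iwashere.com/washere.html",  # starts with a scheme
    "mailto:someone@example.com",            # starts with a scheme
    "//www.iwashere.com/washere.html",       # starts with a double slash
    "/washere.html",                         # relative to the site root
    "pages/wasnot.html",                     # relative to the current page
]

# Keep the values the pattern accepts, i.e. the relative ones.
relative_only = [h for h in candidates if relative.search(h)]
print(relative_only)
```

Only the last two candidates pass the check, since neither begins with a scheme or with //.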
