Python BeautifulSoup: Extract specific URLs
Is it possible to get only specific URLs?
For example, given this HTML:
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
The output should contain only the URLs from http://www.iwashere.com/, that is:
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html
I did it with string logic. Is there a direct way to do this with BeautifulSoup?
You can match on multiple aspects of a tag, including using a regular expression for an attribute value:
import re
soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/'))
which matches (for your example):
[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]
That is, any <a> tag with an href attribute whose value starts with the string http://www.iwashere.com/. BeautifulSoup applies the pattern with search(), so the leading ^ is what restricts the match to the start of the value; without it, the pattern would match the string anywhere in the href.
You can loop over the results and pick out just the href attribute:
>>> for elem in soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/')):
...     print(elem['href'])
...
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html
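For completeness, here is a minimal self-contained sketch of the same idea that can be run as a script. It assumes the bs4 package is installed and uses the standard-library "html.parser" backend; the variable names (html, prefix, urls) are only illustrative, and the markup is the sample from the question.

```python
import re
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

soup = BeautifulSoup(html, "html.parser")

# The pattern is applied with search(), so ^ anchors it to the start of the
# href value; the dots are escaped so they only match literal dots.
prefix = re.compile(r"^http://www\.iwashere\.com/")

# Collect just the href values of the matching <a> tags.
urls = [a["href"] for a in soup.find_all("a", href=prefix)]
print(urls)
```

Building a list rather than printing inside the loop makes the result easier to reuse, for instance to fetch each page afterwards.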
To match all relative paths instead, use a negative look-ahead assertion that tests whether the value does not start with a scheme (e.g. http: or mailto:) or a double slash (//hostname/path); any value that passes the test must be a relative path:
soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
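To see what that pattern accepts and rejects, it can be tried on a few strings with the re module alone. This is only a quick check; the sample values below are made up for illustration and are not from the question.

```python
import re

# Pattern from the answer: reject values that begin with a scheme
# (letters, digits, +, . or - followed by a colon) or with a double slash.
relative = re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')

# Hypothetical href values used only to exercise the pattern.
samples = [
    "page.html",
    "/docs/index.html",
    "../up.html",
    "http://www.iwashere.com/washere.html",
    "mailto:someone@example.com",
    "//hostname/path",
]

# search() is how BeautifulSoup applies a compiled pattern to an attribute value.
matches = [s for s in samples if relative.search(s)]
print(matches)
```

Only the first three samples pass, since the other three start with a scheme or with //. Note that the pattern also accepts an empty href and fragment-only values such as #top, because neither starts with a scheme or a double slash.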