How to extract links from a page using Beautiful Soup
Question
I have an HTML page with multiple divs like:
<div class="post-info-wrap">
<h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post – Example 1 Post" rel="bookmark">sample post – example 1 post</a></h2>
<div class="post-meta clearfix">
<div class="post-info-wrap">
<h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post – Example 2 Post" rel="bookmark">sample post – example 2 post</a></h2>
<div class="post-meta clearfix">
I need to get the link from every div with the class post-info-wrap. I am new to BeautifulSoup.
So I need these URLs:
https://www.example.com/blog/111/this-is-2nd-post/
and so on...
I have tried:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.example.com/blog/author/abc")
data = r.content # Content of response
soup = BeautifulSoup(data, "html.parser")
for link in soup.select('.post-info-wrap'):
    print link.find('a').attrs['href']
This code doesn't seem to be working. I am new to Beautiful Soup. How can I extract the links?
Recommended answer
link = i.find('a', href=True) does not always return an anchor tag (a). It may return None when the div contains no anchor with an href, so you need to check the result: if link is None, continue with the next iteration of the loop; otherwise read the link's href value.
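To make the None case concrete, here is a minimal sketch. The markup is hypothetical (the second wrapper deliberately has no anchor), and the names wrappers, results and hrefs are chosen only for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second wrapper contains no <a> tag at all.
html = """
<div class="post-info-wrap"><h2><a href="https://www.example.com/blog/1/">first</a></h2></div>
<div class="post-info-wrap"><h2>no link here</h2></div>
"""

soup = BeautifulSoup(html, "html.parser")
wrappers = soup.find_all("div", class_="post-info-wrap")

# find() returns None when nothing matches, so filter before indexing.
results = [w.find("a", href=True) for w in wrappers]
hrefs = [a["href"] for a in results if a is not None]

print(results[1])  # None
print(hrefs)       # ['https://www.example.com/blog/1/']
```

Without the None check, results[1]['href'] would raise a TypeError, because None cannot be subscripted.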
Scrape links by URL:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.example.com/blog/author/abc")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:  # no anchor with an href in this div, skip it
        continue
    print(link['href'])
Scrape links from an HTML string:
from bs4 import BeautifulSoup
html = '''<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post – Example 1 Post" rel="bookmark">sample post – example 1 post</a></h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post – Example 2 Post" rel="bookmark">sample post – example 2 post</a></h2><div class="post-meta clearfix">'''
soup = BeautifulSoup(html, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
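As an alternative sketch (not part of the original answer), a CSS selector can fold the class filter and the href filter into one call, so no explicit None check is needed. The markup below is a shortened, well-formed variant of the sample, and links is an illustrative name:

```python
from bs4 import BeautifulSoup

# Shortened, well-formed variant of the sample markup (an assumption for this sketch).
html = """
<div class="post-info-wrap">
  <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" rel="bookmark">sample post 1</a></h2>
</div>
<div class="post-info-wrap">
  <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" rel="bookmark">sample post 2</a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select every <a> that has an href and sits inside a div.post-info-wrap.
links = [a["href"] for a in soup.select("div.post-info-wrap a[href]")]

for href in links:
    print(href)
```

Note that this collects every matching anchor inside each div, whereas the loop above keeps only the first one per div.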
Update:
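from bs4 import BeautifulSoup
from selenium import webdriver
# Path passed positionally, as in older Selenium releases; newer Selenium 4
# releases expect webdriver.Chrome(service=Service('/usr/bin/chromedriver')).
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://www.example.com/blog/author/abc")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()  # close the browser once the page source has been captured
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])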
Output:
https://www.example.com/blog/911/article-1/
https://www.example.com/blog/911/article-2/
https://www.example.com/blog/911/article-3/
https://www.example.com/blog/911/article-4/
https://www.example.com/blog/random-blog/article-5/
ChromeDriver downloads for the Chrome browser:
http://chromedriver.chromium.org/downloads
Installing the web driver for the Chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
Here '/usr/bin/chromedriver' is the path to the Chrome webdriver executable.