Extract iFrame content using BeautifulSoup
Problem description
On the page below --> link, I'm trying to use BeautifulSoup to extract the <a> texts at the very bottom, i.e., 'Private Life' and 'Lost Boy'.
But I'm having a hard time scraping <iframe> content. I've learned that it requires a different request from the browser. So I've tried:
iframexx = soup.find_all('iframe')
for iframe in iframexx:
    try:
        response = urllib2.urlopen(iframe)
        results = BeautifulSoup(response)
        print results
but that returns None. How do I parse the HTML below so I can fetch each a['href'].get_text()?
Browsers load the iframe content in a separate request, so you need to fetch the URL given in the iframe's src attribute. You can use Selenium if you want, or scrape the data directly. Here is an example:
import re
import requests

# URL taken from the iframe's src attribute
url = 'https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/310079005&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false'
response = requests.get(url)
# The regexes look for the artist":"..." and title":"..." fields in the raw response bytes
artist = re.search(b'(?<=artist":")(.*?)(?=")', response.content).group(0).decode("utf-8")
song = re.search(b'(?<=title":")(.*?)(?=")', response.content).group(0).decode("utf-8")
print("%s - %s" % (artist, song))
Private Life - Lost Boy
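
The example above hard-codes the iframe URL. As a rough sketch of the step before it, the src values can be collected from the outer page with BeautifulSoup, which is also what the attempt in the question is missing: urlopen needs the URL string from the src attribute, not the iframe tag itself. The helper name iframe_sources, the sample markup, and the example.com base URL are assumptions made for illustration and do not come from the original page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_sources(html, base_url):
    """Return the absolute URL of every <iframe> that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True skips iframes that have no src attribute at all
    return [urljoin(base_url, tag["src"]) for tag in soup.find_all("iframe", src=True)]


# Hypothetical page markup, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <iframe src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/310079005"></iframe>
  <iframe src="/embed/local-player"></iframe>
  <iframe></iframe>
</body></html>
"""

if __name__ == "__main__":
    for src in iframe_sources(SAMPLE_HTML, "https://example.com/page"):
        # Each src can then be requested on its own, as in the answer's example
        print(src)
```

Each returned URL can then be fetched with requests.get and either parsed again with BeautifulSoup or searched with a regular expression, depending on how the framed page delivers its data.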