Python BeautifulSoup: extracting the HTML document inside an iframe

Views: 809 | Published: 2016/8/5 18:58:49 | Tags: python, html, iframe, beautifulsoup

Question

I am trying to learn a bit of Beautiful Soup and to get some HTML data out of some iframes, but I have not been very successful so far.

Parsing the iframe element itself does not seem to be a problem with BS4, but I cannot get the embedded content out of it, whatever I do.

For example, consider the iframe below (this is what I see in Chrome developer tools):

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"
src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">
#document <html>....</html></iframe>

where <html>...</html> is the content I am interested in extracting.

However, when I use the following BS4 code:

iFrames = []  # quick bs4 example
for iframe in soup("iframe"):
    iFrames.append(iframe.extract())

I get:

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO" src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">

In other words, I get the iframes without the <html>...</html> document inside them.

I tried something along these lines:

# quick bs4 example
iframexx = soup.find_all('iframe')
for iframe in iframexx:
    print(iframe.find_all('html'))

... but this does not seem to work.

So I guess my question is: how do I reliably extract these <html>...</html> document objects from the iframe elements?

Recommended answer

Browsers load the iframe content in a separate request. You'll have to do the same:

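from urllib.request import urlopen  # Python 3 equivalent of urllib2.urlopen
from bs4 import BeautifulSoup

for iframe in iframexx:
    response = urlopen(iframe.attrs['src'])  # fetch the iframe document separately
    iframe_soup = BeautifulSoup(response, 'html.parser')  # parse it as its own soup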
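A slightly fuller sketch of the same idea, assuming Python 3 and the bs4 package. The helper names `iframe_urls`, `fetch_bytes`, and `fetch_iframe_soups`, as well as the `fetch` parameter, are illustrative and not part of the original answer. The sketch strips stray whitespace from each `src` and resolves relative values against the page URL before requesting the document.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def iframe_urls(soup, page_url):
    """Return the absolute URL of every iframe in soup that has a src."""
    urls = []
    for iframe in soup.find_all("iframe"):
        # Some pages leave whitespace inside the attribute value; strip it.
        src = (iframe.get("src") or "").strip()
        if src:
            # Resolve relative paths against the URL of the enclosing page.
            urls.append(urljoin(page_url, src))
    return urls


def fetch_bytes(url):
    """Download url and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


def fetch_iframe_soups(soup, page_url, fetch=fetch_bytes):
    """Fetch each iframe document separately and parse it into its own soup."""
    # fetch is injectable (hypothetical design choice) so the parsing logic
    # can be exercised without network access.
    return [
        BeautifulSoup(fetch(url), "html.parser")
        for url in iframe_urls(soup, page_url)
    ]
```

With a `soup` built from the outer page, `fetch_iframe_soups(soup, page_url)` would return one BeautifulSoup object per iframe, where `page_url` is the address the outer page was downloaded from. Iframes whose content is generated by JavaScript would still come back without that content, since nothing here executes scripts.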

Remember: BeautifulSoup is not a browser; it won't fetch images, CSS, or JavaScript resources for you either.
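To see this concretely, the following sketch parses a page fragment similar to the one in the question. The `page_source` string is a made-up stand-in for the downloaded page, and the `logo.png` image is a hypothetical addition. The `#document` node that Chrome's developer tools display is not part of the downloaded markup, so BeautifulSoup only ever sees the iframe tag and its attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: an empty iframe plus an image reference.
page_source = (
    '<html><body>'
    '<iframe src="http://www.engineeringmaterials.com/boron/728x90.html" '
    'width="728" height="90"></iframe>'
    '<img src="logo.png">'
    '</body></html>'
)

soup = BeautifulSoup(page_source, "html.parser")
iframe = soup.find("iframe")

# The iframe has no nested <html> element in the static markup.
nested_html = iframe.find_all("html")

# External resources are only attribute values; nothing has been downloaded.
iframe_src = iframe["src"]
image_src = soup.img["src"]

print(nested_html)
print(iframe_src, image_src)
```

The `src` values are the only handle on the embedded content, which is why each one has to be requested separately.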
