python beautifulsoup iframe document html extract

Question

I am trying to learn a bit of Beautiful Soup and to get some HTML data out of some iframes, but I have not been very successful so far.

Parsing the iframe element itself does not seem to be a problem with BS4, but I cannot get the embedded content out of it, whatever I do.

For example, consider the iframe below (this is what I see in Chrome developer tools):

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"
src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">
#document <html>....</html></iframe>

where <html>...</html> is the content I am interested in extracting.

However, when I use the following BS4 code:

iFrames = []  # quick bs4 example
for iframe in soup("iframe"):
    iFrames.append(iframe.extract())  # detach the tag found by this iteration

I get:

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO" src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">

In other words, I get the iframes without the <html>...</html> document inside them.

I also tried the following:

iFrames = []  # quick bs4 example
iframexx = soup.find_all('iframe')
for iframe in iframexx:
    print(iframe.find_all('html'))

...but this does not seem to work.

So I guess my question is: how do I reliably extract these <html>...</html> document objects from the iframe elements?

Recommended answer

Browsers load the iframe content in a separate request. You'll have to do the same:

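from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen
from bs4 import BeautifulSoup

for iframe in iframexx:
    response = urlopen(iframe.attrs['src'])
    iframe_soup = BeautifulSoup(response, 'html.parser')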
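
The loop above assumes every iframe has an absolute, cleanly written src. As a rough sketch only, the same idea can be made a little more tolerant by skipping iframes without a src, stripping stray whitespace, and resolving relative paths against the URL of the page that contains them. The helper names `iframe_urls` and `fetch_iframe_soups` and the `page_url` parameter are made up for this illustration; they are not part of Beautiful Soup.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def iframe_urls(soup, page_url):
    """Return the absolute URL of every iframe in `soup` that has a src.

    `page_url` (hypothetical parameter) is the address the outer page
    was loaded from; relative src values are resolved against it.
    """
    urls = []
    for iframe in soup.find_all("iframe"):
        # Tolerate a missing src and stray whitespace around the value.
        src = (iframe.get("src") or "").strip()
        if src:
            urls.append(urljoin(page_url, src))
    return urls


def fetch_iframe_soups(soup, page_url):
    """Download each framed document and parse it into its own soup."""
    soups = []
    for url in iframe_urls(soup, page_url):
        # One extra HTTP request per iframe, just as a browser would make.
        with urlopen(url) as response:
            soups.append(BeautifulSoup(response.read(), "html.parser"))
    return soups
```

Calling `fetch_iframe_soups(soup, page_url)` with the soup of the outer page returns one soup per framed document, and each of those can then be searched with `find_all` as usual. Error handling for failed requests is left out to keep the sketch short.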

Remember: BeautifulSoup is not a browser; it won't fetch images, CSS, or JavaScript resources for you either.
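
To see why the attempts in the question come back empty, it may help to parse markup as the server actually sends it. The small example below assumes the outer page contains only the iframe tag itself (the `page_source` string is invented for illustration); the `#document` node shown in Chrome developer tools is added by the browser after it loads the src, so it is never in the HTML that Beautiful Soup receives.

```python
from bs4 import BeautifulSoup

# Hypothetical outer page: the iframe tag as served, with nothing inside it.
page_source = (
    '<html><body>'
    '<iframe frameborder="0" scrolling="NO" '
    'src="http://www.engineeringmaterials.com/boron/728x90.html" '
    'width="728" height="90"></iframe>'
    '</body></html>'
)

soup = BeautifulSoup(page_source, "html.parser")
iframe = soup.find("iframe")

# No nested <html> element exists inside the iframe tag in the served markup.
nested_documents = iframe.find_all("html")
print(len(nested_documents))

# The only link to the framed document is the src attribute.
print(iframe["src"])
```

The src value is the piece to act on: fetching that URL in a separate request is what yields the framed document.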
