How to get contents of frames automatically if browser does not support frames + can't access frame directly

Question

I am trying to automatically download PDFs from URLs like this to build a library of UN resolutions.

If I use Beautiful Soup or mechanize to open that URL, I get "Your browser does not support frames", and I get the same thing if I use the "Copy as cURL" feature in Chrome dev tools.

The standard advice for "Your browser does not support frames" when using mechanize or Beautiful Soup is to follow the source of each individual frame and load that frame. But if I do so, I get an error message saying that the page is not authorized.

How can I proceed? I guess I could try this in zombie or phantom, but I would prefer not to use those tools as I am not that familiar with them.

Recommended answer

Okay, this was an interesting task to do with requests and BeautifulSoup.

There is a set of underlying calls to un.org and daccess-ods.un.org that are important and set relevant cookies. This is why you need to maintain a requests.Session() and visit several URLs before you can access the PDF.
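To illustrate why the shared session matters, here is a minimal sketch that uses only the standard library. It is not the UN site's actual behavior: the toy server, its `/frame` and `/pdf` paths, and the `ticket` cookie are all hypothetical, and `urllib.request` with a `CookieJar` stands in for `requests.Session()`. The point is only that a cookie set by an earlier page can gate a later download.

```python
import http.cookiejar
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server (hypothetical): /frame sets a cookie, /pdf requires it."""

    def do_GET(self):
        cookie = self.headers.get("Cookie", "")
        if self.path == "/frame":
            self.send_response(200)
            self.send_header("Set-Cookie", "ticket=abc123; Path=/")
            self.end_headers()
            self.wfile.write(b"frame page")
        elif self.path == "/pdf" and "ticket=abc123" in cookie:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"%PDF-demo")
        else:
            self.send_response(403)
            self.end_headers()
            self.wfile.write(b"not authorized")

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(opener, url):
    """Return (status, body); HTTP errors are treated as ordinary responses."""
    try:
        with opener.open(url) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port
no_proxy = urllib.request.ProxyHandler({})  # talk to the local server directly

# Without a cookie jar, going straight to the document is rejected.
plain = urllib.request.build_opener(no_proxy)
direct_status, _ = fetch(plain, base + "/pdf")

# With a cookie jar shared across requests, visiting /frame first unlocks /pdf.
jar = http.cookiejar.CookieJar()
session = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(jar)
)
fetch(session, base + "/frame")
session_status, body = fetch(session, base + "/pdf")

server.shutdown()
server.server_close()
print(direct_status, session_status, body)
```

With requests, `requests.Session()` plays the role of the cookie-aware opener here: every `session.get(...)` call sends back whatever cookies the earlier responses set.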

Here is the complete code:

import re
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

from bs4 import BeautifulSoup
import requests


BASE_URL = 'http://www.un.org/en/ga/search/'
URL = "http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278"
BASE_ACCESS_URL = 'http://daccess-ods.un.org'

# start session
session = requests.Session()
response = session.get(URL, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})

# get frame links
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly
frames = soup.find_all('frame')
header_link, document_link = [urljoin(BASE_URL, frame.get('src')) for frame in frames]

# get header
session.get(header_link, headers={'Referer': URL})

# get document html url
response = session.get(document_link, headers={'Referer': URL})
soup = BeautifulSoup(response.text, 'html.parser')

content = soup.find('meta', content=re.compile('URL='))['content']
document_html_link = re.search('URL=(.*)', content).group(1)
document_html_link = urljoin(BASE_ACCESS_URL, document_html_link)

# follow html link and get the pdf link
response = session.get(document_html_link)
soup = BeautifulSoup(response.text, 'html.parser')

# get the real document link
content = soup.find('meta', content=re.compile('URL='))['content']
document_link = re.search('URL=(.*)', content).group(1)
document_link = urljoin(BASE_ACCESS_URL, document_link)
print(document_link)

# follow the frame link with login and password first - would set the important cookie
auth_link = soup.find('frame', {'name': 'footer'})['src']
session.get(auth_link)

# download file
with open('document.pdf', 'wb') as handle:
    response = session.get(document_link, stream=True)

    for block in response.iter_content(1024):
        if not block:
            break

        handle.write(block)

You should probably extract the separate blocks of code into functions to make the script more readable and reusable.
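As a rough sketch of what that refactoring could look like, the two parsing steps (collecting frame sources and following a meta refresh) can be pulled into small helpers. The function names `frame_links` and `meta_refresh_link` are hypothetical, the sample HTML and example.org URLs are made up for illustration, and the standard library's `html.parser` is used here in place of BeautifulSoup so the snippet runs without extra packages.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _TagCollector(HTMLParser):
    """Collect <frame src=...> values and <meta content=...> values."""

    def __init__(self):
        super().__init__()
        self.frame_sources = []
        self.meta_contents = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "frame" and attrs.get("src"):
            self.frame_sources.append(attrs["src"])
        elif tag == "meta" and attrs.get("content"):
            self.meta_contents.append(attrs["content"])


def _collect(html):
    collector = _TagCollector()
    collector.feed(html)
    return collector


def frame_links(html, base_url):
    """Return absolute URLs for every <frame> in a frameset page."""
    return [urljoin(base_url, src) for src in _collect(html).frame_sources]


def meta_refresh_link(html, base_url):
    """Return the absolute target of the first meta refresh, or None."""
    for content in _collect(html).meta_contents:
        match = re.search(r"URL=(.*)", content, re.IGNORECASE)
        if match:
            return urljoin(base_url, match.group(1).strip())
    return None


# Made-up sample pages, only to exercise the helpers.
frameset = (
    '<frameset rows="10%,*">'
    '<frame name="top" src="header.html">'
    '<frame name="main" src="doc.html?id=1">'
    "</frameset>"
)
redirect = '<meta http-equiv="refresh" content="0; URL=/tmp/123.html">'

print(frame_links(frameset, "http://example.org/search/"))
print(meta_refresh_link(redirect, "http://example.org"))
```

In the script above, each `soup.find(...)` / `re.search(...)` pair would become one call to a helper like these, leaving the main flow as a short sequence of `session.get(...)` calls.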

FYI, all of this could be done more easily through a real browser with the help of selenium or Ghost.py.

Hope that helps.
