How to follow a redirect with urllib?

Question

I'm creating a script in Python 3 that accesses a page like:

example.com/daora/zz.asp?x=qqrzzt

using urllib.request.urlopen("example.com/daora/zz.asp?x=qqrzzt"), but this code just gives me the same page (example.com/daora/zz.asp?x=qqrzzt), whereas in the browser I get redirected to a page like:

example.com/egg.aspx

What can I do to retrieve

example.com/egg.aspx

instead of

example.com/daora/zz.asp?x=qqrzzt

I think this is the relevant code. It comes from "example.com/daora/zz.asp?x=qqrzzt":

<head>

<script language="JavaScript">

<!--
    function Submit()

    {
        document.formzz.submit();
    }
-->
</script>

</head>

<body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onLoad="javascript:Submit();">

<form name="formZZ" method="post" action="http://example.com/egg.aspx">

<input type="hidden" name="token" value="UFASGFJKASGDJFGAJS">

</form>

Recommended answer

urllib.request follows HTTP redirects automatically; you don't need to do anything.
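To see that behavior in isolation, here is a minimal sketch that does not touch the site from the question. It assumes a throwaway local server (the RedirectHandler class, the /old and /new paths, and the fetch helper are all made up for this demo) that answers /old with a 302. The opener is built with an empty ProxyHandler only so that a system proxy cannot get in the way of the local request; the redirect handling is the default one.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical demo server: /old answers with a 302 pointing at /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.end_headers()
        else:
            body = b"final page"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(url):
    """Return (final_url, body) once urllib has followed any HTTP redirects."""
    # An empty ProxyHandler bypasses system proxies; redirects are still
    # handled by the default HTTPRedirectHandler that build_opener adds.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        return response.geturl(), response.read()


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

final_url, body = fetch(base + "/old")
server.shutdown()
server.server_close()
```

After the call, final_url ends in /new rather than /old, which shows that a real HTTP redirect (a 3xx status with a Location header) needs no extra code on the client side.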

The problem here is that there is no redirect to follow. The web page uses JavaScript to fake a form submission as soon as it's loaded. urllib just fetches the page; it doesn't implement a browser DOM or run JavaScript code.

Depending on how general you need your script to be, the simplest solution may be something hacky. For example, if you're just trying to spider 500 pages that all have a similar structure but different details, just find the action of the first form on each page and navigate to that.
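The hacky approach could look roughly like the sketch below. It is only one possible reading of "find the action of the first form": the FirstFormParser class and the build_form_request helper are hypothetical names, the HTML is a trimmed copy of the page from the question, and the sketch assumes the form's hidden inputs (here, token) should be sent along, as a browser would do on submit.

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class FirstFormParser(HTMLParser):
    """Collect the action, method and named inputs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_first_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done and not self._in_first_form:
            self._in_first_form = True
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_first_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_first_form:
            self._in_first_form = False
            self._done = True  # ignore any later forms


def build_form_request(page_url, html):
    """Build the request a browser would send when submitting the first form."""
    parser = FirstFormParser()
    parser.feed(html)
    if parser.action is None:
        raise ValueError("no form with an action attribute found")
    # Resolve a relative action against the URL of the page it came from.
    target = urllib.parse.urljoin(page_url, parser.action)
    encoded = urllib.parse.urlencode(parser.fields)
    if parser.method == "post":
        return urllib.request.Request(target, data=encoded.encode("ascii"))
    if encoded:
        target += ("&" if "?" in target else "?") + encoded
    return urllib.request.Request(target)


# Trimmed copy of the page shown in the question.
sample_html = """
<body onLoad="javascript:Submit();">
<form name="formZZ" method="post" action="http://example.com/egg.aspx">
<input type="hidden" name="token" value="UFASGFJKASGDJFGAJS">
</form>
</body>
"""

request = build_form_request(
    "http://example.com/daora/zz.asp?x=qqrzzt", sample_html
)
# To actually fetch the target page:
# html = urllib.request.urlopen(request).read()
```

In a real script, sample_html would be the decoded body returned by urlopen for the first page. Note that urlopen needs a full URL including the scheme (http://...), and that this shortcut only works while the pages keep the same simple structure.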

Also, if fetching the pages and processing them are two distinct steps, you may want to write a fetcher in super-simple JavaScript/Greasemonkey (running in the browser, so it already has a working DOM implementation, etc.) and a separate, fancier processing script in Python (which just operates on the finally fetched/generated HTML pages).

If you need to be fully general, the simplest solution is probably to use the Selenium browser automation framework. (Or, maybe, PyWin32 or PyObjC to automate IE or WebKit directly.)
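A rough sketch of the Selenium route follows. It assumes the third-party selenium package is installed and that Firefox with a matching driver is available on the machine; the fetch_with_browser function and its expected_fragment parameter are hypothetical names chosen for this example, not part of Selenium.

```python
def fetch_with_browser(url, expected_fragment, timeout=10):
    """Load url in a real browser, wait for the scripted form submission
    to land on a URL containing expected_fragment, and return
    (final_url, html). Assumes selenium and Firefox are installed."""
    # Imported here so this module still loads where selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # The page submits its form from onLoad, so wait for the new URL.
        WebDriverWait(driver, timeout).until(EC.url_contains(expected_fragment))
        return driver.current_url, driver.page_source
    finally:
        driver.quit()  # always close the browser, even on timeout


# Example call (needs a browser, so it is left commented out):
# final_url, html = fetch_with_browser(
#     "http://example.com/daora/zz.asp?x=qqrzzt", "egg.aspx"
# )
```

Because a real browser runs the page's JavaScript, this approach follows the scripted form submission that urllib cannot see, at the cost of being much slower and heavier than a plain HTTP request.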

If you want the best possible solution, and have infinite resources… write your own implementation of the DOM and hook up your favorite JavaScript interpreter (probably SpiderMonkey or V8). That's only about two-thirds as much work as writing a new browser. (And you may be able to find pieces that get you 80% of the way there. For example, if you're willing to use Jython instead of CPython as your Python interpreter, HtmlUnit is pretty slick.)
