How to write a web proxy in Python


Problem Description


I'm trying to write a web proxy in Python. The goal is to visit a URL like http://proxyurl/http://anothersite.com/ and see the contents of http://anothersite.com just like you would normally. I've gotten decently far by abusing the requests library, but this isn't really the intended use of the requests framework. I've written proxies with Twisted before, but I'm not sure how to connect that to what I'm trying to do. Here's where I'm at so far...

import os
import urlparse

import requests

import tornado.ioloop
import tornado.web
from tornado import template

ROOT = os.path.dirname(os.path.abspath(__file__))
path = lambda *a: os.path.join(ROOT, *a)

loader = template.Loader(path('templates'))  # path() already joins against ROOT


class ProxyHandler(tornado.web.RequestHandler):
    def get(self, slug):
        if slug.startswith("http://") or slug.startswith("https://"):
            if self.get_argument("start", None) == "true":
                parsed = urlparse.urlparse(slug)
                self.set_cookie("scheme", value=parsed.scheme)
                self.set_cookie("netloc", value=parsed.netloc)
                self.set_cookie("urlpath", value=parsed.path)
            #external resource
            else:
                response = requests.get(slug)
                headers = response.headers
                if 'content-type' in headers:
                    self.set_header('Content-type', headers['content-type'])
                if 'content-length' in headers:
                    self.set_header('Content-Length', headers['content-length'])
                for block in response.iter_content(1024):
                    self.write(block)
                self.finish()
                return
        else:
            #absolute
            if slug.startswith('/'):
                slug = "{scheme}://{netloc}{original_slug}".format(
                    scheme=self.get_cookie('scheme'),
                    netloc=self.get_cookie('netloc'),
                    original_slug=slug,
                )
            #relative
            else:
                slug = "{scheme}://{netloc}{path}{original_slug}".format(
                    scheme=self.get_cookie('scheme'),
                    netloc=self.get_cookie('netloc'),
                    path=self.get_cookie('urlpath'),
                    original_slug=slug,
                )
        response = requests.get(slug)
        #get the headers
        headers = response.headers
        #get doctype
        doctype = None
        if '<!doctype' in response.content.lower()[:9]:
            doctype = response.content[:response.content.find('>')+1]
        if 'content-type' in headers:
            self.set_header('Content-type', headers['content-type'])
        if 'content-length' in headers:
            self.set_header('Content-Length', headers['content-length'])
        self.write(response.content)


application = tornado.web.Application([
    (r"/(.+)", ProxyHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

Just a note: I set a cookie to preserve the scheme, netloc, and urlpath if there's start=true in the query string. That way, any relative or absolute link that then hits the proxy uses that cookie to resolve the full URL.

With this code, if you go to http://localhost:8888/http://espn.com/?start=true you'll see the contents of ESPN. However, on the following site it doesn't work at all: http://www.bottegaveneta.com/us/shop/. My question is, what's the best way to do this? Is the current way I'm implementing this robust or are there some terrible pitfalls to doing it this way? If it is correct, why are certain sites like the one I pointed out not working at all?

Thank you for any help.

Solution

I recently wrote a similar web application. Note that this is the way I did it; I'm not saying you should do it like this. These are some of the pitfalls I came across:

Changing attribute values from relative to absolute

There is much more involved than just fetching a page and presenting it to the client. Many times you're not able to proxy the webpage without any errors.

Why are certain sites like the one I pointed out not working at all?

Many webpages rely on relative paths to resources in order to display the page in a well-formatted manner. For example, this image tag:

<img src="/header.png" />

Will result in the client making a request to:

http://proxyurl/header.png

Which fails. The 'src' value should be converted to:

http://anothersite.com/header.png

So, you need to parse the HTML document with something like BeautifulSoup, loop over all the tags and check for attributes such as:

'src', 'lowsrc', 'href'

And change their values accordingly so that the tag becomes:

<img src="http://anothersite.com/header.png" />

This method applies to more tags than just the image one. a, script, link, li and frame are a few you should change as well.

HTML shenanigans

The prior method should get you far, but you're not done yet.

Both

<style type="text/css" media="all">@import "/stylesheet.css?version=120215094129002";</style>

And

<div style="position:absolute;right:8px;background-image:url('/Portals/_default/Skins/BE/images/top_img.gif');height:200px;width:427px;background-repeat:no-repeat;background-position:right top;" >

are examples of code that's difficult to reach and modify using BeautifulSoup.

In the first example there is a CSS @import of a relative URI. The second one involves the url() function in an inline CSS statement.

In my situation, I ended up writing horrible code to manually modify these values. You may want to use regular expressions for this, but I'm not sure.

Redirects

With Python-Requests or Urllib2 you can easily follow redirects automatically. Just remember to save what the new (base) URI is; you'll need it for the 'changing the attribute values from relative to absolute' operation.

You also need to deal with 'hardcoded' redirects. Such as this one:

<meta http-equiv="refresh" content="0;url=http://new-website.com/">

Needs to be changed to:

<meta http-equiv="refresh" content="0;url=http://proxyurl/http://new-website.com/">

Base tag

The base tag specifies the base URL/target for all relative URLs in a document. You probably want to change the value.

Finally done?

Nope. Some websites rely heavily on JavaScript to draw their content on screen. Those sites are the hardest to proxy. I've been thinking about using something like PhantomJS or Ghost to fetch and evaluate webpages and present the result to the client.

Maybe my source code can help you. You can use it in any way you want.
