MITMProxy: smart URL replacement

Problem description

We use a custom scraper that has to be pointed at a separate website for each language (this is an architectural limitation), such as site1.co.uk, site1.es, site1.de, etc.

But we need to parse a website that serves many languages separated by URL path, such as site2.com/en, site2.com/de, site2.com/es, and so on.

I thought about using MITMProxy: I could redirect all requests this way:

en.site2.com/* --> site2.com/en
de.site2.com/* --> site2.com/de
...
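
The mapping above can be sketched as a pure URL-rewriting function, independent of any proxy API. This is only an illustration under stated assumptions: the names `LANGUAGES`, `ROOT_DOMAIN`, and `rewrite_url` are hypothetical, the language list is taken from the examples in the question, and any user-info part of the URL is dropped.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed for illustration: the language subdomains and the root domain
# come from the examples in the question, not from a real configuration.
LANGUAGES = {"en", "de", "es"}
ROOT_DOMAIN = "site2.com"


def rewrite_url(url):
    """Map <lang>.site2.com/<path> to site2.com/<lang>/<path>.

    URLs whose host is not a known language subdomain are returned unchanged.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    suffix = "." + ROOT_DOMAIN
    if not host.endswith(suffix):
        return url
    lang = host[: -len(suffix)]
    if lang not in LANGUAGES:
        return url

    # Keep an explicit port if the original URL had one.
    netloc = ROOT_DOMAIN
    if parts.port:
        netloc += ":%d" % parts.port

    # Prefix the language code; an empty or root path becomes just /<lang>.
    tail = parts.path if parts.path not in ("", "/") else ""
    path = "/" + lang + tail
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))
```

Keeping the mapping in a plain function like this makes it easy to check on its own before wiring it into a proxy handler.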

I have written a small script that simply takes URLs and rewrites them:

class MyMaster(flow.FlowMaster):  # from libmproxy import flow

  def handle_request(self, r):
    url = r.get_url()

    # replace URLs
    if 'blabla' in url:
      r.set_url(url.replace('something', 'another'))

But the target host answers with a 301 redirect: the web server's response says 'the page has been moved here' and links to site2.com/en.

It worked when I experimented with rewriting only the path, e.g. site2.com/en --> site2.com/de. But it does not work across different hosts (a subdomain and the root domain, to be precise).

I tried to replace the Host header in the handle_request method shown above:

for key in r.headers.keys():
  if key.lower() == 'host':
    r.headers[key] = ['site2.com']

I also tried replacing the Referer header; none of that helped.

How can I spoof the request so that one sent to the subdomain is served by the main domain? It is fine if this triggers an HTTP(S) client warning, since we need it for the scraper (where warnings can be turned off), not for a real browser.

Thanks!

Recommended answer

You need to replace the content of the response and craft its headers with just a few fields. Open a new connection to the redirected URL and build your response:

def handle_request(self, flow):
  newUrl = <new-url>
  retryCount = 3
  newResponse = None
  while True:
    try:
      newResponse = requests.get(newUrl) # import requests
    except requests.RequestException:  # retry only on request/network errors
      if retryCount == 0:
        print 'Cannot reach new url ' + newUrl
        traceback.print_exc() # import traceback
        return

      retryCount -= 1
      continue
    break

  responseHeaders = Headers() # from netlib.http import Headers

  if 'Date' in newResponse.headers:
    responseHeaders['Date'] = str(newResponse.headers['Date'])
  if 'Connection' in newResponse.headers:
    responseHeaders['Connection'] = str(newResponse.headers['Connection'])
  if 'Content-Type' in newResponse.headers:
    responseHeaders['Content-Type'] = str(newResponse.headers['Content-Type'])
  if 'Content-Length' in newResponse.headers:
    responseHeaders['Content-Length'] = str(newResponse.headers['Content-Length'])
  if 'Content-Encoding' in newResponse.headers:
    responseHeaders['Content-Encoding'] = str(newResponse.headers['Content-Encoding'])

  response = HTTPResponse(   # from libmproxy.models import HTTPResponse
    http_version='HTTP/1.1',
    status_code=200,
    reason='OK',
    headers=responseHeaders,
    content=newResponse.content)

  flow.reply(response)
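
One detail worth checking in the header-copying step: `requests` transparently decodes gzip and deflate bodies when `.content` is read, so the upstream `Content-Encoding` and `Content-Length` values may no longer describe the bytes being forwarded. The sketch below shows one possible way to pick the same handful of headers while keeping them consistent with the body. It uses plain dictionaries rather than the proxy's header class, and the names `COPIED_HEADERS` and `pick_headers` are hypothetical.

```python
# The same header names the answer copies from the upstream response.
COPIED_HEADERS = ("Date", "Connection", "Content-Type",
                  "Content-Length", "Content-Encoding")


def pick_headers(upstream_headers, body, body_is_decoded=True):
    """Return only the whitelisted headers, as a plain dict of strings.

    If the body has already been decoded (assumed to be the case when it
    comes from requests' `.content`), drop Content-Encoding and recompute
    Content-Length from the actual body bytes.
    """
    # Header names are case-insensitive, so look them up in lower case.
    lookup = {name.lower(): value for name, value in upstream_headers.items()}

    picked = {}
    for name in COPIED_HEADERS:
        value = lookup.get(name.lower())
        if value is not None:
            picked[name] = str(value)

    if body_is_decoded:
        picked.pop("Content-Encoding", None)
        picked["Content-Length"] = str(len(body))
    return picked
```

The resulting dictionary could then be copied into whatever header object the proxy version in use expects before the response is built.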
