Parse raw HTTP Headers

Question

I have a string of raw HTTP and I would like to represent the fields in an object. Is there any way to parse the individual headers from an HTTP string?

'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n
[...]'


Answer

There are excellent tools in the Standard Library both for parsing RFC 822 headers, and also for parsing entire HTTP requests. Here is an example request string (note that Python treats it as one big string, even though we are breaking it across several lines for readability) that we can feed to my examples:

request_text = (
    'GET /who/ken/trust.html HTTP/1.1\r\n'
    'Host: cm.bell-labs.com\r\n'
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    'Accept: text/html;q=0.9,text/plain\r\n'
    '\r\n'
    )

As @TryPyPy points out, you can use mimetools.Message to parse the headers — though we should add that the resulting Message object acts like a dictionary of headers once you are done creating it:

# Ignore the request line and parse only the headers

from mimetools import Message
from StringIO import StringIO
request_line, headers_alone = request_text.split('\r\n', 1)
headers = Message(StringIO(headers_alone))

print len(headers)     # -> "3"
print headers.keys()   # -> ['accept-charset', 'host', 'accept']
print headers['Host']  # -> "cm.bell-labs.com"
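
The code above is Python 2 only; mimetools no longer exists in Python 3. A rough Python 3 sketch of the same header-only parsing, using the standard email package (an assumed equivalent, not part of the original answer), looks like this:

# Python 3 sketch: mimetools is gone, so parse the header block with email
from email import message_from_string

request_text = (
    'GET /who/ken/trust.html HTTP/1.1\r\n'
    'Host: cm.bell-labs.com\r\n'
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    'Accept: text/html;q=0.9,text/plain\r\n'
    '\r\n'
    )

# Split off the request line, then parse the remaining header block
request_line, headers_alone = request_text.split('\r\n', 1)
headers = message_from_string(headers_alone)   # an email.message.Message

print(len(headers))      # -> 3
print(headers.keys())    # -> ['Host', 'Accept-Charset', 'Accept']
print(headers['Host'])   # -> "cm.bell-labs.com"

Unlike the old rfc822-based Message, the keys keep their original capitalization, but lookups stay case-insensitive.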

But this, of course, ignores the request line, or makes you parse it yourself. It turns out that there is a much better solution.
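
If you do want a quick stopgap for that request line, hand-parsing it is just a split (a throwaway sketch, not part of the original answer):

# The request line split off above looks like 'GET /who/ken/trust.html HTTP/1.1'
request_line = 'GET /who/ken/trust.html HTTP/1.1'
method, path, version = request_line.split(' ', 2)

print(method)    # -> "GET"
print(path)      # -> "/who/ken/trust.html"
print(version)   # -> "HTTP/1.1"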

The Standard Library will parse HTTP for you if you use its BaseHTTPRequestHandler. Though its documentation is a bit obscure — a problem with the whole suite of HTTP and URL tools in the Standard Library — all you have to do to make it parse a string is (a) wrap your string in a StringIO(), (b) read the raw_requestline so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing instead of letting it try to write them back to the client (since we do not have one!).

So here is our specialization of the Standard Library class:

from BaseHTTPServer import BaseHTTPRequestHandler
from StringIO import StringIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        # (a) wrap the raw request so the handler can read it like a socket file
        self.rfile = StringIO(request_text)
        # (b) read the request line so parse_request() can pick up from there
        self.raw_requestline = self.rfile.readline()
        # (c) record errors here instead of writing them back to a client
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message

Again, I wish the Standard Library folks had realized that HTTP parsing should be broken out in a way that did not require us to write nine lines of code to properly call it, but what can you do? Here is how you would use this simple class:

# Using this new class is really easy!

request = HTTPRequest(request_text)

print request.error_code       # None  (check this first)
print request.command          # "GET"
print request.path             # "/who/ken/trust.html"
print request.request_version  # "HTTP/1.1"
print len(request.headers)     # 3
print request.headers.keys()   # ['accept-charset', 'host', 'accept']
print request.headers['host']  # "cm.bell-labs.com"

If there is an error during parsing, the error_code will not be None:

# Parsing can result in an error code and message

request = HTTPRequest('GET\r\nHeader: Value\r\n\r\n')

print request.error_code     # 400
print request.error_message  # "Bad request syntax ('GET')"

I prefer using the Standard Library like this because I suspect that they have already encountered and resolved any edge cases that might bite me if I try re-implementing an Internet specification myself with regular expressions.
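
The same trick still works on Python 3, where BaseHTTPServer became http.server and the handler reads bytes rather than str. A minimal sketch of that port (my assumption of the equivalent, not part of the original answer):

# Python 3 sketch: http.server replaces BaseHTTPServer, and rfile must
# yield bytes, so the raw request is wrapped in a BytesIO.
from http.server import BaseHTTPRequestHandler
from io import BytesIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    # send_error accepts an extra 'explain' argument in Python 3
    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message

request = HTTPRequest(
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'\r\n'
    )

print(request.error_code)       # -> None
print(request.command)          # -> "GET"
print(request.path)             # -> "/who/ken/trust.html"
print(request.headers['Host'])  # -> "cm.bell-labs.com"

Here request.headers is an http.client.HTTPMessage, which is itself an email.message.Message, so it behaves like the header dictionary shown earlier.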
