How to handle GET parameters containing non-UTF-8 characters?


Question

In a nodejs/express-based application I need to handle GET requests that may contain umlauts encoded using the ISO-8859-1 charset.

Unfortunately, its querystring parser seems to handle only plain ASCII and UTF-8:

> qs.parse('foo=bar&xyz=foo%20bar')
{ foo: 'bar', xyz: 'foo bar' } // works fine
> qs.parse('foo=bar&xyz=T%FCt%20T%FCt')
{ foo: 'bar', xyz: 'T%FCt%20T%FCt' } // ISO-8859-1 breaks, should be "Tüt Tüt"
> qs.parse('foo=bar&xyz=m%C3%B6p')
{ foo: 'bar', xyz: 'möp' } // UTF-8 works fine

Is there a hidden option or another clean way to make this work with other charsets, too? The major problem with the default behaviour is that I have no way of knowing whether a decoding error occurred: after all, the input could have been something that simply decoded to a string that still looks URL-encoded.

Recommended answer

URL encoding should always be UTF-8; anything else can be treated as an encoding attack, and the request can simply be rejected. There is no such thing as a non-UTF-8 character. I don't know why your application would receive query strings in some other encoding, but with browsers you will be fine if your pages simply declare a charset header. For API requests or similar, you can specify UTF-8 and reject invalid UTF-8 as a Bad Request.
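As a minimal sketch of the "reject invalid UTF-8" approach, the built-in decodeURIComponent can serve as a strict check, because it throws a URIError when the percent-escapes do not form valid UTF-8. The helper name decodeUtf8OrNull is hypothetical and not part of express or any querystring library.

```javascript
// Hypothetical helper (name is an assumption for illustration only):
// strictly decode a percent-encoded value as UTF-8.
// Returns the decoded string, or null if the escapes are not valid UTF-8.
function decodeUtf8OrNull(value) {
    try {
        // decodeURIComponent throws URIError on malformed UTF-8 sequences.
        return decodeURIComponent(value);
    } catch (err) {
        if (err instanceof URIError) {
            return null;
        }
        throw err;
    }
}
```

A caller can then tell a decoding failure apart from a successful decode, for example by answering with 400 Bad Request whenever the helper returns null.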

If you really mean ISO-8859-1, then it is very simple, because the byte values match Unicode code points exactly.

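// Each %XX escape is one ISO-8859-1 byte, which equals its Unicode code point.
var decoded = 'T%FCt%20T%FCt'.replace(/%([a-f0-9]{2})/gi, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
});
// decoded is now "Tüt Tüt"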

On the web, though, it is probably never really ISO-8859-1 but Windows-1252.
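If the input is in fact Windows-1252, the same replace approach still works with one adjustment: bytes 0x80 to 0x9F no longer equal their code points and need a lookup table. The sketch below assumes the WHATWG mapping for that range (unassigned bytes map to the control character of the same value); the names CP1252_HIGH and decodeWindows1252 are hypothetical.

```javascript
// Code points for Windows-1252 bytes 0x80..0x9F (WHATWG mapping, assumed here).
// Every other byte maps to the code point of the same value, as in ISO-8859-1.
const CP1252_HIGH = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
];

// Hypothetical helper: decode percent-escapes as Windows-1252 bytes.
function decodeWindows1252(value) {
    return value.replace(/%([a-f0-9]{2})/gi, function (match, hex) {
        const byte = parseInt(hex, 16);
        const inHighRange = byte >= 0x80 && byte <= 0x9F;
        const codePoint = inHighRange ? CP1252_HIGH[byte - 0x80] : byte;
        return String.fromCharCode(codePoint);
    });
}
```

For input such as 'T%FCt%20T%FCt' this gives the same result as the ISO-8859-1 version; the two only differ for escapes in the %80 to %9F range, such as %80 for the euro sign.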
