How to handle GET parameters containing non-utf8 characters?


Problem description

In a nodejs/express-based application I need to deal with GET requests which may contain umlauts encoded using the iso-8859-1 charset.

Unfortunately its querystring parser seems to handle only plain ASCII and UTF8:

> qs.parse('foo=bar&xyz=foo%20bar')
{ foo: 'bar', xyz: 'foo bar' } # works fine
> qs.parse('foo=bar&xyz=T%FCt%20T%FCt')
{ foo: 'bar', xyz: 'T%FCt%20T%FCt' } # iso-8859-1 breaks, should be "Tüt Tüt"
> qs.parse('foo=bar&xyz=m%C3%B6p')
{ foo: 'bar', xyz: 'möp' } # utf8 works fine

Is there a hidden option or another clean way to make this work with other charsets, too? The major problem with the default behaviour is that I have no way of knowing whether a decoding error occurred. After all, the input could have been something that simply decoded to a string that still looks URL-encoded.

Solution

Well, URL encoding should always be UTF-8; anything else can be treated as an encoding attack, and the request can simply be rejected. There is no such thing as a "non-utf8 character": every character can be written in UTF-8, and what you are seeing are bytes in a different encoding. I don't know why your application would receive query strings in some other encoding, but you will be fine with browsers if you just send a charset header with your pages. For API requests or anything else, you can specify UTF-8 and reject invalid UTF-8 as a Bad Request.
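As a rough sketch of the "reject invalid UTF-8" approach: the built-in `decodeURIComponent` throws a `URIError` when the percent-encoded bytes are not valid UTF-8, which also gives you the "was there a decoding error?" signal the question asks for. The helper name `decodeUtf8Strict` and its return shape are made up for illustration.

```javascript
// Hypothetical helper (not part of any library): strictly decode one
// percent-encoded query value as UTF-8 and report whether it succeeded.
function decodeUtf8Strict(value) {
  try {
    // In form-encoded query strings a literal '+' stands for a space.
    const decoded = decodeURIComponent(value.replace(/\+/g, ' '));
    return { ok: true, value: decoded };
  } catch (err) {
    // decodeURIComponent throws URIError on malformed or non-UTF-8 input.
    if (err instanceof URIError) {
      return { ok: false, value: null };
    }
    throw err;
  }
}

console.log(decodeUtf8Strict('m%C3%B6p'));      // valid UTF-8
console.log(decodeUtf8Strict('T%FCt%20T%FCt')); // ISO-8859-1 bytes, rejected
```

In a request handler you could run this over the raw query values and answer with status 400 whenever `ok` is false.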

If you really mean ISO-8859-1, then it's very simple because the bytes match unicode code points exactly.

'T%FCt%20T%FCt'.replace( /%([a-f0-9]{2})/gi, function( f, m1 ) {
    return String.fromCharCode(parseInt(m1, 16));
});
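If you want the parser itself to apply this, Node's built-in `querystring.parse` accepts an `options.decodeURIComponent` function as its fourth argument. The sketch below assumes that built-in module (the `qs` in the question may be a different package with different options), and `decodeLatin1` is just an illustrative name for the replacement shown above.

```javascript
const querystring = require('querystring');

// Illustrative helper: treat every %XX escape as one ISO-8859-1 byte,
// which maps directly to the Unicode code point of the same value.
function decodeLatin1(value) {
  return value.replace(/%([a-f0-9]{2})/gi, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

// Pass the custom decoder through the options argument.
const parsed = querystring.parse('foo=bar&xyz=T%FCt%20T%FCt', null, null, {
  decodeURIComponent: decodeLatin1,
});

console.log(parsed.foo, parsed.xyz);
```

Keep in mind that this decoder never fails, so it will also "successfully" turn UTF-8 input into mojibake. It only makes sense when you know the client really sends ISO-8859-1.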

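Note, though, that on the web it is probably never really ISO-8859-1 but actually Windows-1252. In that case the bytes 0x80 to 0x9F no longer match Unicode code points and have to be mapped separately.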
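A minimal sketch of the Windows-1252 variant, assuming you want to handle it by hand rather than with an encoding library: the same replacement as above, plus a lookup table for the bytes in the 0x80 to 0x9F range that Windows-1252 assigns to printable characters. The names `CP1252_HIGH` and `decodeWindows1252` are made up for this example; bytes that Windows-1252 leaves unassigned fall through unchanged.

```javascript
// Windows-1252 bytes 0x80-0x9F that differ from ISO-8859-1,
// mapped to their Unicode code points.
const CP1252_HIGH = {
  0x80: 0x20ac, 0x82: 0x201a, 0x83: 0x0192, 0x84: 0x201e, 0x85: 0x2026,
  0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02c6, 0x89: 0x2030, 0x8a: 0x0160,
  0x8b: 0x2039, 0x8c: 0x0152, 0x8e: 0x017d, 0x91: 0x2018, 0x92: 0x2019,
  0x93: 0x201c, 0x94: 0x201d, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
  0x98: 0x02dc, 0x99: 0x2122, 0x9a: 0x0161, 0x9b: 0x203a, 0x9c: 0x0153,
  0x9e: 0x017e, 0x9f: 0x0178,
};

// Decode %XX escapes as Windows-1252 bytes.
function decodeWindows1252(value) {
  return value.replace(/%([a-f0-9]{2})/gi, (match, hex) => {
    const byte = parseInt(hex, 16);
    // Use the table entry if there is one, otherwise the byte value itself.
    return String.fromCharCode(CP1252_HIGH[byte] || byte);
  });
}

console.log(decodeWindows1252('T%FCt%20T%FCt'));
console.log(decodeWindows1252('%80%2010'));
```

For everything outside 0x80 to 0x9F this behaves exactly like the ISO-8859-1 version, so the umlauts from the question decode the same way.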
