Handling Character Encoding in URI on Tomcat

Question

On the web site I am trying to help with, a user can type a URL into the browser that contains Chinese characters, like the following:

  http://localhost:8080?a=测试

On the server, we get

  GET /?a=%E6%B5%8B%E8%AF%95 HTTP/1.1

As you can see, it's UTF-8 encoded, then URL encoded. We can handle this correctly by setting the encoding to UTF-8 in Tomcat.

However, with certain browsers we sometimes get Latin-1 encoding instead:

  http://localhost:8080?a=ß

becomes

  GET /?a=%DF HTTP/1.1

Is there any way to handle this correctly in Tomcat? It looks like the server has to do some intelligent guessing. We don't expect to handle Latin-1 correctly 100% of the time, but anything is better than what we are doing now, which is assuming everything is UTF-8.
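
To make the mismatch concrete, here is a minimal Java sketch that decodes the two request values above under each charset. The class and method names are hypothetical and not part of Tomcat; it only uses the JDK's `java.net.URLDecoder`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class (not a Tomcat API): decodes a percent-encoded
// value under a caller-chosen charset.
final class CharsetMismatchDemo {

    static String decodeAs(String percentEncoded, String charsetName) {
        try {
            // URLDecoder turns each %XX run into bytes, then interprets the
            // bytes in the given charset. Malformed input is replaced with
            // U+FFFD rather than rejected.
            return URLDecoder.decode(percentEncoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }
}
```

With this helper, `decodeAs("%DF", "ISO-8859-1")` yields the single character U+00DF (ß), while `decodeAs("%DF", "UTF-8")` yields the replacement character U+FFFD, because a lone 0xDF byte is not a complete UTF-8 sequence.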

The server is Tomcat 5.5. The supported browsers are IE 6+, Firefox 2+, and Safari on iPhone.

Recommended answer

Unfortunately, UTF-8 encoding is a "should" in the URI specification, which seems to assume that the origin server will generate all URLs in such a way that they will be meaningful to the destination server.

There are a couple of techniques that I would consider; all involve parsing the query string yourself (although you may know better than I whether setting the request encoding affects the query-string-to-parameter mapping or just the body).

First, examine the query string for single "high bytes" (bytes with the top bit set): in valid UTF-8, any such byte must be part of a sequence of two or more bytes, so a lone high byte means the value is not UTF-8 (the Wikipedia entry has a nice table of valid and invalid bytes).
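
A minimal Java sketch of this check, under some assumptions: the raw value comes from the undecoded query string (for example, split out of `HttpServletRequest.getQueryString()`), Latin-1 is the only fallback worth trying, and a JDK with `StandardCharsets` (Java 7 or later) is available. The class and method names are hypothetical. Instead of scanning for lone high bytes by hand, it lets a strict `CharsetDecoder` reject anything that is not well-formed UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative, not a Tomcat or servlet API.
final class QueryValueDecoder {

    // Turn a percent-encoded query component into its raw bytes.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length() + 1
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                out.write((hi << 4) | lo);
                i += 2;
            } else if (c == '+') {
                out.write(' '); // form-style encoding of a space
            } else if (c < 0x80) {
                out.write(c);   // plain ASCII passes through unchanged
            } else {
                // Unescaped non-ASCII text: assume it stands for its UTF-8 bytes.
                int cp = s.codePointAt(i);
                byte[] b = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(b, 0, b.length);
                i += Character.charCount(cp) - 1;
            }
        }
        return out.toByteArray();
    }

    // Try strict UTF-8 first; if the bytes are malformed, fall back to Latin-1.
    static String decode(String rawValue) {
        byte[] bytes = percentDecode(rawValue);
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Every byte sequence is valid ISO-8859-1, so this cannot fail.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

The guess is not perfect: a Latin-1 string whose bytes happen to form valid UTF-8 (for example `%C3%9F`) is decoded as UTF-8. Such sequences are uncommon in ordinary Latin-1 text, which is why trying UTF-8 first is usually a reasonable order.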

Less reliable would be to look at the "Accept-Charset" header in the request. I don't think this header is required (haven't looked at the HTTP spec to verify), and I know that Firefox, at least, will send a whole list of acceptable values. Picking the first value in the list might work, or it might not.
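
If you want to experiment with this hint, a rough Java sketch follows. It assumes the header value was obtained with `request.getHeader("Accept-Charset")`, and it picks the named charset with the highest `q` weight, which reduces to "the first value in the list" when the weights tie. The class and method names are hypothetical, and the result is only a hint about what the browser accepts, not proof of how it encoded the URL.

```java
import java.util.Locale;

// Hypothetical helper; names are illustrative only.
final class AcceptCharsetHint {

    // Returns the charset name with the highest q weight (first one wins on
    // ties), or null if the header is missing or names no concrete charset.
    static String preferredCharset(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        String best = null;
        double bestQ = 0.0;
        for (String part : headerValue.split(",")) {
            String[] pieces = part.trim().split(";");
            String name = pieces[0].trim();
            if (name.isEmpty() || name.equals("*")) {
                continue; // the wildcard says nothing about a specific charset
            }
            double q = 1.0; // a missing q parameter means full weight
            for (int i = 1; i < pieces.length; i++) {
                String p = pieces[i].trim().toLowerCase(Locale.ROOT);
                if (p.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(p.substring(2).trim());
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat an unparsable weight as "not acceptable"
                    }
                }
            }
            if (q > bestQ) {
                best = name;
                bestQ = q;
            }
        }
        return best;
    }
}
```

For a header such as `ISO-8859-1,utf-8;q=0.7,*;q=0.7`, this returns `ISO-8859-1`.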

Finally, have you done any analysis on the logs, to see if a particular user-agent will consistently use this encoding?
