Handling Character Encoding in URI on Tomcat


Question

On the web site I am trying to help with, a user can type a URL into the browser that contains, for example, Chinese characters:

  http://localhost:8080?a=测试

On the server, we get

  GET /a=%E6%B5%8B%E8%AF%95 HTTP/1.1

As you can see, the value is UTF-8 encoded and then URL encoded. We can handle this correctly by setting the encoding to UTF-8 in Tomcat.
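To illustrate the two-step encoding, here is a minimal Java sketch that reverses it for the value shown above. In Tomcat itself, this is typically configured with the `URIEncoding="UTF-8"` attribute on the `<Connector>` element in `server.xml`. The class and method names below are hypothetical and only serve the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class Utf8QueryDemo {

    // Percent-decodes a query value and interprets the resulting bytes as UTF-8.
    // Hypothetical helper, not part of Tomcat.
    static String decodeUtf8(String rawValue) {
        try {
            return URLDecoder.decode(rawValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `Utf8QueryDemo.decodeUtf8("%E6%B5%8B%E8%AF%95")` returns the original two Chinese characters.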

However, with certain browsers we sometimes get Latin-1 encoding instead:

  http://localhost:8080?a=ß

becomes

  GET /a=%DF HTTP/1.1

Is there any way to handle this correctly in Tomcat? It looks like the server has to do some intelligent guessing. We don't expect to handle Latin-1 correctly 100% of the time, but anything is better than what we do now, which is to assume everything is UTF-8.

The server is Tomcat 5.5. The supported browsers are IE 6+, Firefox 2+ and Safari on iPhone.

Recommended answer

Unfortunately, UTF-8 encoding is only a "should" in the URI specification, which seems to assume that the origin server will generate all URLs in such a way that they will be meaningful to the destination server.

There are a couple of techniques that I would consider; all involve parsing the query string yourself (although you may know better than I whether setting the request encoding affects the query-string-to-parameter mapping or just the body).

First, examine the query string for single "high bytes": a valid UTF-8 sequence for a non-ASCII character must have two or more bytes, so a lone byte such as %DF cannot be UTF-8 (the Wikipedia entry has a nice table of valid and invalid bytes).
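One way to apply this check is to percent-decode the query value to raw bytes, try a strict UTF-8 decode, and fall back to Latin-1 only when the bytes are not valid UTF-8. The following is a sketch under stated assumptions, not a definitive implementation: the class and method names are hypothetical, Java 7 or later is assumed (for `StandardCharsets`), and the raw string is assumed to come from something like `HttpServletRequest.getQueryString()`, which returns the undecoded query.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class GuessingQueryDecoder {

    // Turns "%XX" escapes and '+' into raw bytes without choosing a charset yet.
    // Assumption: the raw value is ASCII, as it is on the wire.
    static byte[] percentDecode(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%' && i + 2 < raw.length()) {
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            // A malformed escape falls through and is kept literally.
            out.write(c == '+' ? ' ' : c);
        }
        return out.toByteArray();
    }

    // Tries strict UTF-8 first; falls back to ISO-8859-1 if the bytes are malformed.
    static String decode(String raw) {
        byte[] bytes = percentDecode(raw);
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // A lone high byte such as 0xDF lands here.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

With this sketch, `decode("%DF")` yields "ß" and `decode("%E6%B5%8B%E8%AF%95")` yields the two Chinese characters from the question. It remains a heuristic: some Latin-1 byte sequences happen to be valid UTF-8 and would be decoded as UTF-8.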

Less reliable would be to look at the "Accept-Charset" header in the request. I don't think this header is required (I haven't looked at the HTTP spec to verify), and I know that Firefox, at least, will send a whole list of acceptable values. Picking the first value in the list might work, or it might not.
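If you want to experiment with the header anyway, a rough sketch could pick the supported charset with the highest q-value instead of blindly taking the first entry. The class and method names are hypothetical, and note that the header describes what the client accepts in responses, so treating it as a hint about URL encoding is an assumption that may not hold.

```java
import java.nio.charset.Charset;

final class AcceptCharsetHint {

    // Returns the supported charset with the highest q-value in the header,
    // or the given fallback when the header is absent or names nothing usable.
    static Charset preferred(String header, Charset fallback) {
        if (header == null) {
            return fallback;
        }
        Charset best = fallback;
        double bestQ = 0.0;
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            String name = pieces[0].trim();
            double q = 1.0; // q defaults to 1 when not given
            for (int i = 1; i < pieces.length; i++) {
                String p = pieces[i].trim();
                if (p.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(p.substring(2).trim());
                    } catch (NumberFormatException e) {
                        q = 0.0; // unparsable weight: ignore this entry
                    }
                }
            }
            // The wildcard carries no information; earlier entries win ties.
            if (name.isEmpty() || name.equals("*") || q <= bestQ) {
                continue;
            }
            try {
                if (Charset.isSupported(name)) {
                    best = Charset.forName(name);
                    bestQ = q;
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: skip it.
            }
        }
        return best;
    }
}
```

For a header such as `ISO-8859-1,utf-8;q=0.7,*;q=0.7`, this returns ISO-8859-1; the result could then be used as the fallback charset when strict UTF-8 decoding fails.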

Finally, have you done any analysis on the logs, to see if a particular user-agent will consistently use this encoding?
