How to determine if a String contains invalid encoded characters
Question
Use case
We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the website the user enters something (i.e. a query string). Internally the website calls the service via the API.
Note: we are using Restlet, not Tomcat.
Original problem
Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes a URL according to it. This results in different query strings for ISO-8859-1 and UTF-8.
Our website forwards the user's input without converting it (which it should), so it may call the web service via the API with a query string that contains German umlauts.
That is, for a query part that looks like
...v=abcädef
if "ISO-8859-1" is selected, the sent query part looks like
...v=abc%E4def
but if "UTF-8" is selected, the sent query part looks like
...v=abc%C3%A4def
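The difference between the two query strings can be reproduced on the server side with the JDK's own URLEncoder. This is only an illustrative sketch; the class name QueryEncodingDemo and the helper encode are made up for this example and are not part of the original service.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, not part of the original web service.
class QueryEncodingDemo {

    // Percent-encodes a value using the named charset, as a browser would for a form field.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            // Both charsets used below are required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "abcädef", written with an escape to avoid source-file encoding issues
        String input = "abc\u00E4def";
        System.out.println(encode(input, "ISO-8859-1")); // abc%E4def
        System.out.println(encode(input, "UTF-8"));      // abc%C3%A4def
    }
}
```

The single Latin-1 byte E4 and the UTF-8 byte pair C3 A4 both stand for "ä", which is why the server cannot tell from the escapes alone which charset the browser used.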
Desired solution
Since we control the service (we implemented it), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.
Details of the current solution
For each character ( == string.substring(i,i+1) ), check
- whether character.getBytes()[0] equals 63, the code for '?'
- whether Character.getType(character.charAt(0)) returns OTHER_SYMBOL
Code
protected List< String > getNonUnicodeCharacters( String s ) {
    final List< String > result = new ArrayList< String >();
    for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
        final String character = s.substring( i , i + 1 );
        final boolean isOtherSymbol =
            ( int ) Character.OTHER_SYMBOL
                == Character.getType( character.charAt( 0 ) );
        final boolean isNonUnicode = isOtherSymbol
            && character.getBytes()[ 0 ] == ( byte ) 63;
        if ( isNonUnicode )
            result.add( character );
    }
    return result;
}
Question
Will this catch all invalid (non-UTF-8 encoded) characters? Does anyone have a better (easier) solution?
Note: I checked URLDecoder with the following code
final String[] test = new String[]{
    "v=abc%E4def",
    "v=abc%C3%A4def"
};
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
    System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
    System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
}
This prints:
v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def
and it does not throw an IllegalArgumentException *sigh*
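URLDecoder silently substitutes malformed input, so it cannot be used to detect it. One way to get a hard failure is a CharsetDecoder configured with CodingErrorAction.REPORT. The sketch below assumes the raw query string is plain ASCII with percent-escapes; the class StrictUtf8Check and both method names are hypothetical and only illustrate the idea.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original service.
class StrictUtf8Check {

    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Recovers the bytes behind the percent-escapes and validates them.
    // Assumption: rawQuery contains only ASCII characters and %XX escapes.
    static boolean isValidUtf8Query(String rawQuery) {
        try {
            // ISO-8859-1 maps each byte to exactly one char, so no byte is lost.
            String latin1 = URLDecoder.decode(rawQuery, "ISO-8859-1");
            return isValidUtf8(latin1.getBytes(StandardCharsets.ISO_8859_1));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8Query("v=abc%E4def"));    // false
        System.out.println(isValidUtf8Query("v=abc%C3%A4def")); // true
    }
}
```

A false result is the point at which a service could answer with a 4xx status, if rejecting such requests is the desired behaviour.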
Recommended answer
I asked the same question and recently found a solution that works pretty well for me. You might want to give it a try. Here is what you need to do:
- Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
- If you have to URL-decode manually, use Latin-1 as the charset there as well.
- Use the fixEncoding() function to fix up the encoding.
For example, to get a parameter from the query string,
String name = fixEncoding(request.getParameter("name"));
You can always do this. A string that is already correctly encoded is not changed.
The code is attached. Good luck!
import java.io.UnsupportedEncodingException;

public static String fixEncoding(String latin1) {
    try {
        // Recover the original bytes; Latin-1 maps each char back to one byte
        byte[] bytes = latin1.getBytes("ISO-8859-1");
        if (!validUTF8(bytes))
            return latin1;
        return new String(bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // Impossible, throw unchecked
        throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
    }
}
public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Skip a leading UTF-8 byte order mark (EF BB BF), if present
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }
    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }
        // Check for UTF-8 leading byte; end is the index of the last trailing byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1; // 2-byte sequence
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2; // 3-byte sequence
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3; // 4-byte sequence
        } else {
            // Not a valid leading byte (stray trailing byte or 0xF8..0xFF)
            return false;
        }
        if (end >= j) {
            // Truncated sequence: the trailing bytes would run past the end of the input
            return false;
        }
        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}
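As an alternative to a hand-written validator, the same "reinterpret only if it is valid UTF-8" idea can be sketched with the JDK's strict CharsetDecoder. This is not the answer's code: the class name Latin1Fixer is hypothetical, and the guard for characters above U+00FF is an addition made for this sketch so that strings that never were Latin-1 are left untouched.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical variant of fixEncoding(), built on the JDK decoder.
class Latin1Fixer {

    static String fixEncoding(String latin1) {
        // Added guard (assumption): a char above U+00FF cannot come from Latin-1 decoding.
        for (int i = 0; i < latin1.length(); i++) {
            if (latin1.charAt(i) > 0xFF) {
                return latin1;
            }
        }
        byte[] bytes = latin1.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // A strict decoder throws on malformed input instead of replacing it.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return latin1; // not UTF-8, so keep the Latin-1 reading
        }
    }

    public static void main(String[] args) {
        // UTF-8 bytes C3 A4 that were decoded as Latin-1 ("Ã¤") become "ä".
        System.out.println(fixEncoding("abc\u00C3\u00A4def").equals("abc\u00E4def")); // true
        // A genuine Latin-1 "ä" is not valid UTF-8 and stays unchanged.
        System.out.println(fixEncoding("abc\u00E4def").equals("abc\u00E4def"));       // true
    }
}
```

Note that the JDK decoder is stricter than a pure bit-pattern check: it also rejects overlong forms and encoded surrogates.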
Your approach doesn't work, for various reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get � or ?. Other times you get nothing at all and getParameter() returns null. And even if you could check for "?", what happens when your query string contains a valid "?"?
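The ambiguity of checking for byte 63 can be seen with plain JDK calls. The sketch below is illustrative only; the class ReplacementDemo and its helper names are invented for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing why a '?' check is unreliable.
class ReplacementDemo {

    // Lenient decoding: malformed UTF-8 input is replaced with U+FFFD.
    static String lenientUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // First byte of s in Latin-1; unmappable characters are encoded as '?' (63).
    static byte firstLatin1Byte(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1)[0];
    }

    public static void main(String[] args) {
        // 'a', Latin-1 "ä" (0xE4), 'b': not a valid UTF-8 sequence
        byte[] latin1Bytes = {'a', (byte) 0xE4, 'b'};
        String decoded = lenientUtf8(latin1Bytes);
        System.out.println(decoded.indexOf('\uFFFD')); // 1
        System.out.println(firstLatin1Byte("\uFFFD")); // 63
        System.out.println(firstLatin1Byte("?"));      // 63
    }
}
```

The replacement character and a literal question mark end up as the same byte, so the byte value alone cannot tell a decoding failure from legitimate input.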
Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1, and the user has no control over that. You need to accept both. Changing your servlet to Latin-1 preserves all the characters, even if they are wrong, which gives us a chance to fix them up or throw them away.
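The claim that Latin-1 preserves every byte can be checked directly: decoding all 256 byte values as ISO-8859-1 and encoding them again returns the original bytes, while the same round trip through UTF-8 does not. The class Latin1RoundTrip below is a hypothetical demo written for this purpose.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: lossless Latin-1 round trip vs. lossy UTF-8 round trip.
class Latin1RoundTrip {

    // Decodes the bytes with the given charset and encodes the result again.
    static byte[] roundTrip(byte[] bytes, Charset charset) {
        return new String(bytes, charset).getBytes(charset);
    }

    // Every possible byte value, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.ISO_8859_1))); // true
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.UTF_8)));      // false
    }
}
```

Because nothing is lost in the Latin-1 pass, the original bytes are still available for a later attempt to reinterpret them as UTF-8.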
The solution I posted here is not perfect, but it's the best one we have found so far.