How to determine if a String contains invalid encoded characters
Usage scenario
We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the website the user enters something (i.e. a query string). Internally the website calls the service via the API.
Note: We use Restlet, not Tomcat.
Original Problem
Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes a URL accordingly. This results in different query strings for ISO-8859-1 and UTF-8.
Our website forwards the user's input without converting it (which it should do), so it may call the web service through the API with a query string that contains German umlauts.
I.e. for a query part looking like
...v=abcädef
if "ISO-8859-1" is selected, the sent query part looks like
...v=abc%E4def
but if "UTF-8" is selected, the sent query part looks like
...v=abc%C3%A4def
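The difference can be reproduced on the Java side with java.net.URLEncoder. This is only an illustrative sketch: the class name QueryEncodingDemo and the helper encode are made up for the example, and no browser is involved.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class QueryEncodingDemo {
    // Percent-encodes a value using the named charset (hypothetical helper).
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            // Only reachable if the charset name is not supported by the JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "abcädef", written with an escape so the source file encoding does not matter
        String value = "abc\u00e4def";
        System.out.println(encode(value, "ISO-8859-1")); // abc%E4def
        System.out.println(encode(value, "UTF-8"));      // abc%C3%A4def
    }
}
```

The same character becomes one escaped byte in ISO-8859-1 and two escaped bytes in UTF-8, which is why the server cannot tell from the query string alone which charset the sender used.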
Desired Solution
As we control the service, because we've implemented it, we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.
Current Solution In Detail
Check for each character ( == string.substring(i,i+1) )
- if character.getBytes()[0] equals 63 for '?'
- if Character.getType(character.charAt(0)) returns OTHER_SYMBOL
Code
import java.util.ArrayList;
import java.util.List;

protected List<String> getNonUnicodeCharacters(String s) {
    final List<String> result = new ArrayList<String>();
    for (int i = 0, n = s.length(); i < n; i++) {
        final String character = s.substring(i, i + 1);
        // U+FFFD, the replacement character, is in the OTHER_SYMBOL category
        final boolean isOtherSymbol =
            (int) Character.OTHER_SYMBOL
                == Character.getType(character.charAt(0));
        // 63 is '?', the byte produced for characters the platform default charset cannot encode
        final boolean isNonUnicode = isOtherSymbol
            && character.getBytes()[0] == (byte) 63;
        if (isNonUnicode)
            result.add(character);
    }
    return result;
}
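To see what the two conditions in this method actually test, here is a small sketch that looks at a few individual characters. The class ReplacementCharDemo and its helpers are hypothetical names for illustration; explicit charsets are passed instead of the platform default that the bare getBytes() call relies on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {
    // True if c is in the Unicode "Symbol, other" category (the first condition).
    static boolean isOtherSymbol(char c) {
        return Character.getType(c) == Character.OTHER_SYMBOL;
    }

    // First byte of c in the given charset; unmappable characters are replaced by '?' (63).
    static int firstByte(char c, Charset charset) {
        return String.valueOf(c).getBytes(charset)[0] & 0xFF;
    }

    public static void main(String[] args) {
        char replacement = '\uFFFD'; // what a lenient decoder emits for malformed input
        char tradeMark = '\u2122';   // a perfectly valid character, also "Symbol, other"

        System.out.println(isOtherSymbol(replacement));                          // true
        System.out.println(isOtherSymbol('?'));                                  // false
        System.out.println(firstByte(replacement, StandardCharsets.ISO_8859_1)); // 63
        System.out.println(firstByte(replacement, StandardCharsets.UTF_8));      // 239
        System.out.println(isOtherSymbol(tradeMark));                            // true
        System.out.println(firstByte(tradeMark, StandardCharsets.ISO_8859_1));   // 63
    }
}
```

Under these assumptions the check depends on the platform default charset: with a UTF-8 default the replacement character does not start with byte 63, and with a Latin-1 default a legitimate symbol such as the trade mark sign is reported as invalid.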
Question
Will this catch all invalid (non-UTF-8 encoded) characters? Do any of you have a better (easier) solution?
Note: I checked URLDecoder with the following code
final String[] test = new String[] {
    "v=abc%E4def",
    "v=abc%C3%A4def"
};
for (int i = 0, n = test.length; i < n; i++) {
    System.out.println(java.net.URLDecoder.decode(test[i], "UTF-8"));
    System.out.println(java.net.URLDecoder.decode(test[i], "ISO-8859-1"));
}
This prints:
v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def
and it does not throw an IllegalArgumentException (sigh).
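URLDecoder substitutes a replacement character for malformed bytes instead of failing. One possible way to get a hard failure is to percent-decode to raw bytes and run them through a CharsetDecoder configured with CodingErrorAction.REPORT. The sketch below is not part of the original post; StrictUtf8Check, percentDecode and isValidUtf8 are hypothetical names, and the percent-decoder is deliberately minimal (it assumes all unescaped characters are ASCII and does not turn '+' into a space).

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {
    // Turns %XX escapes into raw bytes; every other character is assumed to be ASCII.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '%') {
                out.write(c);
                continue;
            }
            if (i + 2 >= s.length()) {
                throw new IllegalArgumentException("Truncated escape at index " + i);
            }
            int hi = Character.digit(s.charAt(i + 1), 16);
            int lo = Character.digit(s.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Bad escape at index " + i);
            }
            out.write((hi << 4) | lo);
            i += 2; // skip the two hex digits
        }
        return out.toByteArray();
    }

    // True only if the bytes form well-formed UTF-8; malformed input is reported, not replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(percentDecode("v=abc%E4def")));    // false
        System.out.println(isValidUtf8(percentDecode("v=abc%C3%A4def"))); // true
    }
}
```

A service that wants to reject such requests could map a false result to a 4xx response; how that is wired into Restlet is left out here.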
I asked the same question: "Handling Character Encoding in URI on Tomcat".
I recently found a solution that works pretty well for me. You might want to give it a try. Here is what you need to do:
- Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
- If you have to URL-decode manually, use Latin-1 as the charset as well.
- Use the fixEncoding() function to fix up encodings.
For example, to get a parameter from the query string:
String name = fixEncoding(request.getParameter("name"));
You can always do this. A string that already has the correct encoding is not changed.
The code is attached. Good luck!
import java.io.UnsupportedEncodingException;

public static String fixEncoding(String latin1) {
    try {
        // Recover the raw bytes that were decoded as Latin-1
        byte[] bytes = latin1.getBytes("ISO-8859-1");
        if (!validUTF8(bytes))
            return latin1;
        return new String(bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // Impossible, throw unchecked
        throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
    }
}
public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Skip a leading byte order mark (EF BB BF), if present
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }
    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }
        // Check for UTF-8 leading byte; end is the index of the last trailing byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Stray trailing byte or 0xF8-0xFF: not a valid leading byte
            return false;
        }
        if (end >= j) {
            // Truncated sequence: not enough trailing bytes left
            return false;
        }
        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}
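To illustrate how fixEncoding() behaves on the two query variants from the question, here is a self-contained sketch. It is a variant, not the code from this answer: the validity check is swapped for a strict CharsetDecoder so the example fits in one short class, and FixEncodingDemo is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class FixEncodingDemo {
    // Re-interprets a Latin-1 decoded string as UTF-8 when its bytes are well-formed UTF-8.
    static String fixEncoding(String latin1) {
        byte[] bytes = latin1.getBytes(StandardCharsets.ISO_8859_1);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            return latin1; // not UTF-8, so keep the Latin-1 reading
        }
    }

    public static void main(String[] args) {
        // "abc%C3%A4def" decoded as Latin-1 yields "abcÃ¤def"
        String sentAsUtf8 = "abc\u00c3\u00a4def";
        // "abc%E4def" decoded as Latin-1 yields "abcädef"
        String sentAsLatin1 = "abc\u00e4def";

        System.out.println(fixEncoding(sentAsUtf8).equals("abc\u00e4def"));   // true
        System.out.println(fixEncoding(sentAsLatin1).equals("abc\u00e4def")); // true
        System.out.println(fixEncoding("plain ascii"));                       // plain ascii
    }
}
```

Both variants end up as the same string. The heuristic has a known blind spot: a user who really typed the two Latin-1 characters "Ã¤" gets them converted to "ä", because that byte pair happens to be well-formed UTF-8.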
EDIT: Your approach doesn't work for various reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get � or ?. Other times you get nothing at all and getParameter() returns null. And even if you could check for "?", what happens when your query string contains a valid "?"?
Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1. The user has no control over this, so you need to accept both. Changing your servlet to Latin-1 preserves all the characters, even if they are wrong, which gives us a chance to fix them up or throw them away.
The solution I posted here is not perfect, but it's the best one we have found so far.