How to determine if a String contains invalid encoded characters


Problem description

Usage scenario

We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the website, the user enters something (i.e. a query string). Internally, the website calls the service via the API.

Note: We use Restlet, not Tomcat.

Original Problem

Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes a URL accordingly. This results in different query strings for ISO-8859-1 and UTF-8.

Our website forwards the user's input without converting it (which it should do), so the API may call the web service with a query string that contains German umlauts.

I.e. for a query part looking like

    ...v=abcädef

if "ISO-8859-1" is selected, the sent query part looks like

    ...v=abc%E4def

but if "UTF-8" is selected, the sent query part looks like

    ...v=abc%C3%A4def
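
To reproduce the two forms without a browser, here is a minimal sketch using `java.net.URLEncoder`. The class name `QueryEncodingDemo` and the helper `encode` are made up for illustration; only the charset names and the sample value come from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryEncodingDemo {
    // Hypothetical helper: percent-encodes a value using the named charset,
    // standing in for what the browser does with its selected encoding.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            // Both charsets used below are required on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "abcädef", written with an escape so the source file encoding does not matter
        String value = "abc\u00E4def";
        System.out.println(encode(value, "ISO-8859-1")); // abc%E4def
        System.out.println(encode(value, "UTF-8"));      // abc%C3%A4def
    }
}
```

The same character becomes one byte (E4) in ISO-8859-1 and two bytes (C3 A4) in UTF-8, which is why the server sees two different query strings.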

Desired Solution

As we control the service, because we've implemented it, we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.

Current Solution In Detail

Check for each character ( == string.substring(i,i+1) )

  1. if character.getBytes()[0] equals 63 for '?'
  2. if Character.getType(character.charAt(0)) returns OTHER_SYMBOL

Code

protected List< String > getNonUnicodeCharacters( String s ) {
  final List< String > result = new ArrayList< String >();
  for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
    final String character = s.substring( i , i + 1 );
    final boolean isOtherSymbol = 
      ( int ) Character.OTHER_SYMBOL
       == Character.getType( character.charAt( 0 ) );
    final boolean isNonUnicode = isOtherSymbol 
      && character.getBytes()[ 0 ] == ( byte ) 63;
    if ( isNonUnicode )
      result.add( character );
  }
  return result;
}

Question

Will this catch all invalid (non-UTF-8-encoded) characters? Does anyone have a better (easier) solution?

Note: I checked URLDecoder with the following code

final String[] test = new String[]{
  "v=abc%E4def",
  "v=abc%C3%A4def"
};
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
    System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
    System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
}

This prints:

v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def

and it does not throw an IllegalArgumentException. Sigh.
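
URLDecoder silently replaces malformed bytes, but a `CharsetDecoder` configured with `CodingErrorAction.REPORT` does signal them. The following is only a sketch of that idea, not something from the original post: the class `StrictUtf8Check` and the method `isValidUtf8Query` are hypothetical names, and it assumes the input consists of ASCII characters plus percent escapes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    // Hypothetical helper: true if the percent-encoded text holds well-formed UTF-8.
    static boolean isValidUtf8Query(String encoded) {
        byte[] raw;
        try {
            // ISO-8859-1 maps every byte to exactly one char, so decoding and
            // re-encoding with it recovers the raw bytes behind the escapes.
            raw = URLDecoder.decode(encoded, "ISO-8859-1")
                    .getBytes(StandardCharsets.ISO_8859_1);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // ISO-8859-1 is always available
        }
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8Query("v=abc%E4def"));    // false
        System.out.println(isValidUtf8Query("v=abc%C3%A4def")); // true
    }
}
```

Note that `URLDecoder.decode` itself throws an IllegalArgumentException for broken escapes such as `%zz`, so a caller would still need to handle that case separately.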

Solution

I asked the same question:

Handling Character Encoding in URI on Tomcat

I recently found a solution and it works pretty well for me. You might want to give it a try. Here is what you need to do:

  1. Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
  2. If you have to URL-decode manually, use Latin-1 as the charset as well.
  3. Use the fixEncoding() function to fix up encodings.

For example, to get a parameter from the query string:

  String name = fixEncoding(request.getParameter("name"));

You can apply this unconditionally. A string whose encoding is already correct is not changed.

The code is attached. Good luck!

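 // Requires: import java.io.UnsupportedEncodingException;
 public static String fixEncoding(String latin1) {
  try {
   // Recover the raw bytes: ISO-8859-1 maps chars 0x00-0xFF one-to-one onto bytes
   byte[] bytes = latin1.getBytes("ISO-8859-1");
   if (!validUTF8(bytes))
    return latin1; // Not valid UTF-8, keep the Latin-1 interpretation
   return new String(bytes, "UTF-8");
  } catch (UnsupportedEncodingException e) {
   // Impossible, throw unchecked
   throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
  }
 }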

 public static boolean validUTF8(byte[] input) {
  int i = 0;
  // Check for BOM
  if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
    && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
   i = 3;
  }

  int end;
  for (int j = input.length; i < j; ++i) {
   int octet = input[i];
   if ((octet & 0x80) == 0) {
    continue; // ASCII
   }

   // Check for UTF-8 leading byte
   if ((octet & 0xE0) == 0xC0) {
    end = i + 1;
   } else if ((octet & 0xF0) == 0xE0) {
    end = i + 2;
   } else if ((octet & 0xF8) == 0xF0) {
    end = i + 3;
   } else {
    // Not a valid leading byte (UTF-8 allows at most 3 trailing bytes)
    return false;
   }

   if (end >= j) {
    // Multi-byte sequence is cut off by the end of the input
    return false;
   }

   while (i < end) {
    i++;
    octet = input[i];
    if ((octet & 0xC0) != 0x80) {
     // Not a valid trailing byte
     return false;
    }
   }
  }
  return true;
 }
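
To see the whole approach on the two query strings from the question, here is a self-contained sketch. It is not the code above: to stay short, this stand-in `fixEncoding` delegates the validity check to the JDK's strict UTF-8 decoder instead of `validUTF8`, and the class `FixEncodingDemo` and the method `parameter` are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FixEncodingDemo {
    // Stand-in for fixEncoding(): reinterpret the Latin-1 string as UTF-8
    // if its bytes are well-formed UTF-8, otherwise return it unchanged.
    static String fixEncoding(String latin1) {
        byte[] bytes = latin1.getBytes(StandardCharsets.ISO_8859_1);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return latin1; // not UTF-8, keep the Latin-1 reading
        }
    }

    // Hypothetical helper: steps 2 and 3 of the list above in one call.
    static String parameter(String encoded) {
        try {
            return fixEncoding(URLDecoder.decode(encoded, "ISO-8859-1"));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // ISO-8859-1 is always available
        }
    }

    public static void main(String[] args) {
        String expected = "abc\u00E4def"; // "abcädef"
        System.out.println(parameter("abc%E4def").equals(expected));    // true
        System.out.println(parameter("abc%C3%A4def").equals(expected)); // true
        // Applying the fix a second time leaves the result unchanged
        System.out.println(fixEncoding(expected).equals(expected));     // true
    }
}
```

Both browser encodings end up as the same Java string, so the service does not need to know which one was used.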

EDIT: Your approach doesn't work, for various reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get � or ?. Other times you get nothing at all: getParameter() returns null. And even if you could check for "?", what happens when your query string contains a valid "?"?
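
A small sketch of why the replacement cannot be undone afterwards, using only JDK string conversions. The class `QuestionMarkAmbiguity` and its two helpers are hypothetical names, and the bytes 0xE4 and 0xFC are just sample Latin-1 umlauts.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class QuestionMarkAmbiguity {
    // Lenient decoding, as done by new String(...): malformed input becomes U+FFFD
    static String lenientUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Encoding to a charset that lacks U+FFFD substitutes '?' (byte 63)
    static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String fromE4 = lenientUtf8(new byte[] {(byte) 0xE4}); // Latin-1 "ä"
        String fromFC = lenientUtf8(new byte[] {(byte) 0xFC}); // Latin-1 "ü"

        // Two different bad bytes collapse into the same replacement character
        System.out.println(fromE4.equals("\uFFFD") && fromE4.equals(fromFC)); // true

        // After conversion to Latin-1, the replacement is byte 63,
        // exactly like a question mark the user really typed
        System.out.println(Arrays.equals(toLatin1(fromE4), toLatin1("?")));   // true
    }
}
```

Once the container has substituted a character, the original byte is gone, so checking for byte 63 can neither recover the input nor tell a damaged character from a literal "?".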

Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, a browser may encode the URL in either UTF-8 or Latin-1. The user has no control over this, so you need to accept both. Changing your servlet to Latin-1 preserves all the characters, even if they are wrong, which gives us a chance to fix them up or throw them away.
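
The claim that Latin-1 preserves everything can be checked directly: every one of the 256 byte values survives a round trip through an ISO-8859-1 string, while a lenient UTF-8 round trip loses data. This is only an illustrative sketch; `Latin1RoundTrip` and its methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // All 256 possible byte values, 0x00 through 0xFF
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // True if decoding and re-encoding with the charset returns the same bytes
    static boolean survivesRoundTrip(byte[] original, Charset charset) {
        byte[] back = new String(original, charset).getBytes(charset);
        return Arrays.equals(original, back);
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(survivesRoundTrip(all, StandardCharsets.ISO_8859_1)); // true
        System.out.println(survivesRoundTrip(all, StandardCharsets.UTF_8));      // false
    }
}
```

This is what makes it safe to decode as Latin-1 first and decide about UTF-8 later: the original bytes can always be recovered with `getBytes("ISO-8859-1")`.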

The solution I posted here is not perfect, but it's the best one we have found so far.
