How to determine if a String contains invalid encoded characters

Views: 147 Published: 2017/8/16 19:44:04 Tags: java string unicode encoding


Problem description

Usage scenario

We have implemented a webservice that our web frontend developers use (via a php api) internally to display product data. On the website the user enters something (i.e. a query string). Internally the web site makes a call to the service via the api.

Note: We use Restlet, not Tomcat.

Original Problem

Firefox 3.0.10 seems to respect the selected encoding in the browser and encode a url according to the selected encoding. This does result in different query strings for ISO-8859-1 and UTF-8.

Our web site forwards the user's input without converting it (which it should), so the api may end up calling the webservice with a query string that contains German umlauts.

For example, for a query part like

    ...v=abcädef

if "ISO-8859-1" is selected, the sent query part looks like

    ...v=abc%E4def

but if "UTF-8" is selected, the sent query part looks like

    ...v=abc%C3%A4def
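
The difference can be reproduced in plain Java. The sketch below only approximates what the browser does, using java.net.URLEncoder; the class name QueryEncodingDemo and the helper encode are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryEncodingDemo {

    // Percent-encodes a value using the given charset, roughly as a browser
    // would for a page displayed in that encoding.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 and UTF-8 are required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "abcädef", written with an escape so the source file encoding does not matter
        String value = "abc\u00e4def";
        System.out.println("v=" + encode(value, "ISO-8859-1")); // v=abc%E4def
        System.out.println("v=" + encode(value, "UTF-8"));      // v=abc%C3%A4def
    }
}
```

The same character becomes one byte (E4) in ISO-8859-1 and two bytes (C3 A4) in UTF-8, which is why the service sees two different query strings.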

Desired Solution

Since we control the service (we implemented it ourselves), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.

Current Solution In Detail

For each character ( == string.substring(i,i+1) ), check

if character.getBytes()[0] equals 63, i.e. '?'
if Character.getType(character.charAt(0)) returns OTHER_SYMBOL

Code

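// requires: import java.util.ArrayList; import java.util.List;
protected List< String > getNonUnicodeCharacters( String s ) {
  final List< String > result = new ArrayList< String >();
  for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
    final String character = s.substring( i , i + 1 );
    // true if the character's Unicode category is "Symbol, other" (So)
    final boolean isOtherSymbol = 
      ( int ) Character.OTHER_SYMBOL
       == Character.getType( character.charAt( 0 ) );
    // getBytes() uses the platform default charset; a character it cannot
    // encode typically comes back as '?' (byte value 63)
    final boolean isNonUnicode = isOtherSymbol 
      && character.getBytes()[ 0 ] == ( byte ) 63;
    if ( isNonUnicode )
      result.add( character );
  }
  return result;
}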

Question

Will this catch all invalid (non-UTF-8 encoded) characters? Does anyone have a better (easier) solution?

Note: I checked URLDecoder with the following code

final String[] test = new String[]{
  "v=abc%E4def",
  "v=abc%C3%A4def"
};
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
    System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
    System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
}

This prints:

v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def

and it does not throw an IllegalArgumentException (sigh).
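
URLDecoder decodes leniently, which is why no exception appears. If a strict check is wanted, one option is to percent-decode to raw bytes and run them through a CharsetDecoder configured to report errors. This is only a sketch: the class StrictUtf8Check and both helpers are hypothetical names, and percentDecode assumes a well-formed, ASCII-only percent-encoded input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {

    // Turns "%XX" escapes into raw bytes without interpreting them in any charset.
    // Assumption: the input is well-formed (every '%' is followed by two hex digits).
    static byte[] percentDecode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c == '+' ? ' ' : c); // '+' means space in form encoding
            }
        }
        return out.toByteArray();
    }

    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(percentDecode("v=abc%E4def")));    // false
        System.out.println(isValidUtf8(percentDecode("v=abc%C3%A4def"))); // true
    }
}
```

A service that really wants to reject such requests could map a false result to a 4xx response; the check has to run on the raw query string, before any framework has decoded it.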

Solution

I asked the same question: "Handling Character Encoding in URI on Tomcat".

I recently found a solution and it works pretty well for me. You might want to give it a try. Here is what you need to do:

Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
If you have to URL-decode manually, use Latin-1 as the charset there as well.
Use the fixEncoding() function to fix up encodings.

For example, to get a parameter from the query string:

  String name = fixEncoding(request.getParameter("name"));

You can always do this. A string that is already correctly decoded is not changed.

The code is attached. Good luck!

 public static String fixEncoding(String latin1) {
  try {
   byte[] bytes = latin1.getBytes("ISO-8859-1");
   if (!validUTF8(bytes))
    return latin1;   
   return new String(bytes, "UTF-8");  
  } catch (UnsupportedEncodingException e) {
   // Impossible, throw unchecked
   throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
  }

 }

 public static boolean validUTF8(byte[] input) {
  int i = 0;
  // Skip a leading UTF-8 byte order mark (EF BB BF), if present
  if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
    && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
   i = 3;
  }

  int end;
  for (int j = input.length; i < j; ++i) {
   int octet = input[i];
   if ((octet & 0x80) == 0) {
    continue; // ASCII
   }

   // Check for UTF-8 leading byte; end is the index of the last trailing byte
   if ((octet & 0xE0) == 0xC0) {
    end = i + 1;
   } else if ((octet & 0xF0) == 0xE0) {
    end = i + 2;
   } else if ((octet & 0xF8) == 0xF0) {
    end = i + 3;
   } else {
    // Not a valid leading byte (stray trailing byte, or a sequence longer than 4 bytes)
    return false;
   }

   if (end >= j) {
    // Truncated sequence: the input ends before all trailing bytes arrive
    return false;
   }

   while (i < end) {
    i++;
    octet = input[i];
    if ((octet & 0xC0) != 0x80) {
     // Not a valid trailing byte
     return false;
    }
   }
  }
  return true;
 }
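
For comparison, the same fix-up idea can be sketched with the JDK's own strict decoder instead of a hand-written validator. This is an alternative under assumptions, not the code above: the class name FixEncodingSketch is made up, and the input is assumed to have been decoded as Latin-1 so that every char maps back to exactly one byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class FixEncodingSketch {

    // Re-interprets a Latin-1-decoded string as UTF-8 when its bytes are
    // well-formed UTF-8; otherwise returns the string unchanged.
    static String fixEncoding(String latin1) {
        byte[] bytes = latin1.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // A fresh decoder reports malformed input by throwing
            // instead of silently replacing it.
            return StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return latin1;
        }
    }

    public static void main(String[] args) {
        // Browser sent Latin-1 (byte E4): already correct after Latin-1 decoding.
        String fromLatin1 = "abc\u00e4def";
        // Browser sent UTF-8 (bytes C3 A4): shows up as two chars after Latin-1 decoding.
        String fromUtf8 = "abc\u00c3\u00a4def";

        System.out.println(fixEncoding(fromLatin1).equals("abc\u00e4def")); // true
        System.out.println(fixEncoding(fromUtf8).equals("abc\u00e4def"));   // true
    }
}
```

Both inputs end up as the same string. The heuristic still misfires on Latin-1 text whose bytes happen to form valid UTF-8, so it remains a best guess.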

EDIT: Your approach doesn't work, for various reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get � or ?. Other times you get nothing at all and getParameter() returns null. And even if you can check for "?", what happens when your query string contains a valid "?"?
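
The ambiguity is easy to demonstrate outside any container. In the sketch below (class and helper names are hypothetical, and Latin-1 stands in for whatever charset getBytes() would use), a byte sequence that is malformed as UTF-8 decodes to U+FFFD, and both U+FFFD and a genuine question mark encode to byte 63.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {

    // Lenient decoding: malformed input becomes U+FFFD instead of raising an error.
    static String lenientUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // First byte of the string encoded as Latin-1; characters that
    // Latin-1 cannot represent are encoded as '?' (63).
    static int firstLatin1Byte(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1)[0];
    }

    public static void main(String[] args) {
        // E4 is a Latin-1 'ä', but an incomplete sequence when read as UTF-8.
        String broken = lenientUtf8(new byte[] {(byte) 0xE4, 'd'});

        System.out.println(broken.equals("\uFFFDd"));  // true
        System.out.println(firstLatin1Byte(broken));   // 63
        System.out.println(firstLatin1Byte("?"));      // 63
    }
}
```

Once the replacement has happened, a byte value of 63 no longer says whether the user typed "?" or the decoder gave up.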

Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1, and the user has no control over that, so you need to accept both. Switching your servlet to Latin-1 preserves all the bytes as characters, even if they are wrong, which gives us a chance to fix them up or throw them away.

The solution I posted here is not perfect but it's the best one we found so far.
