Replace Unicode Control Characters

Views: 365 Published: 2018/5/10 20:23:31 Tags: java regex google-maps unicode character-properties


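Question

I need to replace all special control characters in a string in Java.

I want to query the Google Maps API v3, and Google doesn't seem to like these characters.

Example: http://www.google.com/maps/api/geocode/json?sensor=false&address=NEW%20YORK%C2%8F

This URL contains this character: http://www.fileformat.info/info/unicode/char/008f/index.htm

So I receive some data, and I need to geocode it. I know some characters will not pass the geocoding, but I don't know the exact list.

I was not able to find any documentation about this issue, so I think the list of characters that Google doesn't like is this one:
http://www.fileformat.info/info/unicode/category/Cc/list.htm

Is there an existing function that gets rid of these characters, or do I have to build a new one that replaces them one by one?

Or is there a good regex that does the job?

And does anybody know the exact list of characters that Google doesn't like?

Edit: Google has created a web page for this:

https://developers.google.com/maps/documentation/webservices/?hl=fr#BuildingURLs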

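Solution

If you want to delete all characters in the Other/Control Unicode category (Cc), you can do something like this:

  System.out.println(
    "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
  ); // abcd

Note that this removes (among others) the '\u008f' Unicode character itself from the string, not its percent-escaped form "%8F".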
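Because the regex works on the decoded character rather than on "%8F", the cleanup has to happen before the address is percent-encoded (or after an already encoded value has been decoded). The following is a minimal sketch of that ordering, not a definitive implementation. The class name `AddressCleaner` and its methods are hypothetical, and the `Charset` overloads of `URLEncoder.encode` and `URLDecoder.decode` assume Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class AddressCleaner {

    // Removes every character in the Unicode Cc (control) category.
    public static String stripControlChars(String raw) {
        return raw.replaceAll("\\p{Cc}", "");
    }

    // Strips control characters first, then percent-encodes the result
    // so it can be used as a query-string value.
    public static String toQueryValue(String raw) {
        return URLEncoder.encode(stripControlChars(raw), StandardCharsets.UTF_8);
    }

    // For input that is already percent-encoded: decode, clean, re-encode.
    public static String cleanEncoded(String encoded) {
        String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        return toQueryValue(decoded);
    }

    public static void main(String[] args) {
        System.out.println(toQueryValue("NEW YORK\u008f"));    // NEW+YORK
        System.out.println(cleanEncoded("NEW%20YORK%C2%8F"));  // NEW+YORK
    }
}
```

`URLEncoder` produces form-style encoding, so the space comes out as "+" rather than "%20"; whether that matters depends on the service being called.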

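If the blacklist is not nicely captured by a single Unicode block or category, Java has powerful character class arithmetic, featuring intersection, subtraction, and so on, that you can use. Alternatively, you can use a negated whitelist approach: instead of explicitly specifying which characters are illegal, you specify which are legal, and everything else then becomes illegal.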

API links

java.util.regex.Pattern

regular-expressions.info/Character Class

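Examples

Here's a subtraction example:

  System.out.println(
    "regular expressions: now you have two problems!!"
      .replaceAll("[a-z&&[^aeiou]]", "_")
  ); // _e_u_a_ e___e__io__: _o_ _ou _a_e __o __o__e__!!

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.

[a-z&&[^aeiou]] matches [a-z] minus [aeiou], i.e. all lowercase consonants.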
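The same subtraction idiom can be applied to the control-character category. As a sketch, assuming you want to keep tab, line feed and carriage return but drop every other Cc character (that choice of exceptions is an assumption, not something the question specifies), the class name `SelectiveControlStripper` below is hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class SelectiveControlStripper {

    // Cc (control) characters minus tab, line feed and carriage return.
    private static final Pattern UNWANTED_CONTROLS =
            Pattern.compile("[\\p{Cc}&&[^\\t\\n\\r]]");

    public static String strip(String text) {
        return UNWANTED_CONTROLS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The tab and the line feed survive; U+0000 and U+008F are removed.
        String cleaned = strip("a\tb\u0000c\nd\u008fe");
        System.out.println(cleaned.equals("a\tbc\nde")); // true
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.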

The next example shows the negated whitelist approach:

  System.out.println(
    "regular expressions: now you have two problems!!"
      .replaceAll("[^a-z]", "_")
  ); // regular_expressions__now_you_have_two_problems__

Only lowercase letters a-z are legal; everything else is illegal.
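For address data, a negated whitelist might look like the sketch below. The set of allowed characters (Unicode letters, digits, space, comma, period, apostrophe and hyphen) is purely an assumption for illustration. It is not a list documented by Google, and the class name `AddressWhitelist` is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class AddressWhitelist {

    // Matches anything that is NOT a letter, a digit, or one of: space , . ' -
    // The allowed set is an assumption; adjust it to your own data.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{N} ,.'-]");

    public static String keepAllowed(String address) {
        return NOT_ALLOWED.matcher(address).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepAllowed("NEW YORK\u008f, NY 10001")); // NEW YORK, NY 10001
    }
}
```

Using \p{L} and \p{N} rather than [a-z0-9] keeps non-ASCII letters such as "ü" legal, which matters for international place names.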
