Replace Unicode Control Characters


Problem description





I need to replace all special control characters in a string in Java.

I want to query the Google Maps API v3, and Google doesn't seem to like these characters.

Example: http://www.google.com/maps/api/geocode/json?sensor=false&address=NEW%20YORK%C2%8F

This URL contains this character: http://www.fileformat.info/info/unicode/char/008f/index.htm

So I receive some data, and I need to geocode it. I know some characters will not pass the geocoding, but I don't know the exact list.

I was not able to find any documentation about this issue, so I think the list of characters that Google doesn't like is this one: http://www.fileformat.info/info/unicode/category/Cc/list.htm

Is there an existing function to get rid of these characters, or do I have to write a new one that replaces them one by one?

Or is there a good regexp to get the job done?

And does somebody know which exact list of characters Google doesn't like?

Edit: Google has created a web page for this:

https://developers.google.com/maps/documentation/webservices/?hl=fr#BuildingURLs

Solution

If you want to delete all characters in the Other/Control (Cc) Unicode category, you can do something like this:

    System.out.println(
        "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
    ); // abcd

Note that this removes the actual '\u008f' Unicode character (among others) from the string, not the percent-escaped text "%8F".
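Because the example URL in the question carries the character in percent-encoded form, one way to apply the replacement is to decode, strip, and re-encode. The following is a minimal sketch under assumptions: `GeocodeAddressCleaner` and `cleanEncodedAddress` are hypothetical names, the input is assumed to be UTF-8 percent-encoded, and the `Charset` overloads of `URLDecoder.decode` and `URLEncoder.encode` require Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class GeocodeAddressCleaner {

    // Hypothetical helper (not part of any Google API): takes a percent-encoded
    // address value, removes Unicode control characters, and re-encodes it.
    static String cleanEncodedAddress(String encoded) {
        // Decode first so that "%C2%8F" becomes the actual U+008F character.
        String raw = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        // Remove every character in the Cc (Other, Control) category.
        String cleaned = raw.replaceAll("\\p{Cc}", "");
        // Re-encode as a query parameter value; URLEncoder writes spaces as '+'.
        return URLEncoder.encode(cleaned, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(cleanEncodedAddress("NEW%20YORK%C2%8F")); // NEW+YORK
    }
}
```

If the data arrives as a plain Java string rather than as part of a URL, the decode step is unnecessary and only the `replaceAll` call is needed before encoding.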

If the blacklist is not neatly captured by one Unicode block or category, Java has powerful character class arithmetic (intersection, subtraction, etc.) that you can use. Alternatively, you can use a negated whitelist approach: instead of explicitly specifying which characters are illegal, you specify which are legal, and everything else becomes illegal.



Examples

Here's a subtraction example:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[a-z&&[^aeiou]]", "_")
    );
    //   _e_u_a_ e___e__io__: _o_ _ou _a_e __o __o__e__!!

[…] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any single character except the lowercase vowels.

[a-z&&[^aeiou]] matches [a-z] minus [aeiou], i.e. all lowercase consonants.
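The same subtraction can be applied to the control-character case, for example to drop control characters while keeping tabs and line breaks. This is only a sketch: `ControlCharFilter` and `stripControlsKeepWhitespace` are hypothetical names, and which control characters are worth keeping is an assumption, not something the question or Google's documentation specifies.

```java
import java.util.regex.Pattern;

class ControlCharFilter {

    // Control characters (Cc) minus tab, line feed and carriage return.
    private static final Pattern UNWANTED_CONTROLS =
        Pattern.compile("[\\p{Cc}&&[^\\t\\n\\r]]");

    // Hypothetical helper: removes control characters but keeps \t, \n and \r.
    static String stripControlsKeepWhitespace(String input) {
        return UNWANTED_CONTROLS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String cleaned = stripControlsKeepWhitespace("a\tb\u0000c\nd\u008f");
        System.out.println(cleaned.equals("a\tbc\nd")); // true
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.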

The next example shows the negated whitelist approach:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[^a-z]", "_")
    );
    //   regular_expressions__now_you_have_two_problems__

Only lowercase letters a-z are legal; everything else is illegal.
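For address data, a negated whitelist might look like the sketch below. The allowed set (letters, digits, space separators, and a few punctuation marks) is purely an assumption for illustration, since the question notes that the exact list of characters Google rejects is not documented; `AddressWhitelist` and `keepAllowed` are hypothetical names.

```java
import java.util.regex.Pattern;

class AddressWhitelist {

    // Assumed whitelist: Unicode letters, digits, space separators,
    // and the punctuation , . ' - (the trailing '-' is a literal hyphen).
    private static final Pattern NOT_ALLOWED =
        Pattern.compile("[^\\p{L}\\p{N}\\p{Zs},.'-]");

    // Hypothetical helper: removes every character outside the whitelist.
    static String keepAllowed(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepAllowed("NEW YORK\u008f, NY 10001\u0000"));
        // NEW YORK, NY 10001
    }
}
```

A whitelist like this removes far more than control characters, so it is worth checking it against real addresses before relying on it.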
