Remove non-ASCII non-printable characters from a String

Problem description

I receive user input that includes non-ASCII characters and non-printable characters, such as:

\xc2d
\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0

For example:

email : abc@gmail.com\xa0\xa0
street : 123 Main St.\xc2\xa0

Expected output:

  email : abc@gmail.com
  street : 123 Main St.

What is the best way to remove them using Java?

I tried the following, but it doesn't seem to work:

public class Main {
    public static void main(String[] args) {
        String s = "abc@gmail\\xe9.com";
        String email = "abc@gmail.com\\xa0\\xa0";

        System.out.println(s.replaceAll("\\P{Print}", ""));
        System.out.println(email.replaceAll("\\P{Print}", ""));
    }
}

Output:

abc@gmail\xe9.com
abc@gmail.com\xa0\xa0


Recommended answer

Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume you mean that you want to remove any non-ASCII, non-printable characters.

String clean = str.replaceAll("\\P{Print}", "");

Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is needed because \ starts an escape sequence in Java string literals.)
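As a minimal sketch of that expression applied to strings that really do contain such characters: the class and method names below are hypothetical, and the sample inputs use Java \u escapes to embed an actual no-break space (U+00A0), a tab, and a "ç" (U+00E7).

```java
public class PrintableAsciiFilter {
    // Hypothetical helper: removes every character that is not printable ASCII.
    static String stripNonPrintableAscii(String input) {
        return input.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // Two real no-break spaces (U+00A0) at the end, not the text "\xa0".
        String email = "abc@gmail.com\u00a0\u00a0";
        // A tab (control character) and a real U+00E7 at the end.
        String street = "123 Main St.\t\u00e7";

        System.out.println("[" + stripNonPrintableAscii(email) + "]");  // [abc@gmail.com]
        System.out.println("[" + stripNonPrintableAscii(street) + "]"); // [123 Main St.]
    }
}
```

Ordinary spaces and ASCII punctuation survive because they belong to \p{Print}; only the control character and the non-ASCII characters are dropped.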

Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.

This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.
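To illustrate that guess, here is a rough sketch in which a hypothetical escapeForDisplay method stands in for whatever component renders the data (it is not taken from any real logging framework). It shows why the regular expression works on the real data but appears to do nothing when the test string is typed in from the displayed form.

```java
public class EscapedVersusReal {
    // Hypothetical stand-in for a component that prints non-printable or
    // non-ASCII characters as a \xHH-style escape.
    static String escapeForDisplay(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c >= 0x20 && c <= 0x7e) {
                out.append(c);                                 // printable ASCII: keep as-is
            } else {
                out.append(String.format("\\x%02x", (int) c)); // everything else: escape
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String real = "abc@gmail.com\u00a0\u00a0"; // contains two actual U+00A0 characters
        String shown = escapeForDisplay(real);     // what a log line might look like

        System.out.println(shown);                               // abc@gmail.com\xa0\xa0
        System.out.println(real.replaceAll("\\P{Print}", ""));   // abc@gmail.com
        System.out.println(shown.replaceAll("\\P{Print}", ""));  // abc@gmail.com\xa0\xa0
    }
}
```

The last line is unchanged because the displayed form consists only of printable ASCII: a backslash, an x, and two hex digits.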

That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.

String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
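A small sketch of this second expression applied to the literal strings from the question; the class and method names are made up for illustration.

```java
public class LiteralEscapeStripper {
    // Hypothetical helper: removes a literal backslash, an 'x', and exactly
    // two hexadecimal digits wherever that four-character sequence appears.
    static String stripLiteralHexEscapes(String input) {
        return input.replaceAll("\\\\x\\p{XDigit}{2}", "");
    }

    public static void main(String[] args) {
        // In these literals "\\x" is a real backslash followed by 'x'.
        System.out.println(stripLiteralHexEscapes("abc@gmail\\xe9.com"));      // abc@gmail.com
        System.out.println(stripLiteralHexEscapes("abc@gmail.com\\xa0\\xa0")); // abc@gmail.com
        System.out.println(stripLiteralHexEscapes("123 Main St.\\xc2\\xa0"));  // 123 Main St.
    }
}
```

Because the pattern consumes exactly two hex digits, an input such as \xc2d loses only \xc2 and keeps the trailing d.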






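The documentation for the Pattern class does a good job of listing all of the syntax that Java's regular expression library supports. For a more detailed explanation of what all the syntax means, I have found the Regular-Expressions.info site very helpful.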
