Remove non-printable UTF-8 characters, except control characters, from a String


Question

I've got a String containing text, control characters, digits, umlauts (German), and other UTF-8 characters.

I want to strip all UTF-8 characters which are not "part of the language". Special characters like (incomplete list) ":/\ßä,;\n \t" should all be preserved.

Sadly, Stack Overflow strips all of those characters, so I have to attach a picture (link) instead.

Any ideas? Help is much appreciated!

PS: If anybody knows a pasting service which does not kill those special characters, I would happily upload the strings. I just wasn't able to find one.

I think the regex "\P{Cc}" matches all the characters I want to preserve. Could this regex be inverted so that all characters not matching it are returned?

Recommended answer

You have already found Unicode character properties.

You can invert a character property by changing the case of the leading "p".

For example:

\p{L} matches all letters.

\P{L} matches all characters that do not have the letter property.

So if you think \P{Cc} is what you need, then \p{Cc} matches the opposite.
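To make the inversion concrete, here is a minimal sketch. It assumes Java, since the question mentions a String and uses the \p{..} syntax; the class name, the helper strip, and the sample text are made up for illustration. Removing everything \p{L} matches leaves the non-letters, and removing everything \P{L} matches leaves only the letters.

```java
import java.util.regex.Pattern;

class PropertyInversionDemo {
    // Hypothetical helper: removes every character matched by the regex.
    static String strip(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // "Grüße, 123!" written with Unicode escapes to avoid source-encoding issues.
        String sample = "Gr\u00FC\u00DFe, 123!";

        // \p{L} matches letters, so stripping them keeps the non-letters.
        System.out.println(strip("\\p{L}", sample));

        // \P{L} matches non-letters, so stripping them keeps only the letters.
        System.out.println(strip("\\P{L}", sample));
    }
}
```

The same upper-case/lower-case switch works for any property, including Cc.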

For more on Unicode character properties, see regular-expressions.info.

I am fairly sure \p{Cc} is close to what you want, but be careful: it also includes, for example, the tab (0x09), the line feed (0x0A), and the carriage return (0x0D).

But you can create your own character class, like this:

[^\P{Cc}\t\r\n]

This class [^...] is a negated character class, so it matches everything that is not a "non-control character" (a double negation, so it matches control characters) and that is also not a tab, CR, or LF.
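As a rough sketch of how this class could be used to clean a string, again assuming Java (the class name ControlCharFilter, the method clean, and the sample input are hypothetical): every match is replaced with the empty string, so control characters are dropped while tab, CR, and LF survive.

```java
import java.util.regex.Pattern;

class ControlCharFilter {
    // Matches control characters (Unicode category Cc) except tab, CR and LF.
    private static final Pattern UNWANTED = Pattern.compile("[^\\P{Cc}\\t\\r\\n]");

    // Hypothetical helper: deletes every character matched by UNWANTED.
    static String clean(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Contains NUL (U+0000), BEL (U+0007) and ESC (U+001B) among normal text.
        String dirty = "a\u0000b\tc\u0007\r\nd\u001B \u00E4\u00DF";
        String cleaned = clean(dirty);

        // Tab, CR, LF, the space and the umlauts are kept; NUL, BEL and ESC are gone.
        System.out.println(cleaned.equals("ab\tc\r\nd \u00E4\u00DF"));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.replaceAll would otherwise do.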
