Remove all except the Chinese characters with regex?

Question

I have a string containing a sentence written in Chinese.

It contains Chinese characters plus filler such as spaces, commas, and exclamation marks, all encoded in UTF-8.

With a Latin-1 string, I could use preg_replace with [a-zA-Z] to clean it and strip the filler.

How can I keep only the Chinese "alphabet" characters in a Chinese string while removing all the filler?

Recommended answer

According to the Unicode Standard, these are the Unicode ranges of Chinese characters:

Table 12-2. Blocks Containing Han Ideographs

Block                                     Range         Comment
CJK Unified Ideographs                    4E00–9FFF     Common
CJK Unified Ideographs Extension A        3400–4DBF     Rare
CJK Unified Ideographs Extension B        20000–2A6DF   Rare, historic
CJK Unified Ideographs Extension C        2A700–2B73F   Rare, historic
CJK Unified Ideographs Extension D        2B740–2B81F   Uncommon, some in current use
CJK Compatibility Ideographs              F900–FAFF     Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement   2F800–2FA1F   Unifiable variants

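You can use it like this:

preg_replace('/[^\x{4E00}-\x{9FFF}]+/u', '', $string);

or

preg_replace('/\P{Han}+/u', '', $string);

where \P is the negation of \p. PCRE has no \uXXXX escape, so code points are written as \x{XXXX}. The first pattern keeps only the common CJK Unified Ideographs block, while the second keeps every character whose script is Han.

For the full list of supported Unicode script names, see the "Unicode character properties" page of the PHP manual.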
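
To make the two approaches easier to compare, here is a minimal sketch that wraps each one in a function. The function names keepHanByScript and keepHanByRanges and the sample string are assumptions made up for this illustration; they are not part of the original answer. The range-based version simply concatenates the blocks from the table into one character class.

```php
<?php
// Hypothetical helpers, introduced only for this illustration.

/** Keep only characters whose Unicode script is Han. */
function keepHanByScript(string $text): string
{
    // \P{Han} matches any code point that is NOT in the Han script;
    // the /u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/\P{Han}+/u', '', $text);
    if ($result === null) {
        // preg_replace returns null on error, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('preg_replace failed; is the input valid UTF-8?');
    }
    return $result;
}

/** Keep only characters inside the blocks listed in the table. */
function keepHanByRanges(string $text): string
{
    // Single-quoted strings keep the backslashes literal for PCRE.
    $ranges = '\x{4E00}-\x{9FFF}'     // CJK Unified Ideographs
            . '\x{3400}-\x{4DBF}'     // Extension A
            . '\x{20000}-\x{2A6DF}'   // Extension B
            . '\x{2A700}-\x{2B73F}'   // Extension C
            . '\x{2B740}-\x{2B81F}'   // Extension D
            . '\x{F900}-\x{FAFF}'     // Compatibility Ideographs
            . '\x{2F800}-\x{2FA1F}';  // Compatibility Ideographs Supplement

    $result = preg_replace('/[^' . $ranges . ']+/u', '', $text);
    if ($result === null) {
        throw new InvalidArgumentException('preg_replace failed; is the input valid UTF-8?');
    }
    return $result;
}

$sample = '你好, 世界! Hello 123。';
echo keepHanByScript($sample), PHP_EOL;  // 你好世界
echo keepHanByRanges($sample), PHP_EOL;  // 你好世界
```

The script-based version is shorter and follows whatever Unicode data the installed PCRE ships with, whereas the explicit ranges stay fixed to the blocks in the table. The two can disagree on edge cases: for example, 〇 (U+3007) belongs to the Han script but lies outside all of the listed blocks.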
