Remove all except the Chinese characters with regex?
Question
I have a string that is a sentence written in Chinese.
It contains Chinese characters and other filler, such as spaces, commas, exclamation marks, etc., all encoded in UTF-8.
With a latin1 string, I could use preg_replace with [a-zA-Z] to clean it and remove the filler.
How can I keep only the Chinese "alphabet" characters in the Chinese string while removing all the filler?
Recommended answer
According to this document, here are the Unicode ranges of Chinese characters:
Table 12-2. Blocks Containing Han Ideographs
Block                                     Range          Comment
CJK Unified Ideographs                    4E00–9FFF      Common
CJK Unified Ideographs Extension A        3400–4DBF      Rare
CJK Unified Ideographs Extension B        20000–2A6DF    Rare, historic
CJK Unified Ideographs Extension C        2A700–2B73F    Rare, historic
CJK Unified Ideographs Extension D        2B740–2B81F    Uncommon, some in current use
CJK Compatibility Ideographs              F900–FAFF      Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement   2F800–2FA1F    Unifiable variants
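You can use it like this (PCRE does not support the \uXXXX escape, so code points are written as \x{XXXX}):
preg_replace('/[^\x{4E00}-\x{9FFF}]+/u', '', $string);
or
preg_replace('/\P{Han}+/u', '', $string);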
where \P is the negation of \p. See the PCRE Unicode properties documentation for all Unicode scripts.
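As a rough sketch of how the two approaches compare, the snippet below wraps each pattern in a small helper. The function names keep_han and keep_han_blocks and the sample string are hypothetical and not part of the original answer. The second helper assumes you want every block from Table 12-2, not only the common 4E00–9FFF range.

```php
<?php
// Hypothetical helpers for illustration; neither name comes from the original answer.

// Keep only characters whose Unicode script is Han.
function keep_han(string $text): string
{
    // \P{Han} matches any character NOT in the Han script; the /u flag enables UTF-8 mode.
    // preg_replace returns null on error (for example invalid UTF-8), so fall back to ''.
    return preg_replace('/\P{Han}+/u', '', $text) ?? '';
}

// Keep only characters that fall inside the blocks listed in Table 12-2.
function keep_han_blocks(string $text): string
{
    $pattern = '/[^'
        . '\x{4E00}-\x{9FFF}'    // CJK Unified Ideographs
        . '\x{3400}-\x{4DBF}'    // Extension A
        . '\x{20000}-\x{2A6DF}'  // Extension B
        . '\x{2A700}-\x{2B73F}'  // Extension C
        . '\x{2B740}-\x{2B81F}'  // Extension D
        . '\x{F900}-\x{FAFF}'    // CJK Compatibility Ideographs
        . '\x{2F800}-\x{2FA1F}'  // CJK Compatibility Ideographs Supplement
        . ']+/u';
    return preg_replace($pattern, '', $text) ?? '';
}

$sample = '你好, 世界! Hello 123。';
echo keep_han($sample), PHP_EOL;         // 你好世界
echo keep_han_blocks($sample), PHP_EOL;  // 你好世界
```

The two helpers are not strictly equivalent: the Han script property also covers a few characters outside these blocks, such as the iteration mark 々 (U+3005), so which one fits depends on what you count as a Chinese character.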