Remove garbage characters in utf
Question
I am using utf8 to store all my data in MySQL. Before data is inserted into the database I need to strip unwanted characters from the strings. The strings are UTF-8 encoded. I know how to use regex and string replace, but I do not know how to work with Arabic characters.
Sample string that needs to be cleaned : "████ .. الــقــوانين الجديـــدة في قســـم الـعنايـ";
Thanks
Recommended answer
Ok. As @Jonathan Leffler already said, if you can specify the unicode character ranges for the characters that need to be replaced, you can use a regular expression to replace the characters with an empty string.
A Unicode character is specified as \x{FFFF} in an expression (in PHP), where FFFF stands for the character's hexadecimal code point. In addition, you have to set the u modifier so that PHP treats the pattern and the subject string as UTF-8.
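One practical consequence of the u modifier is that the preg_* functions fail on input that is not well-formed UTF-8, so it can be worth checking strings first. The following is a minimal sketch of that check; the function name isValidUtf8 is made up for illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper (the name is an assumption): reports whether $text
// is well-formed UTF-8.
function isValidUtf8(string $text): bool
{
    // With the u modifier PCRE validates the subject string. The empty
    // pattern matches any valid string (preg_match() returns 1), while
    // malformed UTF-8 makes preg_match() return false.
    return preg_match('//u', $text) === 1;
}

var_dump(isValidUtf8("\u{0640}"));  // bool(true)  - a valid Arabic character
var_dump(isValidUtf8("\xC3\x28"));  // bool(false) - broken two-byte sequence
```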
So in the end, you have something like this:
preg_replace('/[\x{FFFF}-\x{FFFF}]+/u','',$string);
where
- /.../u are the delimiters plus the modifier
- [...]+ is a character class plus quantifier, which means: match any of these characters one or more times
- \x{FFFF}-\x{FFFF} is a Unicode character range (obviously you have to provide the right code points/numbers of the characters).
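To make the placeholder pattern concrete, here is a sketch that assumes the unwanted characters in the sample string are the Arabic tatweel (U+0640, the stretching character) and the characters of the Block Elements range (U+2580 to U+259F, which contains the full block U+2588). Both the choice of code points and the function name stripDecorations are assumptions for illustration; adjust the ranges to whatever counts as garbage in your data.

```php
<?php
// Hypothetical helper: the name and the chosen code points are assumptions.
function stripDecorations(string $text): string
{
    // U+0640          ARABIC TATWEEL (stretches words visually)
    // U+2580-U+259F   Block Elements (includes the full block, U+2588)
    $cleaned = preg_replace('/[\x{0640}\x{2580}-\x{259F}]+/u', '', $text);

    // preg_replace() returns null on failure, e.g. when $text is not valid UTF-8.
    if ($cleaned === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $cleaned;
}

// Two full blocks, a space, then an Arabic word stretched with two tatweels.
$input = "\u{2588}\u{2588} \u{0627}\u{0644}\u{0640}\u{0640}\u{0642}";
echo stripDecorations($input), "\n";
```

The dots in the sample string are left alone here; add them to the character class (for example as \.) if they should go too.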
You can also negate the group with a ^. That way you specify the range you want to keep, and everything else is removed:
preg_replace('/[^\x{FFFF}-\x{FFFF}]+/u','',$string);
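As a sketch of the negated form, the example below keeps only characters from the Arabic block (U+0600 to U+06FF) plus the plain space and deletes everything else. The range and the function name keepArabicAndSpaces are assumptions for illustration; widen the class if you also need digits, Latin letters or punctuation.

```php
<?php
// Hypothetical helper: the name and the kept range are assumptions.
function keepArabicAndSpaces(string $text): string
{
    // Keep U+0600-U+06FF (Arabic block) and the space character;
    // everything outside the negated class is removed.
    $cleaned = preg_replace('/[^\x{0600}-\x{06FF} ]+/u', '', $text);

    // preg_replace() returns null on failure, e.g. when $text is not valid UTF-8.
    if ($cleaned === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $cleaned;
}

// A full block, two dots, a space, two Arabic letters, a space and a Latin "x".
echo keepArabicAndSpaces("\u{2588}.. \u{0627}\u{0644} x"), "\n";
```

Note that the tatweel (U+0640) lies inside the Arabic block, so a whitelist like this one keeps it. If it should be dropped as well, it has to be removed in a separate pass.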
More information:
- Regular expressions
- Regular expressions in PHP
- Unicode Charts