Remove non-utf8 characters from string

Problem description

I'm having a problem removing non-UTF-8 characters from a string; they are not displayed properly. The bytes look like this: 0x97 0x61 0x6C 0x6F (hex representation).

What is the best way to remove them? A regular expression, or something else?

Recommended answer

Use a regular expression:

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                        # ...one or more times (up to 100 per match)
  )
| .                                 # anything else
/x
END;
// Keep only what group 1 captured; preg_replace returns the cleaned string.
$text = preg_replace($regex, '$1', $text);

The pattern searches for UTF-8 sequences and captures them into group 1. It also matches single bytes that cannot be identified as part of a UTF-8 sequence, but does not capture those. The replacement is whatever was captured into group 1, which effectively removes all invalid bytes.
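To make the capture-and-drop behaviour concrete, here is a minimal sketch that wraps the same pattern in a function and runs it on the bytes from the question. The function name `strip_invalid_utf8` is made up for illustration; the comments inside the pattern are omitted for brevity.

```php
<?php
// Hypothetical helper name; the pattern is the one from the answer.
// Returns null if PCRE reports an error.
function strip_invalid_utf8(string $text): ?string
{
    $regex = <<<'END'
/
  (
    (?: [\x00-\x7F]
    |   [\xC0-\xDF][\x80-\xBF]
    |   [\xE0-\xEF][\x80-\xBF]{2}
    |   [\xF0-\xF7][\x80-\xBF]{3}
    ){1,100}
  )
| .
/x
END;
    // Group 1 is empty when the "anything else" branch matched,
    // so the stray byte is replaced by nothing.
    return preg_replace($regex, '$1', $text);
}

$dirty = "\x97alo";                             // 0x97 0x61 0x6C 0x6F
echo bin2hex(strip_invalid_utf8($dirty)), "\n"; // 616c6f, i.e. "alo"
```

The lone 0x97 is a continuation byte with no lead byte, so it falls through to the `.` branch and is dropped, while "alo" is captured and kept.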

It is also possible to repair the string instead, by re-encoding each invalid byte as a UTF-8 character. But if the errors are random, this could leave some strange symbols.

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                      # ...one or more times (up to 100 per match)
  )
| ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
  if ($captures[1] != "") {
    // Valid byte sequence. Return unmodified.
    return $captures[1];
  }
  elseif ($captures[2] != "") {
    // Invalid byte of the form 10xxxxxx.
    // Encode as 11000010 10xxxxxx.
    return "\xC2".$captures[2];
  }
  else {
    // Invalid byte of the form 11xxxxxx.
    // Encode as 11000011 10xxxxxx.
    return "\xC3".chr(ord($captures[3])-64);
  }
}
// preg_replace_callback returns the repaired string.
$text = preg_replace_callback($regex, "utf8replacer", $text);
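The two encoding branches of the callback can be tried in isolation. The sketch below uses a made-up helper, `encode_stray_byte`, that applies the same arithmetic to a single byte. Because the result is the two-byte UTF-8 form of the code point with the same numeric value, this amounts to reading the stray byte as ISO-8859-1 (Latin-1).

```php
<?php
// Hypothetical helper: re-encodes one stray byte (0x80-0xFF) as the two-byte
// UTF-8 sequence for the code point with the same numeric value.
function encode_stray_byte(string $byte): string
{
    $value = ord($byte);
    if ($value < 0x80) {
        return $byte;                    // ASCII is already valid UTF-8
    }
    if ($value < 0xC0) {
        return "\xC2" . $byte;           // 10xxxxxx -> 11000010 10xxxxxx
    }
    return "\xC3" . chr($value - 64);    // 11xxxxxx -> 11000011 10xxxxxx
}

echo bin2hex(encode_stray_byte("\x97")), "\n"; // c297 (U+0097)
echo bin2hex(encode_stray_byte("\xE9")), "\n"; // c3a9 (U+00E9, "é")
```

This also shows where the "strange symbols" can come from: the 0x97 from the question becomes the control character U+0097. If the text was originally Windows-1252, that byte would have been an em dash (U+2014), which this repair does not recover.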

For testing whether a capture group matched, there are three options:

  • !empty(x) will match non-empty values ("0" is considered empty).
  • x != "" will match non-empty values, including "0".
  • x !== "" will match anything except "".

x != "" seems the best one to use in this case: a valid sequence can consist of the single character "0", which !empty(x) would wrongly treat as empty.
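The difference between the three tests is easy to check directly. The sketch below uses a made-up helper, `emptiness_checks`, that applies all three to one value and reports the results.

```php
<?php
// Hypothetical helper: applies the three tests discussed above to one value.
function emptiness_checks($x): array
{
    return [
        'not_empty' => !empty($x),   // false for "0", "" and null
        'loose'     => $x != "",     // true for "0"; false for "" and null
        'strict'    => $x !== "",    // false only for the string ""
    ];
}

echo json_encode(emptiness_checks("0")), "\n";
// {"not_empty":false,"loose":true,"strict":true}
echo json_encode(emptiness_checks("")), "\n";
// {"not_empty":false,"loose":false,"strict":false}
```

Only `!empty()` misclassifies a captured "0", which is why it is unsuitable for the callback.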

I have also sped up the match a little. Instead of matching each character separately, the pattern matches runs of valid UTF-8 characters (up to 100 at a time).
