How to remove all non printable characters in a string?

Question

I imagine I need to remove chars 0-31 and 127.

Is there a function or piece of code to do this efficiently?

Recommended answer

7 bit ASCII?

If your Tardis just landed in 1963, and you just want the 7 bit printable ASCII chars, you can rip out everything from 0-31 and 127-255 with this:

$string = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $string);

It matches any byte in the ranges 0-31 and 127-255 and removes it.
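As a quick illustration of what that pattern does to mixed input, here is a minimal sketch. The function name strip_to_printable_ascii is made up for this example and is not part of PHP. Note that the pattern works byte by byte, so a multi-byte UTF-8 character such as "é" is removed entirely.

```php
<?php
// Hypothetical helper name; the pattern is the 7 bit ASCII one described above.
function strip_to_printable_ascii(string $s): string
{
    // Bytes 0-31 and 127-255 are deleted; 32-126 are kept.
    return preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $s);
}

// Tab, DEL, newline and the two bytes of UTF-8 "é" (0xC3 0xA9) are all removed.
$input = "Hello\tWorld\x7F\n caf\xC3\xA9";
var_dump(strip_to_printable_ascii($input)); // string(14) "HelloWorld caf"
```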

8 bit extended ASCII?

You fell into a Hot Tub Time Machine, and you're back in the eighties. If you've got some form of 8 bit ASCII, then you might want to keep the chars in range 128-255. An easy adjustment: just look for 0-31 and 127.

$string = preg_replace('/[\x00-\x1F\x7F]/', '', $string);
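A small sketch of the difference, assuming the input is ISO-8859-1 (that encoding is my choice for the example, any single-byte 8 bit code page behaves the same way): the high byte survives while the control characters go.

```php
<?php
// "café!" in ISO-8859-1 (é is the single byte 0xE9), with NUL, US and DEL mixed in.
$latin1 = "caf\xE9\x00\x1F\x7F!";

// Only bytes 0-31 and 127 are removed; 128-255 are left alone.
$clean = preg_replace('/[\x00-\x1F\x7F]/', '', $latin1);

// Print as hex so the result is readable on any terminal encoding.
echo bin2hex($clean), "\n"; // 636166e921
```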

UTF-8?

Ah, welcome back to the 21st century. If you have a UTF-8 encoded string, then the /u modifier can be used on the regex:

$string = preg_replace('/[\x00-\x1F\x7F]/u', '', $string);

This just removes 0-31 and 127. This works in ASCII and UTF-8 because both share the same control set range (as noted by mgutt below). Strictly speaking, this would work without the /u modifier. But it makes life easier if you want to remove other chars...
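To make that concrete, here is a minimal sketch with made-up sample strings. On valid UTF-8 the two variants give the same result for this character class. One thing worth knowing before adding /u everywhere: if the subject is not valid UTF-8, preg_replace returns null and preg_last_error() reports PREG_BAD_UTF8_ERROR.

```php
<?php
$pattern = '/[\x00-\x1F\x7F]/';

$valid   = "na\xC3\xAFve\x07 text\n"; // "naïve" in UTF-8 plus BEL and LF
$invalid = "bad\xFFbyte\x07";         // 0xFF never occurs in valid UTF-8

// Same result with and without /u, because UTF-8 never uses bytes 0-127
// inside a multi-byte sequence.
$bytes = preg_replace($pattern, '', $valid);
$utf8  = preg_replace($pattern . 'u', '', $valid);
var_dump($bytes === $utf8); // bool(true)

// With /u, invalid UTF-8 makes the call fail instead of returning a string.
$result = preg_replace($pattern . 'u', '', $invalid);
var_dump($result === null);                          // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```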

If you're dealing with Unicode, there are potentially many non-printing elements, but let's consider a simple one: NO-BREAK SPACE (U+00A0)

In a UTF-8 string, this would be encoded as the two bytes 0xC2 0xA0. You could look for and remove that specific sequence, but with the /u modifier in place, you can simply add \xA0 to the character class:

$string = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string);
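The sketch below (sample string invented for the example) shows why the /u modifier matters here. With /u, \xA0 in the class means the code point U+00A0. Without it, \xA0 means the single byte 0xA0, so the 0xC2 lead byte is left behind and the string is no longer valid UTF-8.

```php
<?php
$nbsp = "\xC2\xA0";                  // UTF-8 encoding of U+00A0
$text = "price:" . $nbsp . "10\x07"; // also contains BEL (0x07)

// With /u: the whole two-byte NO-BREAK SPACE is removed.
$withU = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $text);

// Without /u: only the 0xA0 byte is removed, 0xC2 stays behind.
$withoutU = preg_replace('/[\x00-\x1F\x7F\xA0]/', '', $text);

var_dump($withU === "price:10");        // bool(true)
var_dump($withoutU === "price:\xC210"); // bool(true)
```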

Addendum: What about str_replace?

preg_replace is pretty efficient, but if you're doing this operation a lot, you could build an array of chars you want to remove, and use str_replace as noted by mgutt below, e.g.

//build an array we can re-use across several operations
$badchar=array(
    // control characters
    chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
    chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
    chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
    chr(31),
    // non-printing characters
    chr(127)
);

//replace the unwanted chars
$str2 = str_replace($badchar, '', $str);
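If typing out 33 chr() calls feels error-prone, the same array can be generated. This is just a compact equivalent of the list above, not something from the original answer; the sample string is made up.

```php
<?php
// Build chr(0)..chr(31) plus chr(127) programmatically.
$badchar = array_map('chr', array_merge(range(0, 31), array(127)));

// Sample input containing NUL, tab and DEL.
$str  = "a\x00b\tc\x7Fd";
$str2 = str_replace($badchar, '', $str);

var_dump($str2); // string(4) "abcd"
```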

Intuitively, this seems like it would be fast, but that's not always the case, so you should definitely benchmark to see if it saves you anything. I did some benchmarks across a variety of string lengths with random data, and this pattern emerged using PHP 7.0.12:

     2 chars str_replace     5.3439ms preg_replace     2.9919ms preg_replace is 44.01% faster
     4 chars str_replace     6.0701ms preg_replace     1.4119ms preg_replace is 76.74% faster
     8 chars str_replace     5.8119ms preg_replace     2.0721ms preg_replace is 64.35% faster
    16 chars str_replace     6.0401ms preg_replace     2.1980ms preg_replace is 63.61% faster
    32 chars str_replace     6.0320ms preg_replace     2.6770ms preg_replace is 55.62% faster
    64 chars str_replace     7.4198ms preg_replace     4.4160ms preg_replace is 40.48% faster
   128 chars str_replace    12.7239ms preg_replace     7.5412ms preg_replace is 40.73% faster
   256 chars str_replace    19.8820ms preg_replace    17.1330ms preg_replace is 13.83% faster
   512 chars str_replace    34.3399ms preg_replace    34.0221ms preg_replace is  0.93% faster
  1024 chars str_replace    57.1141ms preg_replace    67.0300ms str_replace  is 14.79% faster
  2048 chars str_replace    94.7111ms preg_replace   123.3189ms str_replace  is 23.20% faster
  4096 chars str_replace   227.7029ms preg_replace   258.3771ms str_replace  is 11.87% faster
  8192 chars str_replace   506.3410ms preg_replace   555.6269ms str_replace  is  8.87% faster
 16384 chars str_replace  1116.8811ms preg_replace  1098.0589ms preg_replace is  1.69% faster
 32768 chars str_replace  2299.3128ms preg_replace  2222.8632ms preg_replace is  3.32% faster

The timings themselves are for 10000 iterations, but what's more interesting is the relative differences. Up to 512 chars, I was seeing preg_replace always win. In the 1-8kb range, str_replace had a marginal edge.

I thought it was an interesting result, so I'm including it here. The important thing is not to take this result and use it to decide which method to use, but to benchmark against your own data and then decide.
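If you want a starting point for that, here is a rough sketch of a timing loop. It is not the harness used for the figures above: the time_it helper, the iteration count and the string lengths are all placeholders, and random_bytes is just a convenient stand-in for your real data.

```php
<?php
// Hypothetical harness; swap in your own data and lengths.
$badchar = array_map('chr', array_merge(range(0, 31), array(127)));
$pattern = '/[\x00-\x1F\x7F]/';

// Runs $fn $iterations times and returns the elapsed time in milliseconds.
function time_it(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) * 1000.0;
}

$iterations = 1000;
foreach (array(16, 512, 4096) as $length) {
    $data = random_bytes($length); // random input, roughly 13% removable bytes

    $strMs = time_it(function () use ($data, $badchar) {
        str_replace($badchar, '', $data);
    }, $iterations);

    $pregMs = time_it(function () use ($data, $pattern) {
        preg_replace($pattern, '', $data);
    }, $iterations);

    printf("%6d chars  str_replace %10.4fms  preg_replace %10.4fms\n",
        $length, $strMs, $pregMs);
}
```

Run it a few times and on the PHP version you actually deploy, since the crossover point can move between versions and data sets.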
