Stripping non-printable characters from a string in Python

Question

I used to run

$s =~ s/[^[:print:]]//g;

in Perl to get rid of non-printable characters.

Python has no POSIX regex classes, so I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable or not.

What would you do?

It has to support Unicode characters as well. Filtering against string.printable will happily strip them out of the output, and curses.ascii.isprint returns false for any Unicode character.
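To illustrate the string.printable limitation mentioned above, here is a minimal Python 3 sketch. The helper name keep_ascii_printable is hypothetical and not part of the question. string.printable contains only ASCII characters, so any non-ASCII letter is dropped along with the control characters.

```python
import string

def keep_ascii_printable(s):
    """Keep only characters listed in string.printable (ASCII only)."""
    allowed = set(string.printable)
    return ''.join(c for c in s if c in allowed)

# 'café', a NUL byte, a space, then two CJK characters.
sample = 'caf\u00e9\x00 \u65e5\u672c'
print(repr(keep_ascii_printable(sample)))
```

The NUL byte is removed as intended, but so are the accented letter and the CJK characters, which is exactly the behaviour the question wants to avoid.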

Recommended answer

Unfortunately, iterating over strings is rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing; you just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See the Unicode Character Database for descriptions of the categories.

import unicodedata, re, itertools, sys

# Scan every code point; + 1 so that sys.maxunicode itself is included.
all_chars = (chr(i) for i in range(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently (for Cc only) and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)
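As a rough check of the approach above, the following Python 3 sketch rebuilds the same regex, confirms that it agrees with a per-character loop, and times both. The function remove_control_chars_loop and the sample string are assumptions added for illustration; actual timings depend on the input and the interpreter, so none are quoted here.

```python
import itertools
import re
import timeit
import unicodedata

# Cc code points: U+0000..U+001F and U+007F..U+009F (same ranges as above).
control_chars = ''.join(map(chr, itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

def remove_control_chars_loop(s):
    # Hypothetical baseline: look up the category of every character.
    return ''.join(c for c in s if unicodedata.category(c) != 'Cc')

# Mixed ASCII, accented, and CJK text with a few control characters.
sample = 'na\u00efve\x00 text\x07 \u65e5\u672c\u8a9e\x9f\n' * 1000

# Both implementations should produce the same result.
assert remove_control_chars(sample) == remove_control_chars_loop(sample)

for fn in (remove_control_chars, remove_control_chars_loop):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(f'{fn.__name__}: {seconds:.4f} s')
```

Note that tab, newline, and carriage return belong to category Cc, so they are stripped too. If they should survive, leave them out of the character class.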

For Python 2:

import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For some use cases, additional categories (e.g. all of the categories in the control group listed below) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:

  • Cc (control): 65
  • Cf (format): 161
  • Cs (surrogate): 2048
  • Co (private-use): 137468
  • Cn (unassigned): 836601
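The counts above can be recomputed for whichever Unicode version your Python ships with (see unicodedata.unidata_version); the Cf and Cn figures in particular change between versions. The Python 3 sketch below also shows one possible way to limit the memory cost of larger categories by writing the character class as code point ranges instead of listing every character. The names category_counts and build_class_re are hypothetical helpers, not part of the answer.

```python
import collections
import re
import sys
import unicodedata

def category_counts(prefix='C'):
    """Count code points per general category whose name starts with prefix."""
    counts = collections.Counter(
        unicodedata.category(chr(i)) for i in range(sys.maxunicode + 1)
    )
    return {cat: n for cat, n in sorted(counts.items()) if cat.startswith(prefix)}

def build_class_re(categories):
    """Compile a character class for the given categories, written as ranges."""
    parts = []
    start = None
    # One extra iteration past sys.maxunicode flushes a run that reaches the end.
    for i in range(sys.maxunicode + 2):
        inside = i <= sys.maxunicode and unicodedata.category(chr(i)) in categories
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            # \UXXXXXXXX escapes are understood by the re module in str patterns.
            parts.append('\\U%08x-\\U%08x' % (start, i - 1))
            start = None
    return re.compile('[%s]' % ''.join(parts))

counts = category_counts()
print(unicodedata.unidata_version, counts)

# Strip control and format characters (U+200B ZERO WIDTH SPACE is Cf).
strip_re = build_class_re({'Cc', 'Cf'})
print(repr(strip_re.sub('', 'a\x00b\u200bc')))
```

Building the pattern scans the whole code point range once, so compile it a single time at import and reuse it.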

Edit: Added suggestions from the comments.
