Removing control characters from a string in python
Problem description
I currently have the following code
def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
        i += 1
    return line
This just does not work if there is more than one character to be deleted.
There are hundreds of control characters in Unicode. If you are sanitizing data from the web or some other source that might contain non-ASCII characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the Unicode category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C".
This snippet removes all control characters from a string.
import unicodedata

def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")
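One thing to keep in mind: tab, newline and carriage return are also in category Cc, so the snippet above strips them along with everything else. If you would rather keep some of that whitespace, a minimal sketch might look like the following. The function name remove_control_characters_keep and its keep parameter are hypothetical and only here for illustration.

```python
import unicodedata

def remove_control_characters_keep(s, keep="\t\n\r"):
    # Hypothetical variant of the snippet above: drop every character whose
    # category starts with "C", except the ones explicitly listed in `keep`.
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch)[0] != "C"
    )

# NUL (Cc) and zero width space (Cf) are removed; tab and newline survive.
sample = "a\x00b\tc\u200bd\n"
cleaned = remove_control_characters_keep(sample)
```

Passing keep="" should behave the same as the original remove_control_characters.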
Examples of unicode categories:
>>> from unicodedata import category
>>> category('\r') # carriage return --> Cc : control character
'Cc'
>>> category('\0') # null character ---> Cc : control character
'Cc'
>>> category('\t') # tab --------------> Cc : control character
'Cc'
>>> category(' ') # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A') # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',') # comma -----------> Po : punctuation
'Po'
>>>
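Before filtering real data, it can help to see which categories actually occur in a string, so you know what a "C" filter would drop. A rough sketch, assuming a helper called count_categories (a made-up name, not part of unicodedata):

```python
import unicodedata
from collections import Counter

def count_categories(s):
    # Hypothetical helper: tally the two-letter Unicode category of each character.
    return Counter(unicodedata.category(ch) for ch in s)

# "A", space, "b", tab, zero width space, comma
counts = count_categories("A b\t\u200b,")

# Characters the "C" filter would remove: every category starting with "C".
removed = sum(n for cat, n in counts.items() if cat.startswith("C"))
```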