Removing all invalid characters (e.g. \uf0b7) from text
Question
I currently have several texts coming in that sometimes contain an 'invalid character' such as \uf0b7 or \uf077. I have no way of knowing which of these character codes a given text might contain, so I wondered whether there is a way to make sure a string is cleaned of all types of 'invalid character', since a later process (which depends on a third-party package) cannot accept a string that contains one.
I've tried searching for a solution, but all I get are answers about regular characters that people want removed (e.g. '^%$&*') and have classified as invalid. What I want is to remove or replace the actual 'invalid character' in all its forms.
Recommended answer
I had a similar issue. It turns out that private use area characters are in the Co general category, as returned by category() in unicodedata.
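To see what category() reports, here is a minimal sketch that looks up a few characters. The sample labels and the choice of characters are only illustrative; the two private use code points are the ones mentioned in the question.

```python
import unicodedata

# Illustrative samples: two private use code points from the question,
# plus a few ordinary characters for comparison.
samples = {
    "private use (U+F0B7)": "\uf0b7",
    "private use (U+F077)": "\uf077",
    "lowercase letter": "a",
    "space": " ",
    "bullet (U+2022)": "\u2022",
}

# Map each label to the two-letter general category of its character.
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}

for label, cat in categories.items():
    print(f"{label}: {cat}")
```

Both private use code points report Co, while the ordinary characters fall into other categories, so the category is a usable test for this kind of character.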
I solved it as follows:
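import unicodedata

def is_pua(c):
    # Private use area characters have the general category 'Co'.
    return unicodedata.category(c) == 'Co'

content = "This\uf0b7 is a \uf0b7string \uf0c7with private \uf0b7use are\uf0a7as blocks\uf0d7."
# Keep every character that is not in a private use area.
"".join([char for char in content if not is_pua(char)])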
This outputs:
'This is a string with private use areas blocks.'
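Since the question also asks about replacing rather than removing, the same idea can be generalised a little. The following is only a sketch: the function name strip_categories and its parameters are hypothetical, and whether other categories such as Cs (lone surrogates) should be filtered as well depends on what the third-party package rejects.

```python
import unicodedata

def strip_categories(text, categories=("Co",), replacement=""):
    """Replace every character whose general category is listed.

    Hypothetical helper: with the default empty replacement the
    matching characters are simply removed.
    """
    return "".join(
        replacement if unicodedata.category(ch) in categories else ch
        for ch in text
    )

content = "This\uf0b7 is a \uf0b7string."

# Remove private use characters (same behaviour as the answer above).
cleaned = strip_categories(content)

# Substitute a visible placeholder instead of removing.
replaced = strip_categories(content, replacement="?")

# Optionally widen the filter, e.g. to lone surrogates as well.
no_surrogates = strip_categories("a\ud800b", categories=("Co", "Cs"))
```

Passing the categories as a parameter keeps the filter easy to widen if other kinds of characters turn out to cause trouble downstream.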