Removing all invalid characters (e.g. \uf0b7) from text

Problem description

I currently have several texts coming in that sometimes contain 'invalid characters', e.g. \uf0b7 or \uf077. I have no way of knowing which invalid character codes a specific text might contain, and I wondered whether there is a way to make sure a string is cleaned of all types of 'invalid character', since a later process (which depends on a third-party package) cannot accept a string that contains one.

I've tried searching for a solution, but all I get are answers about regular characters that people want removed (e.g. '^%$&*') and have themselves classified as invalid. What I want is to remove or replace the actual 'invalid character' in all its forms.

Recommended answer

I had a similar issue. It turns out that Private Use Area characters are in the Co general category, as returned by category() in unicodedata.

I solved it with the following:

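import unicodedata

def is_pua(c):
    # 'Co' is the Unicode general category for private-use characters
    return unicodedata.category(c) == 'Co'

content = "This\uf0b7 is a \uf0b7string \uf0c7with private \uf0b7use are\uf0a7as blocks\uf0d7."

# Keep every character that is not in a Private Use Area
"".join([char for char in content if not is_pua(char)])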

This outputs:

'This is a string with private use areas blocks.'
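Since the question asks about removing or replacing, here is a minimal sketch of the same category check with an optional replacement string. The name `clean_text` and its `replacement` and `categories` parameters are hypothetical, introduced only for illustration; the only library call is `unicodedata.category()`, as in the answer above. It assumes the unwanted characters are all in the Co category. Other general categories (for example Cn for unassigned code points) could be added to the tuple if the downstream package rejects those too, but that is an assumption to verify against your own data.

```python
import unicodedata

def clean_text(text, replacement="", categories=("Co",)):
    """Replace every character whose Unicode general category is listed.

    Hypothetical helper: with the defaults it drops private-use ('Co')
    characters, which matches the behaviour of the answer above.
    """
    return "".join(
        replacement if unicodedata.category(ch) in categories else ch
        for ch in text
    )

sample = "Item\uf0b7 one \uf077two"
print(clean_text(sample))        # private-use characters dropped
print(clean_text(sample, "?"))   # private-use characters replaced with "?"
```

Passing a visible replacement such as "?" can be useful while debugging, because it shows where the private-use characters were instead of silently dropping them.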
