Removing all invalid characters (e.g. \uf0b7) from text

Views: 144 Published: 2021/4/28 20:45:34 Tags: python python-3.x string nlp data-cleaning


Question

I currently receive several texts that sometimes contain 'invalid characters' such as \uf0b7 or \uf077. I have no way of knowing which of these character codes a given text might contain, so I am wondering whether there is a way to make sure a string is cleaned of every kind of 'invalid character'. A later step in the process, which depends on a third-party package, cannot accept a string that contains them.

I've tried searching for a solution, but all I find are answers about ordinary characters that people want removed (e.g. '^%$&*') and have themselves classified as invalid. What I want is to remove or replace the actual 'invalid character' in all its forms.

Recommended answer

I had a similar issue. It turns out that private use area characters are in the Co general category, as returned by category() in unicodedata.
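As a quick check of that claim, here is a minimal sketch that looks up the general category of the two code points from the question next to an ordinary letter. The sample list and the `categories` name are chosen here only for illustration.

```python
import unicodedata

# The two code points from the question, plus an ordinary letter for comparison.
samples = ["\uf0b7", "\uf077", "A"]

# Map each code point (shown as U+XXXX) to its Unicode general category.
categories = {f"U+{ord(c):04X}": unicodedata.category(c) for c in samples}
print(categories)
```

Both private use code points report Co, while the letter reports Lu (uppercase letter), so the category test separates them cleanly.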

I solved it as follows:

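import unicodedata

def is_pua(c):
    # 'Co' is the Unicode general category for private use characters.
    return unicodedata.category(c) == 'Co'

content = "This\uf0b7 is a \uf0b7string \uf0c7with private \uf0b7use are\uf0a7as blocks\uf0d7."

# Keep only the characters that are not in a private use area.
"".join(char for char in content if not is_pua(char))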

This outputs:

'This is a string with private use areas blocks.'
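The question also asks about replacing rather than removing. One possible variant of the same idea is sketched below. It assumes a hypothetical helper named `clean_text` (not part of `unicodedata`) that takes the general categories to filter and an optional replacement string. Whether to add further categories such as Cn (unassigned) or Cs (surrogate) depends on what the downstream package actually rejects, which the question does not say.

```python
import unicodedata

def clean_text(text, categories=("Co",), replacement=""):
    """Replace every character whose general category is in `categories`.

    Hypothetical helper for illustration. "Co" is private use; other
    categories can be passed if the downstream package rejects them too.
    """
    return "".join(
        replacement if unicodedata.category(ch) in categories else ch
        for ch in text
    )

content = "This\uf0b7 is a \uf0b7string"

removed = clean_text(content)                    # drop private use characters
marked = clean_text(content, replacement="?")    # or substitute a placeholder

print(repr(removed))
print(repr(marked))
```

Because the check is on the category rather than on a fixed code point range, it also covers the supplementary private use planes, not only the U+E000 to U+F8FF block the examples come from.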
