Strip special characters and punctuation from a unicode string

Problem description

I'm trying to remove the punctuation from a unicode string, which may contain non-ascii letters. I tried using the regex module:

import regex
text = u"<Üäik>"
regex.sub(ur"\p{P}+", "", text)

However, I've noticed that the characters < and > don't get removed. Does anyone know why and is there any other way to strip punctuation from unicode strings?

EDIT: Another approach I've tried out is doing:

import string
text = text.encode("utf8").translate(None, string.punctuation).decode("utf8")

but I would like to avoid converting the text from unicode to string and backwards.

Solution

< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match both categories with a single character class:

regex.sub(ur'[\p{P}\p{Sm}]+', '', text)
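To see which general category each character falls into, the standard library's unicodedata module can be queried directly. The following is a minimal sketch assuming Python 3; the helper name categories is made up for illustration and is not part of any library.

```python
import unicodedata


def categories(text):
    """Return (character, general category) pairs for each character in text.

    Hypothetical helper for illustration only.
    """
    return [(ch, unicodedata.category(ch)) for ch in text]


# '<' and '>' report Sm (Symbol, math), while '!' reports Po (Punctuation, other).
for ch, cat in categories("<Üäik>!"):
    print(repr(ch), cat)
```

Any character whose category does not start with P is skipped by \p{P}, which is why the angle brackets survived the original substitution.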

The unicode.translate() method is another option. It takes a dictionary that maps integer codepoints to other integer codepoints, to a unicode string, or to None; mapping a codepoint to None deletes it. Use ord() to turn string.punctuation into codepoints:

text.translate(dict.fromkeys(ord(c) for c in string.punctuation))

That removes only the limited set of ASCII punctuation characters.
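The limitation is easy to see with a string that contains non-ASCII punctuation. This is a small sketch assuming Python 3, where str.translate() accepts the same kind of mapping; the function name strip_ascii_punctuation is hypothetical.

```python
import string

# Map every ASCII punctuation codepoint to None, which tells translate() to delete it.
ascii_punct_table = dict.fromkeys(ord(c) for c in string.punctuation)


def strip_ascii_punctuation(text):
    """Remove only the characters listed in string.punctuation (illustrative helper)."""
    return text.translate(ascii_punct_table)


print(strip_ascii_punctuation("<Üäik>"))   # '<' and '>' are in string.punctuation
print(strip_ascii_punctuation("«Üäik»…"))  # guillemets and the ellipsis are left in place
```

The table is built once and reused, so repeated calls cost only the translate() pass itself.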

Demo:

>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub(ur'[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik

If string.punctuation is not enough, you can generate a complete translate() mapping for all P and Sm codepoints by iterating from 0 to sys.maxunicode and testing each value against unicodedata.category():

>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik

(For Python 3, replace unicode with str, unichr() with chr(), and print ... with print(...), and drop the u from ur"..." string prefixes.)
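For reference, the full-table approach might look like this under Python 3. It is a sketch rather than a definitive implementation: the names build_removal_table, REMOVE and strip_punctuation are invented for illustration, and the category prefixes are the P and Sm classes discussed above.

```python
import sys
import unicodedata


def build_removal_table(prefixes=("P", "Sm")):
    """Map every codepoint whose general category starts with one of
    `prefixes` to None, so that str.translate() deletes it.

    Hypothetical helper; scanning all codepoints takes a moment, so build once.
    """
    return dict.fromkeys(
        i
        for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith(prefixes)
    )


REMOVE = build_removal_table()


def strip_punctuation(text):
    """Delete all punctuation (P*) and math symbols (Sm) from text."""
    return text.translate(REMOVE)


print(strip_punctuation("«<Üäik>»…"))  # Üäik
```

Building the table scans every codepoint once, so it is worth keeping at module level rather than rebuilding it per call. Other categories (for example Sc for currency symbols) could be added to prefixes if they should be stripped as well.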
