Python \ufffd after replacement with Chinese content

Views: 46 Published: 2021/6/26 20:16:42 Tags: python regex python-2.7


Question


After we found the answer to a previous question, we ran into another unusual replacement behavior:

Our regex is:

[\\(（\\[{【]+(\\w+|\\s+|\\S+|\\W+)?[）\\)\\]}】]+

We are trying to match all content inside any type of brackets, including the brackets themselves. The original text is:

物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))

The result is:

物�研真题详解

The code for the replacement is:

delimiter = ' '
if localization == 'CN':
    delimiter = ''
# The pattern is encoded to UTF-8 bytes before compiling
p = re.compile(codecs.encode(unicode(regex), "utf-8"), flags=re.I)
columnString = p.sub(delimiter, columnString).strip()

Why does the � (\ufffd) character appear, and how can this behavior be fixed?

We run into the same problem when we use this regex:

(\\d*[满|元])

print repr(columnString)='\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'

print repr(regex)=u'[\\(\uff08\\[{\u3010]+(\\w+|\\s+|\\S+|\\W+)?[\uff09\\)\\]}\u3011]+'

print repr(p.pattern)='[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+'

Solution

You should not mix UTF-8 and regular expressions. Process all your text as Unicode. Make sure you decode both the regex and the input string to unicode values first:

>>> import re
>>> columnString = '\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'
>>> regex = '[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+'
>>> utf8_compiled = re.compile(regex, flags=re.I)
>>> utf8_compiled.sub('', columnString)
'\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4'
>>> print utf8_compiled.sub('', columnString).decode('utf8', 'replace')
当代骨�
>>> unicode_compiled = re.compile(regex.decode('utf8'), flags=re.I | re.U)
>>> unicode_compiled.sub('', columnString.decode('utf8'))
u'\u5f53\u4ee3\u9aa8\u4f24\u79d1\u5999\u65b9'
>>> print unicode_compiled.sub('', columnString.decode('utf8'))
当代骨伤科妙方
>>> print unicode_compiled.sub('', u'物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))')
物理化学名校考研真题详解

When you use UTF-8, your pattern consists of separate bytes for the 【 codepoint:

>>> '【'
'\xe3\x80\x90'

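which means your character class matches any of those bytes; \xe3, \x80, and \x90 are each separately valid members of that character class. A match can therefore begin or end in the middle of a multi-byte character. In the session above, the substitution starts at the \xbc byte inside \xe4\xbc\xa4 (伤), because \xbc is also one of the bytes of （ (\xef\xbc\x88). The leftover \xe4 is not valid UTF-8 on its own, so it is displayed as � (\ufffd) when the result is decoded.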
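The byte-level behavior can be reproduced with a short script. This is a sketch that assumes Python 3 syntax (the question targets Python 2.7), with `bytes` standing in for the Python 2 byte string. The variable names are illustrative only.

```python
import re

text = '当代骨伤科妙方(第四版)'
data = text.encode('utf-8')

# The pattern from the question, encoded to UTF-8 as in the original code.
pattern_text = r'[\(（\[{【]+(\w+|\s+|\S+|\W+)?[）\)\]}】]+'
byte_re = re.compile(pattern_text.encode('utf-8'), flags=re.I)

# In a bytes pattern the class is a set of single bytes, so the middle byte
# of '（' (0xBC) also matches the middle byte of '伤'.
hit = byte_re.search(data)
print(hit.start(), data[hit.start():hit.start() + 1])

# The substitution removes everything from that byte on, leaving the lead
# byte of '伤' dangling at the end of the result.
damaged = byte_re.sub(b'', data)
print(damaged)
print(damaged.decode('utf-8', 'replace'))

# Compiling the same pattern as text matches whole characters instead.
text_re = re.compile(pattern_text, flags=re.I)
print(text_re.sub('', text))
```

The only difference between the two compiled patterns is whether the pattern and input are bytes or text, which is why decoding both before matching is enough to fix the output.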
