openpyxl - convert cell value from 'utf-8' to 'ascii'

Problem description

So I'm trying to convert a cell's value into a usable string.

What I'm trying to do is use the cell value in a regex, but it keeps throwing this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 26: ordinal not in range(128)

This is just one of several problems: when I do convert the value from unicode to ASCII, another cell gives me an error because its value is a datetime.

Any advice on how to convert these values to strings so that they can be used in a regex? The values are all printable.

Solution

I don't see why you have to convert from utf-8 at all.
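As a rough sketch of working without any ASCII conversion: in Python 3 a regex can be applied directly to a str that contains characters such as \u2019, and a datetime cell only needs to be turned into text first. The helper name `cell_text` and the sample values are made up for illustration; they stand in for whatever `cell.value` returns from the worksheet.

```python
import re
from datetime import datetime


def cell_text(value):
    """Return a str for any cell value (hypothetical helper, not part of openpyxl)."""
    if value is None:
        return ""                      # empty cell
    if isinstance(value, datetime):
        return value.isoformat(sep=" ")  # e.g. "2015-03-14 09:30:00"
    return str(value)                  # str stays as-is, numbers become text


# Stand-ins for values read from cells; no workbook is opened here.
values = ["It\u2019s due on Friday", datetime(2015, 3, 14, 9, 30), 42, None]

pattern = re.compile(r"\bdue\b")
matches = [bool(pattern.search(cell_text(v))) for v in values]
print(matches)  # [True, False, False, False]
```

The point of the design is that nothing is encoded: the text stays a Unicode str from the cell to the regex, so the 'ascii' codec is never involved.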

From the unicode docs:
UTF-8 uses the following rules:

If the code point is < 128, it’s represented by the corresponding byte value.
If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.  
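The two rules can be checked quickly in Python 3. The sample characters below are arbitrary picks (one per byte length), not taken from the question's spreadsheet.

```python
# One sample character for each UTF-8 length: 1, 2, 3 and 4 bytes.
samples = ["a", "\u07b4", "\u2019", "\U0001f600"]

lengths = [len(ch.encode("utf-8")) for ch in samples]

# Every byte of a non-ASCII character's encoding lies in the range 128..255.
high_bytes_only = all(
    b >= 128 for ch in samples[1:] for b in ch.encode("utf-8")
)

print(lengths)          # [1, 2, 3, 4]
print(high_bytes_only)  # True
```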

If you really need ASCII, you can encode with one of the error handlers, for instance:

u.encode('utf-8') = b"\xea\x80\x80abcd\xde\xb4 u'\\u2019'=\xe2\x80\x99"
u.encode('ascii', 'ignore') = b"abcd u'\\u2019'="
u.encode('ascii', 'replace') = b"?abcd? u'\\u2019'=?"
u.encode('ascii', 'xmlcharrefreplace') = b"&#40960;abcd&#1972; u'\\u2019'=&#8217;"
u.encode('ascii', 'backslashreplace') = b"\\ua000abcd\\u07b4 u'\\u2019'=\\u2019"  
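The lines above show results only and never define `u`. Working backwards from the UTF-8 bytes, `u` appears to be the string below; that is an inference, not something stated in the answer. A runnable version of the same comparison, assuming Python 3:

```python
# Inferred from the byte output above: U+A000, "abcd", U+07B4, then the
# literal text u'\u2019'= followed by the real character U+2019.
u = "\ua000abcd\u07b4 u'\\u2019'=\u2019"

results = {"utf-8": u.encode("utf-8")}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    # Each handler decides what happens to characters ASCII cannot represent.
    results[handler] = u.encode("ascii", handler)

for name, value in results.items():
    print(name, "=", value)
```

Note that 'ignore' and 'replace' lose information, while 'xmlcharrefreplace' and 'backslashreplace' keep the code point in an escaped form.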


From the re docs:
Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
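A small sketch of that rule, assuming Python 3: a str pattern works on str, a bytes pattern works on bytes, and mixing the two raises TypeError. The sample text is invented.

```python
import re

text = "It\u2019s here"       # str containing a non-ASCII apostrophe
data = text.encode("utf-8")   # the same content as bytes

str_match = re.search("It\u2019s", text)   # str pattern on str: fine
bytes_match = re.search(rb"here", data)    # bytes pattern on bytes: fine

try:
    re.search(rb"here", text)              # bytes pattern on str: not allowed
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(str_match.group(), bytes_match.group(), type(mixed_error).__name__)
```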

re.A
re.ASCII

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching.
This is only meaningful for Unicode patterns, and is ignored for byte patterns.

Note that for backward compatibility, the re.U flag still exists 
(as well as its synonym re.UNICODE and its embedded counterpart (?u)), 
but these are redundant in Python 3 since matches are Unicode by default for strings 
(and Unicode matching isn’t allowed for bytes).
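To see what the flag changes, here is a short comparison on made-up sample text (Python 3). Without re.ASCII, \w and \d match non-ASCII letters and digits; with it, they do not.

```python
import re

words = "caf\u00e9 na\u00efve"   # "café naïve"
digits = "3\u0663"                # ASCII 3 followed by ARABIC-INDIC DIGIT THREE

unicode_words = re.findall(r"\w+", words)            # default Unicode matching
ascii_words = re.findall(r"\w+", words, re.ASCII)    # ASCII-only \w

unicode_digits = re.findall(r"\d", digits)
ascii_digits = re.findall(r"\d", digits, re.ASCII)

print(unicode_words, ascii_words)    # ['café', 'naïve'] ['caf', 'na', 've']
print(unicode_digits, ascii_digits)  # ['3', '٣'] ['3']
```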

Tested with Python 3.4.2
