Remove "characters with encodings larger than 3 bytes" using Python 3


Question

I want to remove characters whose UTF-8 encoding is larger than 3 bytes, because Amazon Mechanical Turk asks me to do so when I upload my CSV data:

Your CSV file needs to be UTF-8 encoded and cannot contain characters with encodings larger than 3 bytes. For example, some non-English characters are not allowed (learn more).

To overcome this problem, I want to write a filter_max3bytes function that removes those characters in Python 3.

x = 'below ð\x9f~\x83,'
y = filter_max3bytes(x)  # expected: y == "below ~,"

I will then apply the function to the data before saving it to a UTF-8 encoded CSV file.

This post is related to my problem, but it uses Python 2 and its solution did not work for me.

Thanks!

Recommended answer

None of the characters in your string seems to take 3 bytes or more in UTF-8:

x = 'below ð\x9f~\x83,'

Anyway, if there were any, the way to remove characters that take 3 or more bytes would be:

filtered_x = ''.join(char for char in x if len(char.encode('utf-8')) < 3)

For example, with a string that does contain 3-byte characters:

>>> x = 'abcd漢字efg'
>>> ''.join(char for char in x if len(char.encode('utf-8')) < 3)
'abcdefg'
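
Note that the `< 3` test also drops characters that take exactly 3 bytes, such as 漢 and 字. If the Mechanical Turk message is read literally ("larger than 3 bytes"), only 4-byte characters need to go. The following is a minimal sketch under that assumption; the function name comes from the question, while the sample rows and the in-memory buffer are hypothetical stand-ins for the real data and file.

```python
import csv
import io

def filter_max3bytes(text):
    """Keep only characters whose UTF-8 encoding is at most 3 bytes long."""
    return ''.join(ch for ch in text if len(ch.encode('utf-8')) <= 3)

# 漢 and 字 take 3 bytes each and are kept; U+1F603 takes 4 bytes and is dropped.
sample = filter_max3bytes('abcd漢字efg\U0001F603!')
print(sample)  # abcd漢字efg!

# Hypothetical rows; an in-memory buffer stands in for the real CSV file.
rows = [['id', 'text'], ['1', 'smile \U0001F603 end']]
buffer = io.StringIO()
writer = csv.writer(buffer)
for row in rows:
    # Filter every cell before it is written out.
    writer.writerow([filter_max3bytes(cell) for cell in row])
csv_text = buffer.getvalue()
```

To write an actual file instead of a buffer, open it with `open(path, 'w', encoding='utf-8', newline='')` and pass the file object to `csv.writer`.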


BTW, you can verify that your original string has no 3-byte encodings by doing the following:

>>> for char in 'below ð\x9f~\x83,':
...     print(char, [hex(b) for b in char.encode('utf-8')])
...
b ['0x62']
e ['0x65']
l ['0x6c']
o ['0x6f']
w ['0x77']
  ['0x20']
ð ['0xc3', '0xb0']
  ['0xc2', '0x9f']
~ ['0x7e']
  ['0xc2', '0x83']
, ['0x2c']
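
The same check can be wrapped in small helpers that report which characters exceed a byte limit, which may be handy for inspecting data before filtering it. This is only a sketch; the names `utf8_lengths` and `chars_over_limit` and the second sample string are made up for illustration.

```python
def utf8_lengths(text):
    """Return (character, UTF-8 byte count) pairs for each character in text."""
    return [(ch, len(ch.encode('utf-8'))) for ch in text]

def chars_over_limit(text, max_bytes=3):
    """Return (index, character, byte count) for characters above max_bytes."""
    return [(i, ch, n)
            for i, (ch, n) in enumerate(utf8_lengths(text))
            if n > max_bytes]

original = 'below ð\x9f~\x83,'
# Longest encoding in the question's string: 2 bytes, matching the output above.
longest = max(n for _, n in utf8_lengths(original))
print(longest)  # 2

# A string that does contain a 4-byte character (U+1F603) at index 3.
offenders = chars_over_limit('ok \U0001F603 漢')
print(offenders)
```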


A wild guess

I believe the OP is asking the wrong question, and the real question is whether each character is printable. I'll assume anything Python displays as \x<number> is not printable, so this solution should work:

x = 'below ð\x9f~\x83,'
filtered_x = ''.join(char for char in x if not repr(char).startswith("'\\x"))

Result:

'below ð~,'
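
A possible alternative to inspecting the repr is `str.isprintable()`, which Python documents as true for exactly the characters that `repr()` leaves unescaped. A minimal sketch on the same string:

```python
x = 'below ð\x9f~\x83,'

# Keep only characters that repr() would show unescaped.
filtered_x = ''.join(char for char in x if char.isprintable())
print(repr(filtered_x))  # 'below ð~,'
```

The two tests are not identical. `isprintable()` also drops characters such as '\n' and '\t' and anything shown as a \u or \U escape, which the startswith("'\\x") test keeps, so which one fits depends on the data.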
