Filtering out certain bytes in Python
Question
I'm getting this error in my Python program: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The question "random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes" explains the issue.
The solution was to filter out certain bytes, but I'm confused about how to go about doing this.
Any help?
Sorry if I didn't give enough info about the problem. The string data comes from an external API query, and I have no control over how the data is formatted.
Recommended answer
As the answer to the linked question said, the XML standard defines a valid character as:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Translating that into Python:
    def valid_xml_char_ordinal(c):
        codepoint = ord(c)
        # conditions ordered by presumed frequency
        return (
            0x20 <= codepoint <= 0xD7FF or
            codepoint in (0x9, 0xA, 0xD) or
            0xE000 <= codepoint <= 0xFFFD or
            0x10000 <= codepoint <= 0x10FFFF
        )
You can then use that function however you need to, e.g.
cleaned_string = ''.join(c for c in input_string if valid_xml_char_ordinal(c))
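If the per-character generator turns out to be slow on large strings, the same Char production can be written as a single regular expression. The following is only a sketch, assuming Python 3 (where str is Unicode); the names strip_invalid_xml_chars and clean_api_bytes are hypothetical, and the UTF-8 default is an assumption about what the external API sends, not something stated in the question.

```python
import re

# Negated character class: matches anything OUTSIDE the XML 1.0 Char
# production quoted above (#x9 | #xA | #xD | [#x20-#xD7FF] | ...).
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with every character that XML 1.0 disallows removed."""
    return _INVALID_XML_CHARS.sub("", text)


def clean_api_bytes(raw, encoding="utf-8"):
    """Decode raw bytes, then strip XML-invalid characters.

    The encoding is an assumption; undecodable bytes become U+FFFD,
    which is itself a valid XML character.
    """
    return strip_invalid_xml_chars(raw.decode(encoding, errors="replace"))


# NUL, vertical tab and U+FFFE are dropped; tab and newline are kept.
sample = "ok\x00 line\x0b one\ttab\nnew\ufffe end"
cleaned = strip_invalid_xml_chars(sample)
print(repr(cleaned))
```

Both approaches remove the same characters; which one is faster depends on the input, so it is worth timing them on real data before switching.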