Filtering out certain bytes in Python

Question

I'm getting this error in my Python program: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

This question, "random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes", explains the issue.

The solution was to filter out certain bytes, but I'm confused about how to go about doing this.

Any help?

Sorry if I didn't give enough information about the problem. The string data comes from an external API query, and I have no control over how the data is formatted.

Recommended answer

As the answer to the linked question said, the XML standard defines a valid character as:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Translating that into Python:

def valid_xml_char_ordinal(c):
    """Return True if the single character c is allowed by the XML Char rule."""
    codepoint = ord(c)
    # conditions ordered by presumed frequency
    return (
        0x20 <= codepoint <= 0xD7FF or
        codepoint in (0x9, 0xA, 0xD) or
        0xE000 <= codepoint <= 0xFFFD or
        0x10000 <= codepoint <= 0x10FFFF
        )

You can then use that function however you need to, e.g.

cleaned_string = ''.join(c for c in input_string if valid_xml_char_ordinal(c))
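
For larger inputs, the same Char rule can also be written as a single regular expression, so the filtering happens in one pass. The following is only a sketch of that alternative, assuming Python 3; the names strip_invalid_xml_chars and _INVALID_XML_CHARS and the sample payload are made up for illustration and are not from the original answer. Since the data in the question arrives from an external API, the sketch also assumes it may arrive as UTF-8 bytes and decodes it before filtering.

```python
import re

# Negated character class: matches any character that falls OUTSIDE the
# XML Char rule quoted above (hypothetical helper name, not from the answer).
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with every character that the XML Char rule forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


# Hypothetical payload: assumed to be UTF-8 bytes containing a NUL byte and a
# vertical tab, both of which the XML Char rule rejects.
raw = b"ok\x00 line\x0b one\ttab\n"
text = raw.decode("utf-8", errors="replace")  # undecodable bytes become U+FFFD
cleaned = strip_invalid_xml_chars(text)
print(repr(cleaned))  # prints 'ok line one\ttab\n'
```

Tab, newline and carriage return survive because the rule explicitly allows them; only the NUL byte and the vertical tab are dropped. Whether to delete invalid characters or replace them with a placeholder depends on what the consumer of the XML expects.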
