BeautifulSoup does not parse XML with an encoding other than utf-8

Question

I can read all XML files that start with <?xml version="1.0" encoding="utf-8"?>, but I cannot read files that start with <?xml version="1.0" encoding="ISO-8859-1"?>.

Specifically, I have two files:

xml-iso.xml:

<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
</note>

xml-utf.xml:

<?xml version="1.0" encoding="utf-8"?>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
</note>

With the following code, I can find the note element in the utf-8 file, but I cannot find it in the file with the other encoding. How can I solve that?

Sample code:

import unittest

from bs4 import BeautifulSoup as Soup

class TestEncoding(unittest.TestCase):
    def test_iso(self):
        with open('tests/xml-iso.xml', 'r') as f_in:
            xml_soup = Soup(f_in.read(), 'xml')
        print('xml-iso:\n{}'.format(xml_soup))
        note = xml_soup.find('note')
        self.assertIsNotNone(note)

    def test_utf8(self):
        with open('tests/xml-utf.xml', 'r') as f_in:
            xml_soup = Soup(f_in.read(), 'xml')
        print('xml-utf8:\n{}'.format(xml_soup))
        note = xml_soup.find('note')
        self.assertIsNotNone(note)

if __name__ == '__main__':
    unittest.main()

Versions:

  • Python 3.5.2

  • beautifulsoup4==4.6.0

Recommended answer

Coincidentally, I stumbled upon another workaround: read the file in binary mode ('rb'), so that BeautifulSoup receives the raw bytes instead of an already decoded string:

with open('tests/xml-iso.xml', 'rb') as f_in:
    xml_soup = Soup(f_in.read(), 'xml')
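For anyone who wants to try the workaround without creating the tests/ directory, here is a minimal self-contained sketch. It assumes lxml is installed, since the 'xml' parser in bs4 depends on it. The find_note helper and the temporary directory are hypothetical names added for illustration and are not part of the original question. A likely reason this works is that, given bytes, the parser can apply the encoding named in the XML declaration itself, but treat that as an interpretation rather than a confirmed diagnosis.

```python
import os
import tempfile

from bs4 import BeautifulSoup as Soup

# Same content as xml-iso.xml from the question, kept as raw bytes.
XML_ISO = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<note>\n'
    b'    <to>Tove</to>\n'
    b'    <from>Jani</from>\n'
    b'    <heading>Reminder</heading>\n'
    b'</note>\n'
)


def find_note(path):
    """Hypothetical helper: parse the file at `path` from bytes, return <note>."""
    # 'rb' passes undecoded bytes to BeautifulSoup, so the encoding
    # declared in the XML header can be detected and applied by the parser.
    with open(path, 'rb') as f_in:
        xml_soup = Soup(f_in.read(), 'xml')
    return xml_soup.find('note')


# Write the sample into a temporary directory (illustrative location only).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'xml-iso.xml')
    with open(path, 'wb') as f_out:
        f_out.write(XML_ISO)
    note = find_note(path)

print(note.heading.text if note is not None else 'note not found')
```

The same change applies to both test methods in the question: replacing 'r' with 'rb' in the open() call is enough, and the utf-8 file can be read in binary mode in the same way.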
