BeautifulSoup does not parse XML with an encoding other than utf-8
Question
I can read all XML files that start with <?xml version="1.0" encoding="utf-8"?>, but I cannot read files that start with <?xml version="1.0" encoding="ISO-8859-1"?>.
Specifically, I have two files:
xml-iso.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
</note>
xml-utf.xml:
<?xml version="1.0" encoding="utf-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
</note>
With the following code I can find the note element in the file encoded as utf-8, but I cannot find it in the file with the other encoding. How can I solve that?
Sample code:
import unittest

from bs4 import BeautifulSoup as Soup


class TestEncoding(unittest.TestCase):

    def test_iso(self):
        with open('tests/xml-iso.xml', 'r') as f_in:
            xml_soup = Soup(f_in.read(), 'xml')
        print('xml-iso:\n{}'.format(xml_soup))
        note = xml_soup.find('note')
        self.assertIsNotNone(note)

    def test_utf8(self):
        with open('tests/xml-utf.xml', 'r') as f_in:
            xml_soup = Soup(f_in.read(), 'xml')
        print('xml-utf8:\n{}'.format(xml_soup))
        note = xml_soup.find('note')
        self.assertIsNotNone(note)


if __name__ == '__main__':
    unittest.main()
Versions:
Python 3.5.2
beautifulsoup4==4.6.0
Recommended answer
Coincidentally, I stumbled upon another workaround. Read the file in binary mode ('rb'):
with open('tests/xml-iso.xml', 'rb') as f_in:
    xml_soup = Soup(f_in.read(), 'xml')
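
One way to see what binary mode changes: with 'rb', Python does no decoding of its own, so the parser receives the original bytes together with the encoding declaration and can apply that encoding itself. The sketch below illustrates only that idea. It is an assumption-laden stand-in, not the answer's code: it swaps in the standard library's xml.etree.ElementTree for BeautifulSoup so that it runs without extra packages, and the sample bytes, the added é character, and the helper name read_note_recipient are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample with the same shape as xml-iso.xml, plus one
# non-ASCII character (é is the single byte 0xE9 in ISO-8859-1).
RAW = (b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
       b'<note><to>Tov\xe9</to></note>')


def read_note_recipient(path):
    """Return the text of <to>, reading the file in binary mode."""
    # Binary mode: nothing is decoded here, so the parser gets the raw
    # bytes and the encoding declaration exactly as stored on disk.
    with open(path, 'rb') as f_in:
        raw = f_in.read()
    # ElementTree (a stand-in for the parser used in the question) reads
    # encoding="ISO-8859-1" from the declaration and decodes accordingly.
    root = ET.fromstring(raw)
    return root.find('to').text


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'xml-iso.xml')
    with open(path, 'wb') as f_out:
        f_out.write(RAW)
    recipient = read_note_recipient(path)

print(recipient)
```

With BeautifulSoup, the same bytes would be passed to Soup(raw, 'xml') in place of ET.fromstring(raw), as in the answer's snippet.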