如何使Python bs4在XML上正常工作? [英] How to get Python bs4 to work properly on XML?

查看:133
本文介绍了如何使Python bs4在XML上正常工作?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用Python和BeautifulSoup 4(bs4)将Inkscape SVG转换为某些专有软件的XML格式.我似乎无法让bs4正确解析一个最小的示例.我需要解析器尊重自动关闭标签,处理unicode而不添加html内容.我认为用selfClosingTags指定'lxml'解析器应该可以,但是不能!看看.

I'm trying to use Python and BeautifulSoup 4 (bs4) to convert Inkscape SVGs into an XML-like format for some proprietary software. I can't seem to get bs4 to correctly parse a minimal example. I need the parser to respect self-closing tags, handle unicode, and not add html stuff. I thought specifying the 'lxml' parser with selfClosingTags should do it, but nope! check it out.

#!/usr/bin/python
from __future__ import print_function
from bs4 import BeautifulSoup

print('\nbs4 mangled XML:')
print(BeautifulSoup('<x><c name="b1"><d value="a"/></c></x>',
    features = "lxml", 
    selfClosingTags = ('d')).prettify())

print('''\nExpected output:
<x>
 <c name="b1">
  <d value="a"/>
 </c>
</x>''')

此打印

bs4 mangled XML:
/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.4.1-py2.7.egg/bs4/__init__.py:112: UserWarning: BS4 does not respect the selfClosingTags argument to the BeautifulSoup constructor. The tree builder is responsible for understanding self-closing tags.
<html>
 <body>
  <x>
   <c name="b1">
    <d value="a">
    </d>
   </c>
  </x>
 </body>
</html>

Expected output:
<x>
 <c name="b1">
  <d value="a"/>
 </c>
</x>

我已经查看了有关StackOverflow的问题,但没有找到解决方案.

I've reviewed related StackOverflow questions, and I am not finding the solution.

此问题解决html样板文件,但仅用于解析html的子节,而不用于解析xml.

This question addresses the html boilerplate, but only for parsing subsections of html, not for parsing xml.

这个问题与使beautifulsoup 4尊重自动闭合标签有关,目前尚无公认的答案.

This question pertains to getting beautifulsoup 4 to respect self-closing tags, and has no accepted answers.

此问题似乎表明传递selfClosingTags参数应该会有所帮助,但是如您所见现在会生成警告BS4 does not respect the selfClosingTags argument,并且自动关闭的标签已损坏.

This question seems to indicate that passing the selfClosingTags argument should help, but as you can see this now generates a warning BS4 does not respect the selfClosingTags argument, and self-closing tags are mangled.

这个问题提示使用"xml"(而不是"lxml")将导致空标签自动自动关闭.此可能可以满足我的需求,但是将"xml"解析器应用于我的实际数据失败,因为文件包含unicode,而"xml"解析器不支持.

This question suggests that using "xml" (not "lxml") will cause empty tags to be automatically self-closing. This might work for my purposes, but applying the "xml" parser to my actual data fails because the files contain unicode, which the "xml" parser does not support.

"xml"与"lxml"不同,并且在标准上"xml" 不能支持Unicode,而"lxml" 不能包含自动关闭标签?也许我只是在尝试做被禁止的事情?

Is "xml" different from "lxml", and is it in the standard that "xml" cannot support unicode, and "lxml" cannot contain self-closing tags? Perhaps I'm simply trying to do something that is forbidden?

推荐答案

如果希望将结果输出为xml,则将其解析为.您的xml数据可以包含unicode,但是,您需要声明编码:

If you want the result to be output as xml then parse it as that. Your xml data can contain unicode, however, you need to declare the encoding:

#!/usr/bin/env python
# -*- encoding: utf8 -*-

SelfClosingTags不再被识别.相反,美丽 汤认为任何空标签都是空元素标签.如果您添加一个 子元素为空元素标签,它将不再是空元素标签.

The SelfClosingTags is no longer recognized. Instead, Beautiful Soup considers any empty tag to be an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.

更改您的函数以使其看起来应该工作(除了编码):

Changing your function to look like this should work (In addition to the encoding):

print(BeautifulSoup('<x><c name="b1"><d value="a®"/></c></x>',
    features = "xml").prettify())

结果:

<?xml version="1.0" encoding="utf-8"?>
<x>
 <c name="b1">
  <d value="aÂŽ"/>
 </c>
</x>

这篇关于如何使Python bs4在XML上正常工作?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆