Parse XML with (X)HTML entities


Question

Trying to parse XML with ElementTree when the document contains an undefined entity (e.g. &nbsp;) raises:

ParseError: undefined entity &nbsp;

In Python 2.x the XML entity dict can be updated after creating the parser (documentation):

parser = ET.XMLParser()
parser.entity["nbsp"] = unichr(160)

But how can the same be done with Python 3.x?

Update: There was a misunderstanding on my side. I overlooked that I was calling parser.parser.UseForeignDTD(1) before trying to update the XML entity dict, and that call was causing the parser error. Luckily, @m.brindley was patient and pointed out that the XML entity dict still exists in Python 3.x and can be updated the same way as in Python 2.x.

Recommended answer

The issue here is that the only valid mnemonic (predefined) entities in XML are quot, amp, apos, lt and gt. This means that almost all (X)HTML named entities must be declared in the DTD, using the entity declaration markup defined in the XML 1.1 spec. If the document is to be standalone, this should be done with an inline DTD like so:

<?xml version="1.1" ?>
<!DOCTYPE naughtyxml [
    <!ENTITY nbsp "&#0160;">
    <!ENTITY copy "&#0169;">
]>
<data>
    <country name="Liechtenstein">
        <rank>1&nbsp;&gt;</rank>
        <year>2008&copy;</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
</data>
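
As a quick check of the inline-DTD approach, the sketch below feeds a trimmed-down copy of that document to ElementTree. The trimmed document and the variable names are illustrative and not part of the original answer, and the declaration uses version 1.0 as a conservative assumption. Because nbsp and copy are declared in the inline DTD, no extra parser configuration should be needed.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the document above (illustrative, not from the answer).
# nbsp and copy are declared in the inline DTD, so expat can expand them itself.
DOCUMENT = '''<?xml version="1.0"?>
<!DOCTYPE naughtyxml [
    <!ENTITY nbsp "&#0160;">
    <!ENTITY copy "&#0169;">
]>
<data>
    <country name="Liechtenstein">
        <rank>1&nbsp;&gt;</rank>
        <year>2008&copy;</year>
    </country>
</data>'''

root = ET.fromstring(DOCUMENT)

# The entity references arrive as ordinary characters in the element text.
rank_text = root.find('country/rank').text
year_text = root.find('country/year').text
print(repr(rank_text))
print(repr(year_text))
```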

The XMLParser in xml.etree.ElementTree uses xml.parsers.expat to do the actual parsing. The init arguments for XMLParser include a slot for 'predefined HTML entities', but that argument is not implemented yet. The init method creates an empty dict named entity, and this dict is what is used to look up undefined entities.

I don't think expat (and, by extension, the ET XMLParser) can get around this by switching namespaces to something like XHTML. Possibly this is because it will not fetch external namespace definitions (I tried making xmlns="http://www.w3.org/1999/xhtml" the default namespace for the data element, but it did not play nicely), but I can't confirm that. By default, expat raises an error for non-XML entities, but you can get around that by declaring an external DOCTYPE: this causes the expat parser to pass undefined entity references back to the ET.XMLParser's _default() method.

The _default() method looks the entity name up in the entity dict of the XMLParser instance and, if it finds a matching key, replaces the entity with the associated value. This keeps the Python 2.x syntax mentioned in the question working.
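
The difference the external DOCTYPE makes can be illustrated with a small sketch. The helper name parse_with_copy_mapping and the one-element document are hypothetical, chosen only for this illustration. The same entity mapping is applied in both cases, but only the document that declares an external DOCTYPE is expected to parse.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document body containing one undefined entity.
BODY = '<data><year>2008&copy;</year></data>'
DOCTYPE = ('<!DOCTYPE data PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')


def parse_with_copy_mapping(document):
    """Parse document with a parser whose entity dict maps copy to U+00A9."""
    parser = ET.XMLParser()
    parser.entity['copy'] = chr(0xA9)
    return ET.fromstring(document, parser)


# Without a DOCTYPE, expat rejects the entity before the dict is consulted.
try:
    parse_with_copy_mapping(BODY)
    no_doctype_result = 'parsed'
except ET.ParseError:
    no_doctype_result = 'ParseError'

# With an external DOCTYPE, the reference is handed back to XMLParser,
# which resolves it through the entity dict.
root = parse_with_copy_mapping(DOCTYPE + BODY)
year_text = root.find('year').text
print(no_doctype_result, repr(year_text))
```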

Solutions:


  • If the data does not have an external DOCTYPE and contains (X)HTML mnemonic entities, you are out of luck. It is not valid XML and expat is right to throw an error. You should add an external DOCTYPE.
  • If the data has an external DOCTYPE, you can just use your old syntax to map mnemonic names to characters. Note: use chr() in Python 3; unichr() is no longer a valid name.
    • Alternatively, you could update XMLParser.entity with html.entities.html5 to map all valid HTML5 mnemonic entities to their characters.
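
A minimal sketch of the html.entities.html5 alternative follows. One caveat, which is my own observation and not part of the answer: most keys in html5 carry a trailing semicolon (for example 'hearts;'), while XMLParser looks entities up by bare name, so the sketch strips the semicolon first. The document and the name HTML5_ENTITIES are illustrative.

```python
import xml.etree.ElementTree as ET
from html.entities import html5

# Illustrative document with an external DOCTYPE and two HTML entities.
DOCUMENT = '''<!DOCTYPE data PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<data><rank>1&nbsp;&gt;</rank><suit>&hearts;</suit></data>'''

# html5 keys such as 'hearts;' include the semicolon; the parser's entity
# dict is keyed by the bare name, so normalise the keys before updating.
HTML5_ENTITIES = {name.rstrip(';'): char for name, char in html5.items()}

parser = ET.XMLParser()
parser.entity.update(HTML5_ENTITIES)
root = ET.fromstring(DOCUMENT, parser)

rank_text = root.find('rank').text
suit_text = root.find('suit').text
print(repr(rank_text), repr(suit_text))
```

If only the HTML 4 names are needed, html.entities.name2codepoint is a smaller table whose keys already omit the semicolon.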

Here is the snippet I used. It parses XML with an external DOCTYPE in three ways: through HTMLParser (to demonstrate how to add entity handling by subclassing), through ET.XMLParser with entity mappings, and through expat (which just silently ignores undefined entities because of the external DOCTYPE). The document contains a valid XML entity (&gt;) and an undefined entity (&copy;), which I map to chr(0x24B8) with the ET.XMLParser.

      from html.parser import HTMLParser
      from html.entities import name2codepoint
      import xml.etree.ElementTree as ET
      import xml.parsers.expat as expat
      
      xml = '''<?xml version="1.0"?>
      <!DOCTYPE data PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
      <data>
          <country name="Liechtenstein">
              <rank>1&gt;</rank>
              <year>2008&copy;</year>
              <gdppc>141100</gdppc>
              <neighbor name="Austria" direction="E"/>
              <neighbor name="Switzerland" direction="W"/>
          </country>
      </data>'''
      
      # HTMLParser subclass which handles entities
      print('=== HTMLParser')
      class MyHTMLParser(HTMLParser):
          def handle_starttag(self, name, attrs):
              print('Start element:', name, attrs)
          def handle_endtag(self, name):
              print('End element:', name)
          def handle_data(self, data):
              print('Character data:', repr(data))
          def handle_entityref(self, name):
              self.handle_data(chr(name2codepoint[name]))
      
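      # convert_charrefs=False is needed for handle_entityref() to be called:
      # since Python 3.5 the default (True) converts references automatically.
      htmlparser = MyHTMLParser(convert_charrefs=False)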
      htmlparser.feed(xml)
      
      
      # ET.XMLParser parse
      print('=== XMLParser')
      parser = ET.XMLParser()
      parser.entity['copy'] = chr(0x24B8)
      root = ET.fromstring(xml, parser)
      print(ET.tostring(root))
      for elem in root:
          print(elem.tag, ' - ', elem.attrib)
          for subelem in elem:
              print(subelem.tag, ' - ', subelem.attrib, ' - ', subelem.text)
      
      # Expat parse
      def start_element(name, attrs):
          print('Start element:', name, attrs)
      def end_element(name):
          print('End element:', name)
      def char_data(data):
          print('Character data:', repr(data))
      print('=== Expat')
      expatparser = expat.ParserCreate()
      expatparser.StartElementHandler = start_element
      expatparser.EndElementHandler = end_element
      expatparser.CharacterDataHandler = char_data
      expatparser.Parse(xml)
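
If silently dropping the entity is not acceptable when using expat directly, one possible approach is the SkippedEntityHandler callback that pyexpat exposes. The sketch below is an assumption-laden illustration and not part of the original answer: the document, the chunks and skipped lists, and the choice of name2codepoint as the lookup table are all mine.

```python
import xml.parsers.expat as expat
from html.entities import name2codepoint

# Illustrative document: external DOCTYPE plus one undefined entity.
DOCUMENT = '''<?xml version="1.0"?>
<!DOCTYPE data PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<data><year>2008&copy;</year></data>'''

chunks = []   # character data in document order
skipped = []  # names of entities expat could not resolve


def char_data(data):
    chunks.append(data)


def skipped_entity(name, is_parameter_entity):
    # Record the name, and substitute the HTML 4 character when one is known.
    skipped.append(name)
    if not is_parameter_entity and name in name2codepoint:
        chunks.append(chr(name2codepoint[name]))


expatparser = expat.ParserCreate()
expatparser.CharacterDataHandler = char_data
expatparser.SkippedEntityHandler = skipped_entity
expatparser.Parse(DOCUMENT, True)

text = ''.join(chunks)
print(skipped, repr(text))
```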
      
