Missing Child While Using lxml to Parse Children of Paragraph Tag


Problem description

I am using the Python library lxml to parse the HTML retrieved from http://ohhla.com/YFA_natedogg.html. I have had no trouble using lxml in the past, but I may have just run into a bug: a child element that plainly appears in the HTML is missing from the lxml tree.

Here is the Python code I am using to parse the HTML:

from urllib.request import urlopen
from lxml import etree

html_response = urlopen("http://ohhla.com/YFA_natedogg.html")
html_parser = etree.HTMLParser()
tree = etree.parse(html_response, html_parser)
tree.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0]

A simplified version of the HTML code from the website I am parsing looks like this:

<table id='AutoNumber7'>
    <tbody>
        <tr></tr>
        <tr>
            <td>
                # ... (irrelevant tags) ... 
                <p>
                    <a></a>
                    # The following <table> tag is what I need to target:
                    <table></table>
                </p>
                # ... (seven <p> tags identical to the above) ...
            </td>
        </tr>
    </tbody>
</table>

When I run tree.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0].getchildren() in the console, lxml only detects the initial anchor tag <a> and ignores the sibling <table> tag that I need to select (marked by the comment in the HTML above).

This is the console output:

tree.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0].getchildren()
Out[22]: [<Element a at 0x2904a2a5808>]

What I would expect to see is:

tree.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0].getchildren()
Out[22]: [<Element a at 0x2904a2a5808>, <Element table at 0x???????????>]

Any ideas why the <table> tag is missing from the <p> tag's children? How can I select this <table> tag? I need to parse all content from the table tag, but lxml does not seem to recognize it as a valid child element. If anyone can provide a working XPath selector for the desired <table> tag I would be very grateful!

Note: I understand that I should really be seeing [<Element tr at 0x??????????>, <Element tr at 0x???????????>, ...] and not [<Element table at 0x??????????>], but I was trying to be more concise.

For those who don't consider the above code reproducible, literally just copy and paste this into the console:

from urllib.request import urlopen
from lxml import etree

html_response = urlopen("http://ohhla.com/YFA_natedogg.html")
html_parser = etree.HTMLParser()
tree = etree.parse(html_response, html_parser)
print(tree.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0].getchildren())

As before, the HTML I am trying to parse is located at http://ohhla.com/YFA_natedogg.html.

I don't really know how to be more concise than this. Constructive comments are appreciated (as always).

  • Linking to pages I have already read (e.g. How to create a Minimal, Complete, and Verifiable example) without commentary is not constructive criticism.
  • Pointing out which steps I may have missed, or what to improve in the future (with reference to a particular resource), is constructive criticism that benefits both me and the community as a whole.
  • I gladly accept advice on how to improve my posts, but please provide actual recommendations. Remember that several people may read the same resource and come to different conclusions.

Recommended answer

I think the problem is that lxml tries to play by the rules of HTML. According to those rules, <table> (a block-level element) cannot be a child of <p>, so the parser closes the open <p> as soon as it reaches the <table>. See https://www.w3.org/TR/html4/struct/text.html#h-9.3.1.

A short demonstration:

from lxml import html

test = """
<html>
  <p>
    <table>
      <tr>
        <td>XXX</td>
      </tr>
    </table>
  </p>
</html>"""

root = html.fromstring(test)

# Just print the string representation of the parsed HTML
print(html.tostring(root).decode("UTF-8"))

In the output from this code we can see that lxml refuses to interpret <table> as a child of <p>:

<html>
  <body><p>
    </p><table>
      <tr>
        <td>XXX</td>
      </tr>
    </table>

</body></html>

<a> is an inline element, so it makes sense that it is included in the return value from getchildren(). The <table>, by contrast, ends up in the parsed tree as a sibling that follows the <p>, so you will have to find some other way to identify the <table> elements that you are interested in.
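One possible way to do that, sketched here under assumptions rather than tested against the live page: start from the <p> and look at its following siblings instead of its children. The markup below is a made-up stand-in for the real document, and find_table_for() is a hypothetical helper, not part of lxml. It checks for a child <table> first in case a parser keeps the nesting, then falls back to the nearest following sibling.

```python
from lxml import html

# Made-up stand-in for the real page: a <table> written inside a <p>.
snippet = """
<html><body>
<table id="AutoNumber7">
  <tr><td>header</td></tr>
  <tr><td>
    <p><a href="#">link</a><table><tr><td>XXX</td></tr></table></p>
  </td></tr>
</table>
</body></html>"""

root = html.fromstring(snippet)


def find_table_for(p):
    """Hypothetical helper: return the <table> that was written inside <p>.

    Looks for a child <table> first, then for the nearest <table> among
    the siblings that follow the <p>. Returns None if neither exists.
    """
    child = p.find("table")
    if child is not None:
        return child
    siblings = p.xpath("following-sibling::table[1]")
    return siblings[0] if siblings else None


# Same XPath as in the question to reach the first <p> of the second row.
p = root.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0]
print([child.tag for child in p])

inner_table = find_table_for(p)
cell_texts = inner_table.xpath(".//td/text()")
print(cell_texts)
```

On the real page the same idea would amount to appending /following-sibling::table[1] to the XPath from the question, assuming each <p> is followed directly by its own table.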

The http://ohhla.com/YFA_natedogg.html document claims to be XHTML, but it has many errors and cannot be parsed as an XML document.
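To illustrate the difference between the two parsers, here is a small sketch using an invented piece of malformed markup (not taken from the page in question). lxml's default XML parser rejects it, while etree.HTMLParser recovers and builds a tree anyway.

```python
from lxml import etree

# Invented malformed markup: <br> and <p> are never closed.
broken = "<html><body><p>unclosed paragraph<br></body></html>"

# The strict XML parser raises on the mismatched tags.
try:
    etree.fromstring(broken)
    parsed_as_xml = True
except etree.XMLSyntaxError as exc:
    parsed_as_xml = False
    print("XML parser rejected the markup:", exc)

# The HTML parser is lenient and repairs the structure.
html_root = etree.fromstring(broken, etree.HTMLParser())
print(etree.tostring(html_root).decode("UTF-8"))
```

This is why the question's code has to go through etree.HTMLParser, and also why the resulting tree can differ from the nesting written in the source.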
