Missing Child While Using lxml to Parse Children of Paragraph Tag
Problem description
I am using the Python library lxml to parse the HTML retrieved from this URL (http://ohhla.com/YFA_natedogg.html). I have had no trouble using lxml in the past; however, I may have just encountered a bug: a child element that plainly appears in the HTML is missing from the lxml tree.
Here is the Python code I am using to parse the HTML:
from urllib.request import urlopen
from lxml import etree
html_response = urlopen("http://ohhla.com/YFA_natedogg.html")
html_parser = etree.HTMLParser()
tree = etree.parse(html_response, html_parser)
tree.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0]
A simplified version of the HTML code from the website I am parsing looks like this:
<table id='AutoNumber7'>
  <tbody>
    <tr></tr>
    <tr>
      <td>
        <!-- ... (irrelevant tags) ... -->
        <p>
          <a></a>
          <!-- The following <table> tag is what I need to target: -->
          <table></table>
        </p>
        <!-- ... (seven <p> tags identical to the above) ... -->
      </td>
    </tr>
  </tbody>
</table>
When I run

tree.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0].getchildren()

in the console, lxml only detects the initial anchor tag <a> and ignores the sibling <table> tag that I need to select (denoted by the above comment in the code).

This is the console output:

tree.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0].getchildren()
Out[22]: [<Element a at 0x2904a2a5808>]

What I expected to see is:

tree.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0].getchildren()
Out[22]: [<Element a at 0x2904a2a5808>, <Element table at 0x???????????>]

Any ideas why the <table> tag is missing from the <p> tag's children?

How can I select this <table> tag? I need to parse all content from the table tag, but lxml does not seem to recognize it as a valid child element. If anyone can provide a working XPath selector for the desired <table> tag I would be very grateful!

Note: I understand that I should really be seeing [<Element tr at 0x??????????>, <Element tr at 0x???????????>, ...] not [<Element table at 0x??????????>], but I was trying to be more concise.

For those who don't consider the above code reproducible, literally just copy and paste this into the console:

from urllib.request import urlopen
from lxml import etree
html_response = urlopen("http://ohhla.com/YFA_natedogg.html")
html_parser = etree.HTMLParser()
tree = etree.parse(html_response, html_parser)
print(tree.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0].getchildren())

As before, the HTML I am trying to parse is located at http://ohhla.com/YFA_natedogg.html.

I don't really know how to be more concise than this. Constructive comments are appreciated (as always).

Recommended answer

I think the problem is that lxml tries to play by the rules of HTML. According to those rules, <table> (a block-level element) cannot be a child of <p>. See https://www.w3.org/TR/html4/struct/text.html#h-9.3.1.

Short demo:

from lxml import html
test = """
<html>
<p>
<table>
<tr>
<td>XXX</td>
</tr>
</table>
</p>
</html>"""
root = html.fromstring(test)
# Just print the string representation of the parsed HTML
print(html.tostring(root).decode("UTF-8"))

In the output from this code we can see that lxml refuses to interpret <table> as a child of <p>:

<html>
<body><p>
</p><table>
<tr>
<td>XXX</td>
</tr>
</table>
</body></html>

The http://ohhla.com/YFA_natedogg.html document claims to be XHTML, but it has many errors and cannot be parsed as an XML document.
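For contrast, here is a minimal sketch of how a strict XML parser treats the same kind of nesting. It uses Python's standard-library xml.etree.ElementTree on a small, hypothetical snippet that is well-formed XML (it is not the real page). An XML parser applies no HTML content rules, so the nesting is kept exactly as written:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet used only for illustration.
snippet = "<p><a>link</a><table><tr><td>XXX</td></tr></table></p>"

# Parse as XML: no HTML rules are applied, so <table> stays inside <p>.
p = ET.fromstring(snippet)

# Collect the tag names of the direct children of <p>.
child_tags = [child.tag for child in p]
print(child_tags)
```

This only works when the input is well-formed XML, so it illustrates the difference between the two parsing modes rather than offering a fix for a malformed page.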
<a> is an inline element, so it makes sense that it is included in the return value from getchildren(). You will have to find some other way to identify the <table> elements that you are interested in.
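One possible way to do that, sketched under assumptions: since the HTML parser moves the <table> out of the <p>, it can be looked up as the element that immediately follows the <p>. The markup below is a hypothetical stand-in for the page structure described in the question (the real page may differ), and the XPath also checks inside the <p> so that it does not depend on where the parser placed the element:

```python
from lxml import html

# Hypothetical markup that mimics the structure described in the question.
doc = """
<html><body>
<table id="AutoNumber7">
<tr><td>header</td></tr>
<tr><td>
<p><a href="#">link</a>
<table><tr><td>XXX</td></tr></table>
</p>
</td></tr>
</table>
</body></html>
"""

root = html.fromstring(doc)

# Same selector as in the question: the first <p> in the second row.
p = root.xpath("//table[@id='AutoNumber7']/tr[2]/td/p[1]")[0]

# Look for the table both inside <p> and immediately after it.
tables = p.xpath("./table | ./following-sibling::*[1][self::table]")

# Read the cell text of whatever table was found.
cells = [td.text for t in tables for td in t.xpath(".//td")]
print(cells)
```

On the real page the same idea would be applied to each of the <p> elements in turn; whether every <p> there is directly followed by its table is an assumption that should be checked against the parsed tree.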