How to find direct children of an element in lxml


Question

I found an object with a specific class:

THREAD = TREE.find_class('thread')[0]

Now I want to get all <p> elements that are its direct children.

I tried:

THREAD.findall("p")

THREAD.xpath("//div[@class='thread']/p")

But both of these return all <p> elements inside this <div>, regardless of whether that <div> is their closest parent.

How can I make this work?

Sample HTML:

<div class='thread'>
   <p> <!-- 1 -->
      <!-- Can be some others <p> objects inside, which should not be counted -->
   </p> 
   <p><!-- 2 --></p>
</div>
<div class='thread'>
   <p>[...]</p>
   <p>[...]</p>
</div>

The script should find the two <p> objects that are children of THREAD. I should receive a list of two objects, marked "1" and "2" in the comments in the sample HTML.

One more clarification, since people get confused:

THREAD is an object stored in a variable; it can be any HTML element. I want to find the <p> objects that are direct children of THREAD. Those <p> elements cannot be outside THREAD, nor inside any other element that is itself inside THREAD.

Recommended answer

I'm not sure, but it seems that your problem is in the HTML itself. Note that there are a couple of tag omission cases that apply to p nodes, so the closing tags of the paragraphs

<div class='thread'>
    <p>first
        <p>second</p>
    </p>
</div>

are simply ignored by the parser, and both nodes are identified as siblings rather than as parent and child, e.g.

<div class='thread'>
    <p>first
    <p>second
</div>

So the XPath //div[@class="thread"]/p will return both paragraphs.
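The sibling behaviour described here can be checked directly. The following is a minimal sketch, assuming lxml's HTML parser via `lxml.html.fromstring`; the variable names are illustrative and the snippet mirrors the nested-paragraph example above.

```python
from lxml import html

# Nested <p> markup, as in the example above
snippet = """
<div class='thread'>
    <p>first
        <p>second</p>
    </p>
</div>
"""

thread = html.fromstring(snippet).find_class('thread')[0]

# Record the parent tag of every <p> that carries text
parents = {}
for p in thread.iter('p'):
    text = (p.text or '').strip()
    if text:
        parents[text] = p.getparent().tag

print(parents)
```

If the parser closes the first paragraph when the second one opens, as the answer suggests, both entries should report the outer div as their parent.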

You can simply replace the p tags with div tags and you'll see different behaviour:

<div class='thread'>
    <div>first
        <div>second</div>
    </div>
</div>

Here //div[@class="thread"]/div returns only the first node.
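As a rough illustration of the div case, here is a sketch of three ways to request only direct children relative to an element already held in a variable. It assumes `lxml.html.fromstring` and uses illustrative variable names; it is not part of the original answer.

```python
from lxml import html

# Nested <div> markup, which the parser keeps nested
snippet = """
<div class='thread'>
    <div>first
        <div>second</div>
    </div>
</div>
"""

thread = html.fromstring(snippet).find_class('thread')[0]

# Direct children only, all relative to `thread`
via_xpath = thread.xpath('./div')
via_findall = thread.findall('div')
via_iteration = [child for child in thread if child.tag == 'div']

# For contrast: every <div> below `thread`, at any depth
all_descendants = thread.xpath('.//div')

print(len(via_xpath), len(via_findall), len(via_iteration), len(all_descendants))
```

Note that an XPath starting with `//` is evaluated from the document root even when called on an element, so a relative path such as `./div` is the safer choice when the starting element is already known.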

Please correct me if my assumption is incorrect.
