How to select all children text but exclude a tag with Scrapy's XPath?

Question

I have this HTML:

<div id="content">
    <h1>Title 1</h1><br><br>

    <h2>Sub-Title 1</h2>
    <br><br>
    Description 1.<br><br>Description 2.
    <br><br>

    <h2>Sub-Title 2</h2>
    <br><br>
    Description 1<br>Description 2<br>
    <br><br>

    <div class="infobox">
        <font style="color:#000000"><b>Information Title</b></font>
        <br><br>Long Information Text
    </div>
</div>

I want to get all text in <div id="content"> with XPath in Scrapy but excluding <div class="infobox">'s content, so the expected result is like this:

Title 1


Sub-Title 1


Description 1.

Description 2.


Sub-Title 2


Description 1
Description 2

But I haven't reached the excluding part yet, I'm still struggling to grab the text from the <div id="content">.

I have tried this:

response.xpath('//*[@id="content"]/text()').extract()

But it only returns Description 1. and Description 2. from under both sub-titles.

Then I tried:

response.xpath('//*[@id="content"]//*/text()').extract()

But it only returns Title 1, Sub-Title 1, Sub-Title 2, Information Title, and Long Information Text.

So there are two questions here:

  1. How could I get all of children text from content div?
  2. How to exclude the infobox div from the selection?

Answer

Use the descendant:: axis to find descendant text nodes, and state explicitly that the parent of those text nodes must not be a div[@class='infobox'] element.

Turning the above into an XPath expression:

//div[@id = 'content']/descendant::text()[not(parent::div/@class='infobox')]

Then, the result is similar to the following (I tested with an online XPath tool). As you can see, the text content of div[@class='infobox'] no longer appears in the result.

-----------------------
Title 1
-----------------------
-----------------------
Sub-Title 1
-----------------------
-----------------------
Description 1.
-----------------------
Description 2.
-----------------------
-----------------------
Sub-Title 2
-----------------------
-----------------------
Description 1
-----------------------
Description 2
-----------------------
-----------------------
-----------------------

What is wrong with your approaches?

Your first attempt:

//*[@id="content"]/text()

In plain English, this means:

Look for any element (not necessarily a div) anywhere in the document that has an @id attribute whose value is "content". For this element, return all of its immediate child text nodes.

Problem: You are losing the text nodes that are not immediate children of the outer div, since they are inside child elements of that div.

Your second attempt:

//*[@id="content"]//*/text()

This translates to:

Look for any element (not necessarily a div) anywhere in the document, that has an attribute @id, its value being "content". For this element, find any descendant element node and return all text nodes of that descendant element.

Problem: You are losing the immediate child text nodes of the div, since you are only looking at text nodes that are children of elements that are descendants of the div.
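Both failure modes are easy to reproduce outside of Scrapy. The sketch below uses lxml directly (Scrapy's selectors are built on top of lxml, so `response.xpath(...)` evaluates the same expressions); the markup is the sample HTML from the question:

```python
from lxml import etree

# Sample markup from the question.
HTML = """
<div id="content">
    <h1>Title 1</h1><br><br>
    <h2>Sub-Title 1</h2>
    <br><br>
    Description 1.<br><br>Description 2.
    <br><br>
    <h2>Sub-Title 2</h2>
    <br><br>
    Description 1<br>Description 2<br>
    <br><br>
    <div class="infobox">
        <font style="color:#000000"><b>Information Title</b></font>
        <br><br>Long Information Text
    </div>
</div>
"""

tree = etree.HTML(HTML)

def texts(expr):
    """Evaluate an XPath expression and drop whitespace-only text nodes."""
    return [t.strip() for t in tree.xpath(expr) if t.strip()]

# First attempt: only *direct* child text nodes of #content,
# so everything wrapped in <h1>/<h2> is missing.
print(texts('//*[@id="content"]/text()'))
# ['Description 1.', 'Description 2.', 'Description 1', 'Description 2']

# Second attempt: only text nodes *inside child elements* of #content,
# so the bare descriptions (direct text children of the div) are missing.
print(texts('//*[@id="content"]//*/text()'))
# ['Title 1', 'Sub-Title 1', 'Sub-Title 2', 'Information Title', 'Long Information Text']
```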

EDIT:

In reply to your comment:

//div[@id = 'content']/descendant::text()[not(ancestor::div/@class='infobox')]

For your future questions, please make sure the HTML you show is representative of your actual problems.
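Applied to the sample HTML from the question, the ancestor:: version returns exactly the expected list. This is again a sketch using lxml directly rather than a live Scrapy response; in a spider you would pass the same expression to `response.xpath(...).getall()`:

```python
from lxml import etree

# Sample markup from the question.
HTML = """
<div id="content">
    <h1>Title 1</h1><br><br>
    <h2>Sub-Title 1</h2>
    <br><br>
    Description 1.<br><br>Description 2.
    <br><br>
    <h2>Sub-Title 2</h2>
    <br><br>
    Description 1<br>Description 2<br>
    <br><br>
    <div class="infobox">
        <font style="color:#000000"><b>Information Title</b></font>
        <br><br>Long Information Text
    </div>
</div>
"""

tree = etree.HTML(HTML)

# Exclude every text node that has a div ancestor whose class is "infobox",
# no matter how deeply the text node is nested inside it.
nodes = tree.xpath(
    "//div[@id='content']/descendant::text()"
    "[not(ancestor::div/@class='infobox')]"
)
# Drop the whitespace-only nodes that come from indentation.
result = [t.strip() for t in nodes if t.strip()]
print(result)
# ['Title 1', 'Sub-Title 1', 'Description 1.', 'Description 2.',
#  'Sub-Title 2', 'Description 1', 'Description 2']
```

Unlike the parent:: predicate, the ancestor:: axis also excludes text such as Information Title that sits inside further elements (here a b inside a font) within the infobox.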
