How to select all child text but exclude a tag with Scrapy's XPath?
Problem description
I have this HTML:
<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
<div class="infobox">
<font style="color:#000000"><b>Information Title</b></font>
<br><br>Long Information Text
</div>
</div>
I want to get all text in <div id="content"> with XPath in Scrapy, but excluding the content of <div class="infobox">, so the expected result looks like this:
Title 1
Sub-Title 1
Description 1.
Description 2.
Sub-Title 2
Description 1.
Description 2.
But I haven't reached the exclusion part yet; I'm still struggling to grab the text from <div id="content">.
I've tried this:
response.xpath('//*[@id="content"]/text()').extract()
But it only returns "Description 1." and "Description 2." from under both sub-titles.
Then I tried:
response.xpath('//*[@id="content"]//*/text()').extract()
It only returns Title 1, Sub-Title 1, Sub-Title 2, Information Title, and Long Information Text.
So there are two questions here:
- How can I get all of the child text from the content div?
- How can I exclude the infobox div from the selection?
Recommended answer
Use the descendant:: axis to find descendant text nodes, and state explicitly that the parent of those text nodes must not be a div[@class='infobox'] element.
Turning the above into an XPath expression:
//div[@id = 'content']/descendant::text()[not(parent::div/@class='infobox')]
The result then looks similar to the following (I tested with an online XPath tool). As you can see, the text content of div[@class='infobox'] no longer appears in the result.
-----------------------
Title 1
-----------------------
-----------------------
Sub-Title 1
-----------------------
-----------------------
Description 1.
-----------------------
Description 2.
-----------------------
-----------------------
Sub-Title 2
-----------------------
-----------------------
Description 1
-----------------------
Description 2
-----------------------
-----------------------
-----------------------
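The blank entries in the result above are whitespace-only text nodes. Below is a minimal sketch of filtering them out, assuming the lxml library is installed (Scrapy's selectors are built on lxml, so the same XPath should behave the same way in response.xpath). The HTML is a shortened, hypothetical sample in which the infobox holds only direct text, and the extra [normalize-space()] predicate is an addition, not part of the answer's expression.

```python
from lxml import html

# Shortened, hypothetical sample: the infobox holds only direct text.
SAMPLE = """
<div id="content">
<h1>Title 1</h1><br><br>
Description 1.<br><br>Description 2.
<div class="infobox">
Long Information Text
</div>
</div>
"""

# The answer's expression plus a second predicate (an addition) that keeps
# only text nodes containing something other than whitespace.
EXPR = (
    "//div[@id = 'content']/descendant::text()"
    "[not(parent::div/@class='infobox')]"
    "[normalize-space()]"
)

tree = html.fromstring(SAMPLE)
# strip() removes the newlines that surround the remaining text nodes.
texts = [t.strip() for t in tree.xpath(EXPR)]
print(texts)
```

In a Scrapy callback the same expression could be passed to response.xpath(...).extract(), as in the question.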
What is wrong with your approaches?
Your first attempt:
//*[@id="content"]/text()
in plain English means:
Look for any element (not necessarily a div) anywhere in the document that has an attribute @id whose value is "content". For this element, return all its immediate child text nodes.
Problem: You are losing the text nodes that are not immediate children of the outer div, since they are inside a child element of that div.
Your second attempt:
//*[@id="content"]//*/text()
translates to:
Look for any element (not necessarily a div) anywhere in the document that has an attribute @id whose value is "content". For this element, find any descendant element node and return all text nodes of that descendant element.
Problem: You are losing the immediate child text nodes of the div, since you are only looking at text nodes that are children of elements that are descendants of the div.
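The two attempts can be compared side by side with a small sketch. It assumes the lxml library is installed and uses the HTML from the question; the helper name select_texts is hypothetical, and whitespace-only text nodes are dropped to keep the output readable.

```python
from lxml import html

# The HTML from the question.
PAGE = """<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
<div class="infobox">
<font style="color:#000000"><b>Information Title</b></font>
<br><br>Long Information Text
</div>
</div>"""


def select_texts(expr):
    """Evaluate expr against PAGE and return the stripped, non-empty text nodes."""
    tree = html.fromstring(PAGE)
    return [t.strip() for t in tree.xpath(expr) if t.strip()]


# First attempt: only text nodes that are direct children of the content div.
first_attempt = select_texts('//*[@id="content"]/text()')
# Second attempt: only text nodes that sit inside descendant elements.
second_attempt = select_texts('//*[@id="content"]//*/text()')

print(first_attempt)
print(second_attempt)
```

The first list should contain only the four description strings, and the second only the heading and infobox strings, which matches what the question reports.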
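Edit:
In response to your comment, test the ancestor:: axis instead of parent::, so that text nested deeper inside the infobox is excluded as well: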
//div[@id = 'content']/descendant::text()[not(ancestor::div/@class='infobox')]
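The difference between the parent:: and ancestor:: variants shows up on the HTML from the question, where "Information Title" sits inside <font><b> rather than directly inside the infobox. The following is a minimal sketch assuming the lxml library is installed; the helper name select_texts is hypothetical, and whitespace-only nodes are dropped for readability.

```python
from lxml import html

# The HTML from the question.
PAGE = """<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
<div class="infobox">
<font style="color:#000000"><b>Information Title</b></font>
<br><br>Long Information Text
</div>
</div>"""


def select_texts(expr):
    """Evaluate expr against PAGE and return the stripped, non-empty text nodes."""
    tree = html.fromstring(PAGE)
    return [t.strip() for t in tree.xpath(expr) if t.strip()]


# parent:: only rejects text nodes whose direct parent is the infobox div,
# so text nested inside <font><b> still gets through.
with_parent = select_texts(
    "//div[@id = 'content']/descendant::text()"
    "[not(parent::div/@class='infobox')]"
)

# ancestor:: rejects text nodes anywhere below the infobox div.
with_ancestor = select_texts(
    "//div[@id = 'content']/descendant::text()"
    "[not(ancestor::div/@class='infobox')]"
)

print(with_parent)
print(with_ancestor)
```

With this input, the parent:: variant should still return "Information Title" as its last item, while the ancestor:: variant returns only the titles and descriptions.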
For your future questions, please make sure the HTML you show is representative of your actual problem.