XPath: Select Current and Next Node's text by Current Node Attributes
Question
If this is a repeat question, I apologize, but I can't find another question, on SO or elsewhere, that seems to handle what I need. Here is my question:
I'm using scrapy to get some information out of this webpage. For clarity, here is the block of source code from that webpage that is of interest to me:
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [<span class='Helpcourse'
onMouseover="showtip(this,event,'24 Lectures')"
onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
onMouseover="showtip(this,event,'12 Tutorials')"
onMouseout="hidetip()">12T</span>]<br>
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br>
</span><br/><br/<br/>
Almost all of the code on that page looks like the above block.
From all of this, I need to grab:
- ANT101H5 Introduction to Biological Anthropology and Archaeology
- Exclusion: ANT100Y5
- Prerequisite: ANT102H5
The problem is that Exclusion: is inside a <span class="title2"> and ANT100Y5 is inside the following <a>.
I don't seem to be able to grab both of them out of this source code. Currently, I have code that attempts (and fails) to grab ANT100Y5, which looks like this:
hxs = HtmlXPathSelector(response)
sites = hxs.select("//*[(name() = 'p' and @class = 'titlestyle') or (name() = 'a' and @href and preceding-sibling::'//span/@class=title2')]")
I'd appreciate any help with this, even if it's a "you're blind for not seeing this other SO question which answers this perfectly" (in which case, I will vote to close this myself). I really am at my wits' end.
Thanks in advance.
Complete original code after the changes suggested by @Dimitre
I'm using the following code:
class regcalSpider(BaseSpider):
    name = "disc"
    allowed_domains = ['www.utm.utoronto.ca']
    start_urls = ['http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html']

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("/*/p/text()[1] | \
            (//span[@class='title2'])[1]/text() | \
            (//span[@class='title2'])[1]/following-sibling::a[1]/text() | \
            (//span[@class='title2'])[2]/text() | \
            (//span[@class='title2'])[2]/following-sibling::a[1]/text()")
        for site in sites:
            item = RegcalItem()
            item['title'] = site.select("a/text()").extract()
            item['link'] = site.select("a/@href").extract()
            item['desc'] = site.select("text()").extract()
            items.append(item)
        return items

        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
This gives me the following result:
[{"title": [], "link": [], "desc": []},
{"title": [], "link": [], "desc": []},
{"title": [], "link": [], "desc": []}]
This is not the output that I need. What am I doing wrong? Keep in mind that I'm running this script on the page mentioned above.
Recommended answer
My answer is very similar to @Flack's:
Having this XML document (I corrected the provided one by closing the numerous unclosed <br> elements and by wrapping everything in a single top element):
<body>
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span>
</p>
<span class='normaltext'> Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [
<span class='Helpcourse' onMouseover="showtip(this,event,'24 Lectures')" onMouseout="hidetip()">24L</span>,
<span class='Helpcourse' onMouseover="showtip(this,event,'12 Tutorials')" onMouseout="hidetip()">12T</span>]
<br/>
<span class='title2'>Exclusion: </span>
<a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a>
<br/>
<span class='title2'>Prerequisite: </span>
<a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a>
<br/>
</span>
<br/>
<br/>
<br/>
</body>
This XPath expression:
normalize-space(/*/p/text()[1])
when evaluated, produces the wanted string (the surrounding quotes are not part of the result; I added them to show the exact string produced):
"ANT101H5 Introduction to Biological Anthropology and Archaeology"
This XPath expression:
concat((//span[@class='title2'])[1],
(//span[@class='title2'])[1]
/following-sibling::a[1]
)
when evaluated, produces the following wanted result:
"Exclusion: ANT100Y5"
This XPath expression:
concat((//span[@class='title2'])[2],
(//span[@class='title2'])[2]
/following-sibling::a[1]
)
when evaluated, produces the following wanted result:
"Prerequisite: ANT102H5"
Note: In this particular case the abbreviation // is not needed, and in fact this abbreviation should be avoided whenever possible, because it slows down evaluation of the expression, in many cases causing a complete (sub)tree traversal. I am using // intentionally, because the provided XML fragment doesn't give us the full structure of the XML document. This also demonstrates how to correctly index the results of using // (note the surrounding parentheses), which helps prevent a very frequent mistake made when trying to do so.
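To make the indexing point concrete, here is a small sketch using Python's standard-library xml.etree.ElementTree. The two-block sample document and the "Corequisite" label are hypothetical, made up only to show the effect. ElementTree implements just a subset of XPath and counts positions among same-named children of each parent, but on this sample that agrees with what a full XPath engine returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: each <div> stands in for one course block.
DOC = """<body>
  <div>
    <span class="title2">Exclusion: </span>
    <span class="title2">Prerequisite: </span>
  </div>
  <div>
    <span class="title2">Corequisite: </span>
  </div>
</body>"""

root = ET.fromstring(DOC)

# Like //span[@class='title2'][1]: the [1] is applied per parent, so one
# span comes back for every parent that contains a matching first span.
per_parent = root.findall(".//span[@class='title2'][1]")

# Like (//span[@class='title2'])[1]: gather all matches in document order
# first, then index into the combined list.
whole_document = root.findall(".//span[@class='title2']")[:1]

print([s.text for s in per_parent])      # two spans, one per <div>
print([s.text for s in whole_document])  # only the very first span
```

Without the parentheses the predicate binds to the last location step, which is why the unparenthesized form can return several nodes where only one was expected.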
UPDATE: The OP has requested a single XPath expression that selects all the required text nodes. Here it is:
/*/p/text()[1]
|
(//span[@class='title2'])[1]/text()
|
(//span[@class='title2'])[1]/following-sibling::a[1]/text()
|
(//span[@class='title2'])[2]/text()
|
(//span[@class='title2'])[2]/following-sibling::a[1]/text()
When applied to the same XML document as above, the concatenation of the selected text nodes is exactly what is required:
ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5Prerequisite: ANT102H5
This result can be confirmed by running the following XSLT transformation:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()
   "/>
 </xsl:template>
</xsl:stylesheet>
When this transformation is applied to the same XML document (specified earlier in this answer), the wanted, correct result is produced:
ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5Prerequisite: ANT102H5
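For anyone who would rather check this from Python than from XSLT, here is a rough sketch using only the standard library. xml.etree.ElementTree has no following-sibling axis, no text() step and no concat(), so those pieces are emulated in plain Python. The sample is an abbreviated copy of the tidied document (the description text is shortened and the href values are placeholders), and the helper names are my own, not part of any library.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the tidied document; description and hrefs are placeholders.
DOC = """<body>
  <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
    <span class="distribution">(SCI)</span>
  </p>
  <span class="normaltext">Description text.
    <br/>
    <span class="title2">Exclusion: </span>
    <a href="#">ANT100Y5</a>
    <br/>
    <span class="title2">Prerequisite: </span>
    <a href="#">ANT102H5</a>
    <br/>
  </span>
</body>"""


def normalize_space(text):
    """Rough equivalent of XPath normalize-space(): trim and collapse whitespace."""
    return " ".join((text or "").split())


def label_with_next_link(root):
    """Join each <span class='title2'> text with its first following <a> sibling."""
    # ElementTree elements do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for span in root.iter("span"):
        if span.get("class") != "title2":
            continue
        siblings = list(parent_of[span])
        after = siblings[siblings.index(span) + 1:]
        # Emulates following-sibling::a[1]: the nearest later sibling named "a".
        link = next((s for s in after if s.tag == "a"), None)
        link_text = link.text if link is not None and link.text else ""
        results.append((span.text or "") + link_text)
    return results


root = ET.fromstring(DOC)
print(normalize_space(root.find("p").text))
print(label_with_next_link(root))
```

Here root.find("p").text plays the role of /*/p/text()[1], since ElementTree stores the text that precedes an element's first child in its .text attribute.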
Finally: The following single XPath expression selects exactly the wanted text nodes in the HTML page at the provided link (after tidying it to become well-formed XML):
(//p[@class='titlestyle'])[2]/text()[1]
|
(//span[@class='title2'])[2]/text()
|
(//span[@class='title2'])[2]/following-sibling::a[1]/text()
|
(//span[@class='title2'])[3]/text()
|
(//span[@class='title2'])[3]/following-sibling::a[1]/text()
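Since almost every block on that page follows the same pattern, hard-coded indexes like [2] and [3] only reach one course. Below is a sketch of one way to walk all the blocks instead, assuming the tidied page has each p class="titlestyle" followed by a span class="normaltext" as siblings under one top element. The second course in the sample is hypothetical, and parse_courses is a name I made up. It uses only the standard library, so the sibling lookup is done in Python rather than in XPath.

```python
import xml.etree.ElementTree as ET

# Two course blocks; the second one (XYZ...) is hypothetical sample data.
PAGE = """<body>
  <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
    <span class="distribution">(SCI)</span>
  </p>
  <span class="normaltext">Description text.
    <br/>
    <span class="title2">Exclusion: </span>
    <a href="#">ANT100Y5</a>
    <br/>
    <span class="title2">Prerequisite: </span>
    <a href="#">ANT102H5</a>
    <br/>
  </span>
  <p class="titlestyle">XYZ100H5 Hypothetical Second Course</p>
  <span class="normaltext">Description text.
    <br/>
    <span class="title2">Exclusion: </span>
    <a href="#">XYZ101H5</a>
    <br/>
  </span>
</body>"""


def parse_courses(root):
    """Group each title with the label/link pairs found in the block after it."""
    courses = []
    for elem in root:  # direct children of the top element, in document order
        if elem.tag == "p" and elem.get("class") == "titlestyle":
            title = " ".join((elem.text or "").split())
            courses.append({"title": title, "details": []})
        elif elem.tag == "span" and elem.get("class") == "normaltext" and courses:
            children = list(elem)
            for i, child in enumerate(children):
                if child.tag == "span" and child.get("class") == "title2":
                    # Nearest later sibling named "a", like following-sibling::a[1].
                    link = next((c for c in children[i + 1:] if c.tag == "a"), None)
                    value = link.text if link is not None and link.text else ""
                    courses[-1]["details"].append((child.text or "") + value)
    return courses


for course in parse_courses(ET.fromstring(PAGE)):
    print(course["title"], course["details"])
```

The real page may nest these elements differently once tidied, so the loop over the top element's children is the part most likely to need adjusting.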