Extracting an href's text from an html document
I'm trying to parse this piece of HTML:
<div>
<p>
<a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">A few years ago,</a>
<a href="#" class="transcriptLink" onclick="seekVideo(2000); return false;">I felt like I was stuck in a rut,</a>
<a href="#" class="transcriptLink" onclick="seekVideo(5000); return false;">so I decided to follow in the footsteps</a>
<a href="#" class="transcriptLink" onclick="seekVideo(7000); return false;">of the great American philosopher, Morgan Spurlock,</a>
<a href="#" class="transcriptLink" onclick="seekVideo(10000); return false;">and try something new for 30 days.</a>
</p>
</div>
I want to know how to get the text inside a tag, such as "A few years ago,".
I can get the text from a plain "<a> text </a>",
but I do not know how to get "A few years ago," from a tag that carries attributes, such as "<a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">A few years ago,</a>".
<a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">
<a href="#" class="transcriptLink" onclick="seekVideo(2000); return false;">
....................
The tags differ only in the onclick="seekVideo(....);" value.
You can use XPath. /div/p/a[1]/text() selects the first a by index. Alternatively, match on the @onclick value:
/div/p/a[starts-with(@onclick, 'seekVideo(0)')]/text()
Both queries return "A few years ago,".
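As a rough illustration of the two selections above, here is a minimal Python sketch. It assumes the fragment is well-formed enough to parse as XML and uses only the standard library's xml.etree.ElementTree, which supports a limited XPath subset (positional predicates, but not text() or starts-with()), so the @onclick match is emulated with an ordinary string check. A full XPath 1.0 engine could run the original expressions directly; the variable names here are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragment from the question.
HTML = """<div>
<p>
<a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">A few years ago,</a>
<a href="#" class="transcriptLink" onclick="seekVideo(2000); return false;">I felt like I was stuck in a rut,</a>
</p>
</div>"""

root = ET.fromstring(HTML)  # root is the <div> element

# Counterpart of /div/p/a[1]/text(): first <a> child of <p>, selected by position.
first_by_index = root.find("p/a[1]").text

# Counterpart of a[starts-with(@onclick, 'seekVideo(0)')]/text().
# ElementTree has no starts-with(), so the filter is done in Python.
first_by_onclick = next(
    a.text
    for a in root.iterfind("p/a")
    if a.get("onclick", "").startswith("seekVideo(0)")
)

print(first_by_index)
print(first_by_onclick)
```

Because the prefix 'seekVideo(0)' includes the closing parenthesis, it does not accidentally match other offsets.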
To get the number passed to seekVideo in @onclick, you can use this expression:
substring-before(substring-after(@onclick, '('), ')')
For example, to find the a whose @onclick seekVideo argument is 0, you can use this XPath:
/div/p/a[substring-before(substring-after(@onclick, '('), ')') = '0']/text()
or
/div/p/a[number(substring-before(substring-after(@onclick, '('), ')')) = 0]/text()
Both queries return "A few years ago,".
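The substring-before(substring-after(...)) idea can also be sketched in Python. This is only an approximation under stated assumptions: it uses the standard library's xml.etree.ElementTree, which lacks these XPath string functions, so a hypothetical helper named seek_argument reproduces their behavior with str.partition, including returning an empty string when a delimiter is missing.

```python
import xml.etree.ElementTree as ET

def seek_argument(onclick: str) -> str:
    """Mimic substring-before(substring-after(onclick, '('), ')')."""
    # substring-after: everything after the first '(' ('' if there is none).
    after = onclick.partition("(")[2]
    # substring-before: everything before the first ')' ('' if there is none).
    before, sep, _ = after.partition(")")
    return before if sep else ""

# Shortened copy of the fragment from the question.
HTML = """<div>
<p>
<a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">A few years ago,</a>
<a href="#" class="transcriptLink" onclick="seekVideo(2000); return false;">I felt like I was stuck in a rut,</a>
</p>
</div>"""

root = ET.fromstring(HTML)

# Counterpart of the string comparison ... = '0' in the XPath predicate.
matches = [
    a.text
    for a in root.iterfind("p/a")
    if seek_argument(a.get("onclick", "")) == "0"
]

# Map each numeric seek offset to its link text.
by_offset = {
    int(seek_argument(a.get("onclick", ""))): a.text
    for a in root.iterfind("p/a")
}

print(matches)
print(by_offset[2000])
```

The dictionary at the end is an extra convenience not present in the answer: once the argument is extracted, the links can be looked up by numeric offset.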