Extracting an href's text from an html document


Question


I'm trying to parse this piece of HTML:

<div>
  <p>
    <a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">A few years ago,</a>
    <a href="#" class="transcriptLink" onclick="seekVideo(2000); return false;">I felt like I was stuck in a rut,</a>
    <a href="#" class="transcriptLink" onclick="seekVideo(5000); return false;">so I decided to follow in the footsteps</a>
    <a href="#" class="transcriptLink" onclick="seekVideo(7000); return false;">of the great American philosopher, Morgan Spurlock,</a>
    <a href="#" class="transcriptLink" onclick="seekVideo(10000); return false;">and try something new for 30 days.</a>
  </p>
</div>

I want to know how to get the text inside a tag, for example: "A few years ago,"

I can get the text from a plain "<a> text </a>",

but I do not know how to get "A few years ago," from a tag that carries attributes, such as "<a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">A few years ago,</a>"

<a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">  
<a href="#" class="transcriptLink" onclick="seekVideo(2000); return false;">
....................

The tags differ only in their onclick="seekVideo(....); value.

Solution

You can use XPath. /div/p/a[1]/text() selects the a element by index, and /div/p/a[starts-with(@onclick, 'seekVideo(0)')]/text() selects it by matching the @onclick value. Both queries return "A few years ago,".
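As a rough illustration, the two selections can be tried out in Python. This is only a sketch under some assumptions: it uses the standard library's xml.etree.ElementTree, which supports a small subset of XPath (positional predicates such as a[1], but no text() step and no starts-with() function), so paths are written relative to the root div, the text is read from the .text attribute, and the starts-with() test is done in plain Python instead. The variable names are made up for the example, and the markup is shortened to two links.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the markup from the question (it is well-formed XML).
HTML = """<div>
  <p>
    <a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">A few years ago,</a>
    <a href="#" class="transcriptLink" onclick="seekVideo(2000); return false;">I felt like I was stuck in a rut,</a>
  </p>
</div>"""

root = ET.fromstring(HTML)  # root is the <div> element

# Counterpart of /div/p/a[1]/text(): ElementTree paths are relative to root,
# and the text is read from the element's .text attribute.
first_by_index = root.find("./p/a[1]").text

# Counterpart of a[starts-with(@onclick, 'seekVideo(0)')]: the filter is done
# in Python because ElementTree has no starts-with() function.
first_by_onclick = next(
    a.text
    for a in root.iterfind("./p/a")
    if a.get("onclick", "").startswith("seekVideo(0)")
)

print(first_by_index)
print(first_by_onclick)
```

With a full XPath 1.0 engine the expressions from the answer could presumably be passed through unchanged; the Python-side filter here is only a stand-in for that.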

To get the number passed to seekVideo in @onclick, you can use this expression:

substring-before(substring-after(@onclick, '('), ')')
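To make the nesting easier to follow, here is a small Python sketch that mimics what the expression computes on a single attribute value. The function name seek_argument is hypothetical and not part of any library. It follows the XPath 1.0 behaviour where substring-after and substring-before yield an empty string when the delimiter is missing.

```python
def seek_argument(onclick: str) -> str:
    """Mimic substring-before(substring-after(onclick, '('), ')').

    seek_argument is a hypothetical helper written for this example.
    """
    # substring-after(onclick, '('): everything after the first "(",
    # or "" when there is no "(" (str.partition already behaves that way).
    _, _, after_paren = onclick.partition("(")

    # substring-before(..., ')'): everything before the first ")",
    # or "" when there is no ")" (needs an explicit check in Python).
    before_close, found, _ = after_paren.partition(")")
    return before_close if found else ""


print(seek_argument("seekVideo(2000); return false;"))
```

Note that the result is still a string, which is why the comparison can be done either against '0' or, after conversion, against the number 0.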

For example, to find the a element whose @onclick calls seekVideo with 0, you can use this XPath:

/div/p/a[substring-before(substring-after(@onclick, '('), ')') = '0']/text()

or

/div/p/a[number(substring-before(substring-after(@onclick, '('), ')')) = 0]/text()

Both queries return "A few years ago,".
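The same numeric filter can be sketched in Python. As before, this assumes the standard library's xml.etree.ElementTree, whose XPath subset has no substring or number() functions, so the predicate is evaluated in Python. The names seek_argument and texts_for_offset are hypothetical helpers introduced only for this example, and the markup is shortened to three links.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the markup from the question.
HTML = """<div>
  <p>
    <a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">A few years ago,</a>
    <a href="#" class="transcriptLink" onclick="seekVideo(2000); return false;">I felt like I was stuck in a rut,</a>
    <a href="#" class="transcriptLink" onclick="seekVideo(5000); return false;">so I decided to follow in the footsteps</a>
  </p>
</div>"""


def seek_argument(onclick):
    """Hypothetical helper: text between the first "(" and the next ")"."""
    _, _, after_paren = onclick.partition("(")
    before_close, found, _ = after_paren.partition(")")
    return before_close if found else ""


def texts_for_offset(root, offset):
    """Hypothetical helper: text of each ./p/a whose seekVideo argument equals offset."""
    matches = []
    for a in root.iterfind("./p/a"):
        arg = seek_argument(a.get("onclick", ""))
        # Numeric comparison, like number(...) = 0; non-numeric values are skipped.
        if arg.isdigit() and int(arg) == offset:
            matches.append(a.text)
    return matches


root = ET.fromstring(HTML)
print(texts_for_offset(root, 0))
```

Comparing as integers rather than as strings means a value such as "00" would also match offset 0, which mirrors the difference between the '0' and number(...) = 0 variants of the XPath.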
