How to parse the following HTML code to get all the text before each "br" tag


Problem description

I have the following HTML code:

    <td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
    <a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Corbis">Corbis</a><br />
    Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill &amp; Melinda   Gates Foundation</a><br />
    <a href="/wiki/Creative_Director" title="Creative Director" class="mw-redirect">Director</a> of <a href="/wiki/Berkshire_Hathaway">Berkshire Hathaway</a><br/>
    <a href="/wiki/CEO" class="mw-redirect" title="CEO">CEO</a> of <a  href="/wiki/Cascade_Investment">Cascade Investment</a></td>

Semantically, the td element above contains five lines, separated by "<br/>" tags. I want to get those five lines as:

Chairman of Microsoft

Chairman of Corbis

Co-Chair of the Bill & Melinda Gates Foundation

Director of Berkshire Hathaway

CEO of Cascade Investment

Currently, my solution is to first select all the br elements inside this td:

    br_value = td_node.select('.//br')

Then, for each item in br_value, I use the following code to get the text:

    for br_item in br_value:
        one_item = br_item.select('.//preceding-sibling::*/text()').extract()

With this, I get the lines as:

Chairman Microsoft

Chairman Corbis

Bill & Melinda Gates Foundation

Director Berkshire Hathaway

CEO Cascade Investment

Compared with the text I want, these lines are missing "of" and some other text.

The reason is that "preceding-sibling::*" only returns sibling elements, so text that sits directly under the parent td, such as "of" in this case, is not returned.
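The point about "of" can be checked with a minimal sketch. It uses Python's standard-library xml.etree.ElementTree instead of Scrapy, which is an assumption made only so the example runs without extra packages, and it uses a shortened copy of the first line of the td. ElementTree's data model is not the same as XPath's, but it shows the same fact: " of " is not inside any child element, so a query that only visits elements never sees it.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the first line of the td from the question.
fragment = (
    '<td><a href="/wiki/Chairman">Chairman</a> of '
    '<a href="/wiki/Microsoft">Microsoft</a><br /></td>'
)
td = ET.fromstring(fragment)

# Text found inside the child elements (what an element-only query reaches).
element_texts = [child.text for child in td if child.tag != "br"]

# Text that follows each child but belongs to the td itself.
# ElementTree stores it in the "tail" attribute of the preceding element.
tails = [child.tail for child in td]

print(element_texts)  # ['Chairman', 'Microsoft']
print(tails)          # [' of ', None, None]
```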

Does anyone here know how to extract the complete text of each line separated by the br tags?

Thanks.

Recommended answer

Use this XPath query:

    //div[@id='???']/descendant-or-self::*[not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)]/text()

That is, to select just the text of the current node and all of its descendants, use this kind of query: ./descendant-or-self::*/text()

Or shorter (thanks to Empo): .//text()
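A query such as .//text() returns every text node, but the pieces still have to be grouped into lines at each br. One possible way to do that grouping is sketched below. It is not the answer's method: it swaps XPath for Python's standard-library html.parser so that it runs without Scrapy, and the names BrLineSplitter and split_on_br are hypothetical helpers introduced for illustration. The input is a shortened copy of the td from the question.

```python
from html.parser import HTMLParser


class BrLineSplitter(HTMLParser):
    """Collect all text in a fragment, starting a new line at every <br>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into &
        self.lines = []    # finished lines
        self._parts = []   # text pieces of the line currently being built

    def flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        line = " ".join("".join(self._parts).split())
        if line:
            self.lines.append(line)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing forms such as <br /> and <br/>.
        if tag == "br":
            self.flush()

    def handle_data(self, data):
        # Called for text inside <a> elements and for bare text such as " of ".
        self._parts.append(data)


def split_on_br(html):
    parser = BrLineSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the last line, which has no trailing <br>
    return parser.lines


html = """<td class="role"><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
    Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill &amp; Melinda   Gates Foundation</a><br/>
    <a href="/wiki/CEO">CEO</a> of <a href="/wiki/Cascade_Investment">Cascade Investment</a></td>"""

lines = split_on_br(html)
for line in lines:
    print(line)
# Chairman of Microsoft
# Co-Chair of the Bill & Melinda Gates Foundation
# CEO of Cascade Investment
```

Because the parser sees text in document order regardless of nesting, words such as "of" that belong directly to the td are kept alongside the link text.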
