How to parse the following HTML code to get all the text before each "br" tag

Views: 76 | Published: 2020/11/24 21:05:11 | Tags: xpath html-parsing


Problem description

I have the following HTML code:

    <td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
    <a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Corbis">Corbis</a><br />
    Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill &amp; Melinda   Gates Foundation</a><br />
    <a href="/wiki/Creative_Director" title="Creative Director" class="mw-redirect">Director</a> of <a href="/wiki/Berkshire_Hathaway">Berkshire Hathaway</a><br/>
    <a href="/wiki/CEO" class="mw-redirect" title="CEO">CEO</a> of <a  href="/wiki/Cascade_Investment">Cascade Investment</a></td>

Semantically, the td element above contains five lines, separated by "<br/>" tags. I want to get the five lines as:

Chairman of Microsoft

Chairman of Corbis

Co-Chair of the Bill & Melinda Gates Foundation

Director of Berkshire Hathaway

CEO of Cascade Investment

Currently, my solution is to first select all the br elements inside this td:

    br_value = td_node.select('.//br')

Then, for each item in br_value, I use the following code to get all the text:

    for br_item in br_value:
        one_item = br_item.select('.//preceding-sibling::*/text()').extract()

With this approach, I get the lines as:

Chairman Microsoft

Chairman Corbis

Bill & Melinda Gates Foundation

Director Berkshire Hathaway

CEO Cascade Investment

Compared with the text I want, these results are missing "of", as well as some other text such as "Co-Chair of the".

The reason is that "preceding-sibling::*" only returns the sibling elements (tags). It cannot return the bare text that belongs directly to the parent td, such as "of" in this case.

Does anyone know how to extract the complete text of each line separated by the br tags?

Thanks.
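One possible way to keep the bare text such as "of" is sketched here. It is not the asker's Scrapy code: it swaps in the standard library's `xml.etree.ElementTree`, which only works because this particular td happens to be well-formed XML, and the helper name `lines_split_by_br` and the constant `TD_HTML` are made up for illustration. In ElementTree, text that sits directly in the parent after a child element is stored in that child's `tail`, so walking the children and collecting both their inner text and their `tail` recovers each full line.

```python
import xml.etree.ElementTree as ET

# The td from the question, lightly shortened (title/class attributes dropped).
TD_HTML = """<td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
<a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Corbis">Corbis</a><br />
Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill &amp; Melinda Gates Foundation</a><br />
<a href="/wiki/Creative_Director">Director</a> of <a href="/wiki/Berkshire_Hathaway">Berkshire Hathaway</a><br/>
<a href="/wiki/CEO">CEO</a> of <a href="/wiki/Cascade_Investment">Cascade Investment</a></td>"""


def lines_split_by_br(td):
    """Return the text of `td` as a list of lines, split at each <br>."""
    lines = []
    parts = [td.text or ""]  # text before the first child element
    for child in td:
        if child.tag == "br":
            lines.append("".join(parts))  # a <br> closes the current line
            parts = []
        else:
            parts.append("".join(child.itertext()))  # text inside <a>...</a>
        parts.append(child.tail or "")  # bare text after the child, e.g. " of "
    lines.append("".join(parts))  # whatever follows the last <br>
    # Collapse runs of whitespace and drop lines that turn out empty.
    cleaned = [" ".join(line.split()) for line in lines]
    return [line for line in cleaned if line]


for line in lines_split_by_br(ET.fromstring(TD_HTML)):
    print(line)
```

In XPath terms, the `*` node test matches element nodes only, while `node()` and `text()` also match text nodes, so an expression built on `preceding-sibling::node()` would see the " of " fragments as well. For real-world pages that are not well-formed XML, an HTML-aware parser would be needed in place of `ET.fromstring`.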
