Using beautifulsoup to extract text between line breaks (e.g. <br /> tags)


Question

I have the following HTML inside a larger document:

<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />

I'm currently using BeautifulSoup to obtain other elements within the HTML, but I have not been able to find a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each of the <br /> elements, but I can't find a way to get the text in between them. Any help would be greatly appreciated. Thanks.

Recommended answer

If you just want any text that sits between two <br /> tags, you could do something like the following:

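# Written for Beautiful Soup 4 (the bs4 package) on Python 3
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample markup from the question
html = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''

soup = BeautifulSoup(html, 'html.parser')

for br in soup.find_all('br'):
    # The node right after this <br /> must be a text node
    next_s = br.next_sibling
    if not (next_s and isinstance(next_s, NavigableString)):
        continue
    # The node after that text must be another <br />
    next2_s = next_s.next_sibling
    if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
        # Skip whitespace-only text, such as the newline between two adjacent <br /> tags
        text = str(next_s).strip()
        if text:
            print("Found:", text)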
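
The loop above can also be wrapped in a function that returns the matches instead of printing them. This is only a minimal sketch, assuming Beautiful Soup 4 (the bs4 package) on Python 3 with the built-in html.parser backend; the names text_between_brs and sample are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_between_brs(html):
    """Return the stripped text nodes that sit directly between two <br> tags.

    Hypothetical helper, not part of Beautiful Soup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for br in soup.find_all("br"):
        node = br.next_sibling
        # Only a text node directly after the <br> qualifies
        if not isinstance(node, NavigableString):
            continue
        after = node.next_sibling
        # The text must be closed off by another <br>
        if isinstance(after, Tag) and after.name == "br":
            text = node.strip()
            if text:  # ignore whitespace-only gaps between adjacent <br> tags
                found.append(text)
    return found


sample = "<br />\nImportant Text 1\n<br />\n<br />\nNot Important Text\n<br />"
print(text_between_brs(sample))
```

Returning a list makes it easier to filter or post-process the lines afterwards than printing them inside the loop.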

But perhaps I misunderstand your question? Your description of the problem doesn't seem to match up with the "important" / "non important" labels in your example data, so I've gone with the description ;)
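
If the labels in the sample data are what matters, one possible reading is that a line counts as unimportant when it follows two consecutive <br /> tags (in other words, a blank line). That rule is only a guess inferred from the example, not something the question states. A rough sketch under that assumption, again using Beautiful Soup 4 on Python 3, with the hypothetical helper names is_br, follows_double_br and important_text:

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def is_br(node):
    """True if node is a <br> tag."""
    return isinstance(node, Tag) and node.name == "br"


def follows_double_br(text_node):
    """True if the text is preceded by two <br> tags with only whitespace between them."""
    prev = text_node.previous_sibling
    if not is_br(prev):
        return False
    gap = prev.previous_sibling
    # Step over a whitespace-only text node separating the two <br> tags
    if isinstance(gap, NavigableString) and not gap.strip():
        gap = gap.previous_sibling
    return is_br(gap)


def important_text(html):
    """Collect text between <br> tags, skipping lines that follow a blank line.

    ASSUMPTION: "unimportant" means "preceded by two consecutive <br> tags".
    """
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for node in soup.find_all(string=True):
        text = node.strip()
        between_brs = is_br(node.previous_sibling) and is_br(node.next_sibling)
        if text and between_brs and not follows_double_br(node):
            kept.append(text)
    return kept


sample = (
    "<br />\nImportant Text 1\n<br />\n<br />\nNot Important Text\n<br />\n"
    "Important Text 2\n<br />\nImportant Text 3\n<br />\n<br />\n"
    "Non Important Text\n<br />\nImportant Text 4\n<br />"
)
print(important_text(sample))
```

On the sample from the question this keeps the four "Important Text" lines, but real pages may need a different rule.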
