BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are


Question

I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don't care about them; I just want to get the inner text.

For example, for:

<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>

How can I extract:

Red
Blue
Yellow
Light green

Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the internal tags in advance - I want to deal with any that may occur.

Is there a 'just get the visible HTML' type of method in BeautifulSoup?

----UPDATE------

On advice, trying:

soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags): 
    print str(i) + p_tag

But that doesn't help - it prints out:

0Red
1

2Blue
3

4Yellow
5

6Light 
7green
8

Solution

Short answer: soup.findAll(text=True)

This has already been answered, here on StackOverflow and in the BeautifulSoup documentation.
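
As a rough illustration of what the short answer returns, here is a minimal sketch. It assumes the newer bs4 package (where the call is spelled find_all(string=True)) and the built-in "html.parser"; the names all_strings and visible are only illustrative. Called on the whole document, it returns every text node, including the whitespace-only ones between tags, which is why the attempt in the question printed blank entries.

```python
from bs4 import BeautifulSoup  # assumption: bs4 (beautifulsoup4), not BeautifulSoup 3

html = "<p>Red</p>\n<p><i>Blue</i></p>\n<p>Yellow</p>\n<p>Light <b>green</b></p>\n"
soup = BeautifulSoup(html, "html.parser")

# Every text node in the document, in order, including the "\n" between tags.
all_strings = soup.find_all(string=True)

# Dropping whitespace-only nodes leaves just the visible pieces, but
# "Light " and "green" are still separate nodes at this point.
visible = [s.strip() for s in all_strings if s.strip()]

print(len(all_strings))
print(visible)
```

Because the pieces of one paragraph come back as separate nodes, they still need to be joined per <p>.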

UPDATE:

To clarify, a working piece of code:

>>> txt = """
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green
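
For readers on Python 3, the same idea can be sketched with the newer bs4 package. This is an assumption on my part, since the session above uses BeautifulSoup 3.0.7a; in bs4, get_text() concatenates all text nested inside a tag, so no inner tag has to be named in advance. The helper paragraph_texts is a hypothetical name used only for this example.

```python
from bs4 import BeautifulSoup  # assumption: bs4 (beautifulsoup4) is installed

html = """
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(html, "html.parser")


def paragraph_texts(soup):
    """Return the text of every <p>, ignoring whatever tags are nested inside."""
    return [p.get_text() for p in soup.find_all("p")]


texts = paragraph_texts(soup)
for text in texts:
    print(text)
```

The explicit join from the session above should also work in bs4, written as ''.join(node.find_all(string=True)).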
