BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are
I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don't care about them; I just want to get the internal text.
For example, for:
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
How can I extract:
Red
Blue
Yellow
Light green
Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the internal tags in advance - I want to deal with any that may occur.
Is there a 'just get the visible HTML' type of method in BeautifulSoup?
----UPDATE------
On advice, trying:
soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p', text=True)
for i, p_tag in enumerate(p_tags):
    print str(i) + p_tag
But that doesn't help - it prints out:
0Red
1
2Blue
3
4Yellow
5
6Light
7green
8
Short answer: soup.findAll(text=True)
This has already been answered, here on StackOverflow and in the BeautifulSoup documentation.
UPDATE:
To clarify, a working piece of code:
>>> txt = """
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))
...
Red
Blue
Yellow
Light green
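The session above uses BeautifulSoup 3.0.7a on Python 2. For anyone on BeautifulSoup 4 and Python 3 (an assumption, since the question does not mention either), a minimal sketch of the same idea might look like the following. It uses the bs4 methods find_all() and get_text() in place of findAll(text=True) plus ''.join(); the variable names html and texts are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

# "html.parser" is the parser bundled with Python; other parsers would also work.
soup = BeautifulSoup(html, "html.parser")

# get_text() joins every text node under a tag, however deeply it is nested,
# so the inner <i> and <b> tags never have to be named in advance.
texts = [p.get_text() for p in soup.find_all("p")]

for text in texts:
    print(text)
```

Searching per <p> first and only then collecting the text is what keeps the words of each paragraph together, unlike the findAll('p', text=True) attempt in the question, which returned every text node as a separate item.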