Python BeautifulSoup: extract text between elements
Question
I am trying to extract "THIS IS MY TEXT" from the following HTML:
<html>
<body>
<table>
<td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</br>
</td>
</table>
</body>
</html>
This is what I tried:
soup = BeautifulSoup(html)  # html holds the markup shown above
for hit in soup.findAll(attrs={'class': 'MYCLASS'}):
    print(hit.text)
But this returns all the text between all the nested tags, plus the comment.
Can anyone help me get just "THIS IS MY TEXT" out of this?
Recommended answer
Learn more about how to navigate the parse tree in BeautifulSoup. The parse tree is made up of Tags and NavigableStrings (a bare piece of text such as THIS IS MY TEXT is a NavigableString). An example:
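# BeautifulSoup 3 import (Python 2); the maintained package is imported as: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

# The two <p> tags are left unclosed here; the output below shows them closed after parsing
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print(soup.prettify())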
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
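To make the two node kinds concrete, here is a minimal sketch that labels each direct child of a tag as a Tag or a NavigableString. It assumes the current bs4 package and the built-in "html.parser" rather than the BeautifulSoup 3 import used in the answer, and the helper name describe_children is made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Assumption: bs4 with the stdlib "html.parser"; the markup is a closed-tag
# excerpt of the first paragraph from the example document.
markup = '<p id="firstpara">This is paragraph <b>one</b>.</p>'
soup = BeautifulSoup(markup, "html.parser")


def describe_children(tag):
    """Return (kind, value) pairs for the direct children of tag (hypothetical helper)."""
    described = []
    for child in tag.children:
        if isinstance(child, Tag):
            described.append(("Tag", child.name))
        elif isinstance(child, NavigableString):
            described.append(("NavigableString", str(child)))
    return described


kinds = describe_children(soup.p)
for kind, value in kinds:
    print(kind, repr(value))
```

The text before and after the b tag shows up as two separate NavigableString children, which is why bare text sitting between tags can be picked out on its own.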
To move down the parse tree you have contents and string.
contents is an ordered list of the Tag and NavigableString objects contained within a page element.
If a tag has only one child node, and that child node is a string, the child node is available as tag.string as well as tag.contents[0].
For the example above, that means you can get:
soup.b.string
# u'one'
soup.b.contents[0]
# u'one'
When a tag has several child nodes, you get, for instance:
pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
So here you can work with contents and pick out the item at the index you want.
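As a small sketch of the difference between string and contents, assuming the current bs4 package with the built-in "html.parser" (not the BeautifulSoup 3 import used in the answer): string only works when a tag has a single string child, while contents can always be indexed.

```python
from bs4 import BeautifulSoup

# Assumption: bs4 with the stdlib "html.parser"; closed-tag excerpt of the example.
soup = BeautifulSoup('<p>This is paragraph <b>one</b>.</p>', "html.parser")
p_tag = soup.p

single = soup.b.string            # <b> has exactly one string child
several = p_tag.string            # <p> has three children, so .string is None
first_piece = p_tag.contents[0]   # text before <b>
last_piece = p_tag.contents[-1]   # text after <b>

print(repr(single), several, repr(first_piece), repr(last_piece))
```

Indexing contents by position is brittle when the markup changes, so filtering the children by type is often the safer choice.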
You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance,
for i in soup.body:
    print(i)
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
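Applied to the HTML from the question, the same idea can be sketched as follows. This is one possible approach rather than the answer's own code: it assumes the current bs4 package with the built-in "html.parser", uses a tidied copy of the question's markup (a tr row added, br written as a normal tag), and the helper name direct_text is made up for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Assumption: tidied version of the markup from the question.
html = """<html><body><table><tr>
<td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
<br/>
</td>
</tr></table></body></html>"""


def direct_text(tag):
    """Collect non-empty text that is a direct child of tag (hypothetical helper)."""
    pieces = []
    for child in tag.contents:
        # Comment is a NavigableString subclass, so it must be skipped first.
        if isinstance(child, Comment):
            continue
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:  # drop whitespace-only strings between tags
                pieces.append(text)
    return pieces


soup = BeautifulSoup(html, "html.parser")
for hit in soup.find_all(attrs={"class": "MYCLASS"}):
    print(direct_text(hit))
```

Text inside the nested a and p tags is ignored because only direct children of the td are inspected.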