Python BeautifulSoup extract text between element

Views: 24 | Published: 2021/12/23 19:45:24 | python beautifulsoup


Problem description

I am trying to extract "THIS IS MY TEXT" from the following HTML:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

I tried it this way:

# html is a string holding the document shown above
soup = BeautifulSoup(html)

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print(hit.text)

But this gives me all the text inside all the nested tags, plus the comment.

Can anyone help me get just "THIS IS MY TEXT" out of this?

Recommended answer

Learn more about how to navigate the parse tree in BeautifulSoup. The parse tree is made up of Tags and NavigableStrings (a bare piece of text such as THIS IS MY TEXT is a NavigableString). An example:

# Uses the beautifulsoup4 package (imported as bs4) on Python 3
from bs4 import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>',
       '</body></html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

To move down the parse tree, you have contents and string.

contents is an ordered list of the Tag and NavigableString objects contained within a page element.

If a tag has only one child node, and that child node is a string, the child node is available as tag.string as well as tag.contents[0].

For the example above, this means you can get:

soup.b.string
# 'one'
soup.b.contents[0]
# 'one'
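As a small illustration of the rule about tag.string, the sketch below compares a tag with a single string child against a tag with several children. It assumes the current beautifulsoup4 package (imported as bs4) and the built-in html.parser; the variable names single and multiple are made up for the example.

```python
from bs4 import BeautifulSoup

# Assumption: beautifulsoup4 is installed; html.parser ships with Python.
soup = BeautifulSoup("<p>This is paragraph <b>one</b>.</p>", "html.parser")

single = soup.b.string    # <b> has exactly one child, and it is a string
multiple = soup.p.string  # <p> has three children, so .string is None

print(repr(single))
print(multiple)
```

When .string comes back as None, contents is the attribute to fall back on.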

When a tag has several child nodes, you get, for instance:

p_tag = soup.p
p_tag.contents
# ['This is paragraph ', <b>one</b>, '.']

So here you can work with contents and pick out the item at the index you want.
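Applied to the HTML from the question, one possible approach is to walk contents of the matching cell and keep only the strings that sit directly inside it. This is a sketch rather than the answerer's own code: it assumes the beautifulsoup4 package with the built-in html.parser, the helper name direct_text is hypothetical, and the sample markup writes href and <br/> in their valid forms.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Sample markup modelled on the question (href and <br/> written in valid form).
html = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      <br/>
   </td>
</table>
"""


def direct_text(tag):
    """Hypothetical helper: stripped, non-empty strings that are direct children of tag."""
    pieces = []
    for child in tag.contents:
        # Comment is a NavigableString subclass, so exclude it explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            text = child.strip()
            if text:  # skip the whitespace-only strings between tags
                pieces.append(text)
    return pieces


soup = BeautifulSoup(html, "html.parser")
results = [direct_text(td) for td in soup.find_all(attrs={"class": "MYCLASS"})]
print(results)
```

Filtering by type instead of using a fixed index keeps the sketch working if the number of sibling tags around the text changes. Text nested inside the a and p tags is skipped because only direct children are inspected.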

You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance:

for i in soup.body:
    print(i)
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
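Iteration yields both kinds of node, so a loop usually needs to tell them apart. The sketch below labels each direct child of a tag as either a tag or a piece of text. It assumes the beautifulsoup4 package and the built-in html.parser, and describe_children is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup(
    '<body><p id="firstpara">This is paragraph <b>one</b>.</p></body>',
    "html.parser",
)


def describe_children(tag):
    """Hypothetical helper: (kind, value) pairs for the direct children of tag."""
    described = []
    for child in tag:  # iterating a Tag walks its contents in order
        if isinstance(child, Tag):
            described.append(("tag", child.name))
        elif isinstance(child, NavigableString):
            described.append(("text", str(child)))
    return described


children = describe_children(soup.p)
for kind, value in children:
    print(kind, repr(value))
```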
