Python BeautifulSoup: extract text between elements


Question

I am trying to extract "THIS IS MY TEXT" from the following HTML:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

I tried it this way:

soup = BeautifulSoup(html)

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text

But I get all the text between all the nested tags, plus the comment.

Can anyone help me get just "THIS IS MY TEXT" out of this?

Recommended answer

Learn more about how to navigate the parse tree in BeautifulSoup. The parse tree is made up of Tags and NavigableStrings (a bare piece of text such as THIS IS MY TEXT is a NavigableString). An example, written for BeautifulSoup 3 and Python 2:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

To move down the parse tree, use contents and string.

contents is an ordered list of the Tag and NavigableString objects contained within a page element.
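
As a rough illustration of contents and string on a current setup, the sketch below assumes the bs4 package (BeautifulSoup 4) on Python 3 with the built-in html.parser, which is not the BeautifulSoup 3 import used in this answer. The one-line HTML snippet is a shortened stand-in for the example document.

```python
from bs4 import BeautifulSoup

# Assumption: BeautifulSoup 4 (bs4) with the stdlib "html.parser" backend.
soup = BeautifulSoup(
    '<p id="firstpara">This is paragraph <b>one</b>.</p>', "html.parser"
)

# contents lists the direct children of <p> in document order:
# a text node, the <b> tag, and another text node.
for node in soup.p.contents:
    print(type(node).__name__, repr(str(node)))

# <b> has exactly one child and it is a string, so string returns it.
print(soup.b.string)

# <p> has several children, so string is None.
print(soup.p.string)
```

The point of the sketch is that text and tags sit side by side in contents, so a text node can be picked out by its type or by its index.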

  • If a tag has only one child node, and that child node is a string, the child node is available as tag.string as well as tag.contents[0].

  • For the example above, that means you can get:

    soup.b.string
    # u'one'
    soup.b.contents[0]
    # u'one'
    

    For a tag with several child nodes, you get for instance:

    pTag = soup.p
    pTag.contents
    # [u'This is paragraph ', <b>one</b>, u'.']
    

    So here you can work with contents and pick out the item at the index you want.

    You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance:

    for i in soup.body:
        print i
    # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
    # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
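
Applied to the question, one possible approach is to walk the direct children of the matching element and keep only the plain text nodes. This is a minimal sketch, assuming BeautifulSoup 4 (the bs4 package) on Python 3 with the built-in html.parser. The helper name direct_text is hypothetical, and the HTML is a trimmed copy of the markup from the question.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the question's HTML (assumption: parsed with bs4 on Python 3).
html = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def direct_text(tag):
    """Hypothetical helper: non-empty text nodes that are direct children of tag."""
    pieces = []
    for child in tag.children:
        # Comment is a subclass of NavigableString, so it is excluded explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            text = child.strip()
            if text:  # skip whitespace-only nodes between the tags
                pieces.append(text)
    return pieces


for hit in soup.find_all(class_="MYCLASS"):
    print(direct_text(hit))
```

Filtering by node type avoids depending on a fixed index into contents, which would shift whenever the whitespace or the surrounding tags change.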
    
