Python BeautifulSoup: extract text between elements


Problem description

I am trying to extract "THIS IS MY TEXT" from the following HTML:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

I tried it this way:

soup = BeautifulSoup(html)

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text

But I get the text of all the nested tags, plus the comment.

Can anyone help me get just "THIS IS MY TEXT" out of this?

Solution

Learn more about how to navigate the parse tree in BeautifulSoup. The parse tree is made up of tags and NavigableStrings (such as "THIS IS MY TEXT"). An example:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

To move down the parse tree, you have contents and string:

  • contents is an ordered list of the Tag and NavigableString objects contained within a page element

  • if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]

For the document above, this means you can get:

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

For a tag with several child nodes, you get for instance:

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']

So here you can work with contents and pick out the item at the index you want.
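
As a rough illustration of indexing into contents, here is a minimal sketch. It assumes the newer bs4 package (BeautifulSoup 4) on Python 3 with the built-in "html.parser", which is not what the answer above uses, so the import and the constructor call differ from the snippets above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: bs4 (BeautifulSoup 4) with the standard-library parser.
html = '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
soup = BeautifulSoup(html, "html.parser")
p_tag = soup.p

# Label each direct child of <p> as a text node or a tag.
children = [
    ("string" if isinstance(node, NavigableString) else "tag", str(node))
    for node in p_tag.contents
]
print(children)

# Index into contents to pick out a single child.
first_text = p_tag.contents[0]
print(repr(str(first_text)))

# .string is only set when a tag has exactly one child; otherwise it is None.
print(soup.b.string, p_tag.string)
```

Checking the type of each child is what lets you tell plain text apart from nested tags before choosing an index.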

You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance,

for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
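
Putting this together for the HTML in the question, one possible approach is to walk the direct children of the matching td and keep only the plain text nodes, skipping comments and whitespace. This is a sketch under assumptions rather than the answer's own code: it uses the bs4 package (BeautifulSoup 4) on Python 3 with "html.parser", and direct_text is a hypothetical helper name introduced here for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

HTML = """<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>"""


def direct_text(tag):
    """Return the non-empty text nodes that are direct children of tag.

    Hypothetical helper: comments and whitespace-only strings are skipped,
    and text inside nested tags is ignored.
    """
    texts = []
    for node in tag.contents:
        if isinstance(node, Comment):
            continue  # Comment is a NavigableString subclass, so filter it first
        if isinstance(node, NavigableString):
            stripped = node.strip()
            if stripped:
                texts.append(stripped)
    return texts


soup = BeautifulSoup(HTML, "html.parser")
for hit in soup.find_all(attrs={"class": "MYCLASS"}):
    print(direct_text(hit))
```

Filtering by node type avoids hard-coding an index into contents, which would break if the surrounding markup gained or lost a child.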
