Python BeautifulSoup: extract text between elements
I am trying to extract "THIS IS MY TEXT" from the following HTML:
<html>
<body>
<table>
<td class="MYCLASS">
<!-- a comment -->
<a hef="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</br>
</td>
</table>
</body>
</html>
I tried it this way:
soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text
But this returns the text of all nested tags plus the comment.
Can anyone help me get just "THIS IS MY TEXT" out of this?
Learn more about how to navigate through the parse tree in BeautifulSoup. The parse tree is made of Tags and NavigableStrings (plain text nodes such as THIS IS MY TEXT). An example:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
To move down the parse tree you have contents and string. Quoting the documentation:
- contents is an ordered list of the Tag and NavigableString objects contained within a page element.
- If a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0].
For the example above, that means you can get:
soup.b.string
# u'one'
soup.b.contents[0]
# u'one'
For a tag with several child nodes you get, for instance:
pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
So here you can work with contents and pick out the item at the index you want.
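Applied to the markup from the question, that idea might look like the sketch below. It assumes the newer bs4 package (BeautifulSoup 4) on Python 3 with the standard-library "html.parser" backend, not the BeautifulSoup 3 import used above. The names HTML, cell, by_index and by_sibling are made up for illustration, and the markup is a trimmed copy of the question's.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
HTML = """<table>
<td class="MYCLASS">
<!-- a comment -->
<a hef="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</td>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find(attrs={"class": "MYCLASS"})

# Option 1: index into contents. The newline strings and the comment
# count as children too, so the wanted text sits at position 6 here.
by_index = cell.contents[6].strip()

# Option 2: start at the first <p> and take the node right after it.
by_sibling = cell.p.next_sibling.strip()

print(by_index)
print(by_sibling)
```

A hard-coded index is fragile: it shifts whenever the whitespace or the surrounding tags change, and a different parser backend may build a different tree. Navigating from a known neighbour tends to survive small markup changes better.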
You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance:
for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
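Iteration also gives a way to answer the original question without knowing any index: walk the direct children of the cell and keep only the plain text nodes, skipping comments and whitespace. The following is a minimal sketch under the same assumptions as before (bs4 on Python 3, "html.parser" backend); direct_text is a hypothetical helper, not part of the library.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the markup from the question.
HTML = """<table>
<td class="MYCLASS">
<!-- a comment -->
<a hef="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</td>
</table>"""


def direct_text(tag):
    """Return the non-empty text nodes that are direct children of tag."""
    pieces = []
    for node in tag:  # iterating a Tag yields its direct children
        if isinstance(node, Comment):
            continue  # Comment is a NavigableString subclass, so skip it first
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:  # drop whitespace-only strings between tags
                pieces.append(text)
    return pieces


soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find(attrs={"class": "MYCLASS"})
result = direct_text(cell)
print(result)
```

Because nested tags such as the a and p elements are skipped entirely, their text never shows up in the result, which is the difference from hit.text in the question.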