BeautifulSoup - How to extract text after a specified string
Problem description
I have HTML like this:
<tr>
<td>Title:</td>
<td>Title value</td>
</tr>
I need to identify a <td> by its text and then get the text of the <td> that follows it. Something like: grab the text of the first <td> after the <td> that contains the text Title:. The result should be: Title value
I have some basic understanding of Python and BeautifulSoup, but I have no idea how to do this when there is no class to specify.
I have tried this:
row = soup.find_all('td', string='Title:')
text = str(row.nextSibling)
print(text)
and I receive the error: AttributeError: 'ResultSet' object has no attribute 'nextSibling'
Recommended answer
First of all, soup.find_all() returns a ResultSet, which contains all the elements with tag td whose string is Title:. A ResultSet is a list of matches, so it has no nextSibling attribute of its own.
For each such element in the result set, you need to get its nextSibling separately. You should also loop until you reach a nextSibling whose tag is td, since other nodes (such as a NavigableString holding the whitespace between tags) can sit in between.
Example:
>>> from bs4 import BeautifulSoup
>>> s = """<tr>
... <td>Title:</td>
... <td>Title value</td>
... </tr>"""
>>> soup = BeautifulSoup(s, 'html.parser')
>>> row = soup.find_all('td', string='Title:')
>>> for r in row:
...     nextSib = r.nextSibling
...     # Test for None first, then skip non-<td> nodes such as whitespace strings
...     while nextSib is not None and nextSib.name != 'td':
...         nextSib = nextSib.nextSibling
...     if nextSib is not None:
...         print(nextSib.text)
...
Title value
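The manual sibling loop above can likely be shortened with find_next_sibling(), which skips intervening text nodes on its own. The following is a minimal sketch, not part of the original answer; the helper name text_after_label is hypothetical and introduced only for illustration.

```python
from bs4 import BeautifulSoup

html = """<tr>
<td>Title:</td>
<td>Title value</td>
</tr>"""


def text_after_label(soup, label):
    """Return the text of the first <td> following the <td> whose string equals label.

    Hypothetical helper for illustration; returns None when either cell is missing.
    """
    # find() returns a single Tag (or None), unlike find_all(), which returns a ResultSet.
    label_cell = soup.find("td", string=label)
    if label_cell is None:
        return None
    # find_next_sibling("td") skips whitespace strings and other non-<td> siblings.
    value_cell = label_cell.find_next_sibling("td")
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


soup = BeautifulSoup(html, "html.parser")
print(text_after_label(soup, "Title:"))
```

Using find() here assumes only the first matching label matters; to handle several matching rows, iterate over find_all() and call find_next_sibling("td") on each element.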
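Alternatively, you can use a library that supports XPath, which makes this easy. Options include lxml and xml.etree (note that xml.etree implements only a limited subset of XPath).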
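As a rough illustration of the XPath route, here is a sketch using lxml. It assumes lxml is installed; the wrapping <table> element and the helper name xpath_text_after_label are additions for this example and do not come from the original answer.

```python
from lxml import etree

# The row is wrapped in <table> (an assumption) so the HTML parser sees a valid table.
markup = """<table>
<tr>
<td>Title:</td>
<td>Title value</td>
</tr>
</table>"""


def xpath_text_after_label(tree, label):
    """Return the text of the first <td> after the <td> whose text equals label, or None.

    Hypothetical helper for illustration only.
    """
    # $label is an XPath variable; lxml fills it in from the keyword argument.
    results = tree.xpath(
        "//td[normalize-space()=$label]/following-sibling::td[1]/text()",
        label=label,
    )
    return results[0].strip() if results else None


tree = etree.HTML(markup)
print(xpath_text_after_label(tree, "Title:"))
```

The following-sibling axis expresses "the next <td> after this one" directly, so no explicit loop over siblings is needed.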