BeautifulSoup: find the next specific tag following a found tag
Question
Given the following (simplified from a larger document):
<tr class="row-class">
<td>Age</td>
<td>16</td>
</tr>
<tr class="row-class">
<td>Height</td>
<td>5.6</td>
</tr>
<tr class="row-class">
<td>Weight</td>
<td>103.4</td>
</tr>
I have tried to return the 16 from the appropriate row using bs4 and lxml. The issue seems to be that there is a NavigableString between the two td tags, so that
page.find_all("tr", {"class":"row-class"})
yields a result set with
result[0] = {Tag} <tr class="row-class"> <td>Age</td> <td>16</td> </tr>
result[1] = {Tag} <tr class="row-class"> <td>Height</td> <td>5.6</td> </tr>
result[2] = {Tag} <tr class="row-class"> <td>Weight</td> <td>103.4</td> </tr>
which is great, but I can't get the string in the second td. The contents of each of these rows are similar to
[' ', <td>Age</td>, ' ', <td>16</td>, ' ']
with the td being a Tag and the ' ' being a NavigableString. This difference is preventing me from using the next_element or next_sibling convenience methods to access the correct text.
If I use:
find("td", text=re.compile(r'Age')).get_text()
I get Age. But if I try to access the next element via
find("td", text=re.compile(r'Age')).next_element()
I get
'NavigableString' object is not callable
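A note on this error: as far as I can tell, it comes from the parentheses rather than from the whitespace. In bs4, next_element is an attribute, not a method, so appending () tries to call whatever it returns. For a td tag, that is the string inside the tag itself (here "Age"), not the whitespace between the cells. A minimal sketch, assuming the html.parser backend, a one-row sample, and the string= keyword (the newer spelling of text=):

```python
import re
from bs4 import BeautifulSoup

# One-row sample, assumed for illustration
html = '<tr class="row-class">\n<td>Age</td>\n<td>16</td>\n</tr>'
soup = BeautifulSoup(html, "html.parser")

label = soup.find("td", string=re.compile(r"Age"))

# next_element is an attribute: the next node in parse order,
# which for <td>Age</td> is the text node inside it
inner = label.next_element
print(type(inner).__name__, repr(str(inner)))

# Calling the attribute reproduces the error from the question
try:
    label.next_element()
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

So next_element without parentheses is valid, but it descends into the tag before it moves on to the next cell.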
Because of the wrapping NavigableStrings in the result, moving backwards with previous_element has the same problem.
How do I move from the found Tag to the next Tag, skipping the next_element in between? Is there a way to remove these ' ' from the result?
I should point out that I've already tried to be pragmatic with something like:
for r in sp.find_all("tr", {"class": "row-class"}):
    age = r.find("td", text=re.compile(r"\d\d")).get_text()
It works ... until I parse a document that has another order with a matching \d\d before Age.
I also know that I could use
find("td", text=re.compile(r'Age')).next_sibling.next_sibling
but that is hard-baking the structure in.
So I need to be specific in the search and find the td that has the target string, then find the value in the next td. I know I could build some logic that tests each row, but it seems like I'm missing something obvious and more elegant...
Recommended answer
If you get a list of elements, you can use [index] to get an element from the list.
data = """<tr class="row-class">
<td>Age</td>
<td>16</td>
</tr>
<tr class="row-class">
<td>Height</td>
<td>5.6</td>
</tr>
<tr class="row-class">
<td>Weight</td>
<td>103.4</td>
</tr>"""
from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")
trs = soup.find_all("tr", {"class": "row-class"})
for tr in trs:
    tds = tr.find_all("td")  # find_all returns a list of the row's td tags only
    print('text:', tds[0].get_text())   # element [0]: the label cell
    print('value:', tds[1].get_text())  # element [1]: the value cell
Result:
text: Age
value: 16
text: Height
value: 5.6
text: Weight
value: 103.4
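If the goal is to start from the label cell rather than rely on position within the row, find_next_sibling("td") may be closer to what the question asks for: it walks forward through the siblings and returns the first one that is a td tag, so the whitespace strings are skipped without chaining .next_sibling. A minimal sketch, assuming the html.parser backend and the same sample rows; the helper name value_for is hypothetical and not part of bs4:

```python
import re
from bs4 import BeautifulSoup

data = """<tr class="row-class">
<td>Age</td>
<td>16</td>
</tr>
<tr class="row-class">
<td>Height</td>
<td>5.6</td>
</tr>
<tr class="row-class">
<td>Weight</td>
<td>103.4</td>
</tr>"""

soup = BeautifulSoup(data, "html.parser")


def value_for(soup, label):
    """Return the text of the td that follows the td matching `label`.

    Hypothetical helper for illustration; returns None when nothing matches.
    """
    label_td = soup.find("td", string=re.compile(label))
    if label_td is None:
        return None
    # find_next_sibling skips the whitespace NavigableStrings between cells
    value_td = label_td.find_next_sibling("td")
    return value_td.get_text(strip=True) if value_td is not None else None


age = value_for(soup, r"Age")
print(age)

# Alternatively, build a label -> value mapping in one pass over the rows
table = {}
for tr in soup.find_all("tr", class_="row-class"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 2:
        table[cells[0]] = cells[1]
print(table)
```

The sibling lookup does not depend on where the label sits in the document, which is the case the \d\d regex approach in the question breaks on. find_next("td") is a looser alternative that follows document order instead of staying among siblings.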