BeautifulSoup: find the next specific tag following a found tag

Views: 1597 · Published: 2020/9/20 8:22:02 · Tags: python, parsing, beautifulsoup


Problem description

Given the following (simplified from a larger document):

<tr class="row-class">
  <td>Age</td>
  <td>16</td>
</tr>
<tr class="row-class">
  <td>Height</td>
  <td>5.6</td>
</tr>
<tr class="row-class">
  <td>Weight</td>
  <td>103.4</td>
</tr>

I have tried to return the 16 from the appropriate row using bs4 and lxml. The issue seems to be that there is a NavigableString between the two td tags, so that

page.find_all("tr", {"class":"row-class"})

produces a result containing

result[0] = {Tag} <tr class="row-class"> <td>Age</td> <td>16</td> </tr>
result[1] = {Tag} <tr class="row-class"> <td>Height</td> <td>5.6</td> </tr>
result[2] = {Tag} <tr class="row-class"> <td>Weight</td> <td>103.4</td> </tr>

which is great, but I can't get the string in the second td. The contents of each of these rows are similar to

[' ', <td>Age</td>, ' ', <td>16</td>, ' ']

with each td being a Tag and each ' ' being a NavigableString. This difference is preventing me from using the next_element or next_sibling convenience attributes to access the correct text. If I use:

find("td", text=re.compile(r'Age')).get_text()

I get Age. But if I try to access the next element via

find("td", text=re.compile(r'Age')).next_element()

I get

'NavigableString' object is not callable
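A minimal sketch of where that error comes from, under a few assumptions: the row is wrapped in a `<table>`, the built-in `html.parser` is used instead of lxml, and `string=` is used as the newer name for the `text=` argument. `next_element` and `next_sibling` are attributes rather than methods, so adding `()` tries to call the node they hold.

```python
import re
from bs4 import BeautifulSoup

html = """
<table>
<tr class="row-class">
  <td>Age</td>
  <td>16</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find("td", string=re.compile(r"Age"))

# next_element holds the next node in document order, which here is the
# text node inside the label cell itself.
inside = label.next_element

# next_sibling holds the next node at the same level, which here is the
# whitespace between the two <td> tags.
between = label.next_sibling

print(type(inside).__name__, repr(inside))
print(type(between).__name__, repr(between))

# Calling the attribute tries to call the NavigableString it holds.
error = None
try:
    label.next_element()
except TypeError as exc:
    error = exc
print(error)
```

Neither attribute lands on the second `<td>` in one step, which is why both look unusable here.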

Because of the NavigableStrings wrapping the tags in the result, moving backwards with previous_element has the same problem.

How do I move from the found Tag to the next Tag, skipping the next_element in between? Is there a way to remove these ' ' from the result?
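One possible way to leave the whitespace nodes out of the result, sketched under the assumption that the row is wrapped in a `<table>` and parsed with the built-in `html.parser` rather than lxml: either filter `.contents` down to `Tag` instances, or ask the row for its direct `<td>` children.

```python
from bs4 import BeautifulSoup, Tag

html = """
<table>
<tr class="row-class">
  <td>Age</td>
  <td>16</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr", {"class": "row-class"})

# .contents mixes whitespace strings and tags; keep only the tags.
tags_only = [child for child in row.contents if isinstance(child, Tag)]

# Searching the row for its direct <td> children returns the same cells
# without touching the whitespace nodes at all.
cells = row.find_all("td", recursive=False)

print([cell.get_text() for cell in tags_only])
print([cell.get_text() for cell in cells])
```

The tree itself is left unchanged; both lists are just filtered views of the row.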

I should point out that I've already tried to be pragmatic with something like:

    for r in (sp.find_all("tr", {"class":"row-class"})):
        age = r.find("td", text=re.compile(r"\d\d")).get_text()

It works ... until I parse a document where the rows are in a different order and something matching \d\d comes before Age.

I also know I could use

find("td", text=re.compile(r'Age')).next_sibling.next_sibling

but that is hard-baking the structure in.
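To illustrate the concern, here is a small sketch (the two markup variants are made up for the comparison, and `html.parser` is assumed) showing that the double hop depends on a whitespace node sitting between the cells:

```python
from bs4 import BeautifulSoup

# Same row, once with whitespace between the cells and once without.
variants = {
    "pretty": "<table><tr>\n  <td>Age</td>\n  <td>16</td>\n</tr></table>",
    "compact": "<table><tr><td>Age</td><td>16</td></tr></table>",
}

results = {}
for name, html in variants.items():
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", string="Age")
    # Two hops: one over the assumed whitespace node, one onto the value.
    results[name] = cell.next_sibling.next_sibling
    print(name, repr(results[name]))
```

With the whitespace present the second hop reaches the value cell; without it the first hop is already the value cell and the second hop runs off the end of the row.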

So I need to be specific in the search: find the td that has the target string, then find the value in the next td. I know I could build some logic that tests each row, but it seems like I'm missing something obvious and more elegant...
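One way this could be approached is with `find_next_sibling`, which searches the following siblings for a matching tag and so skips the text nodes in between. The sketch below is illustrative only: `value_after` is a hypothetical helper name, the rows are reordered on purpose, and `html.parser` with the `string=` argument is assumed in place of lxml and `text=`.

```python
import re
from bs4 import BeautifulSoup

html = """<table>
<tr class="row-class">
  <td>Weight</td>
  <td>103.4</td>
</tr>
<tr class="row-class">
  <td>Age</td>
  <td>16</td>
</tr>
</table>"""


def value_after(soup, label):
    """Hypothetical helper: text of the <td> that follows the <td> matching label."""
    label_cell = soup.find("td", string=re.compile(re.escape(label)))
    if label_cell is None:
        return None
    # find_next_sibling only considers tags that match, so the whitespace
    # NavigableString between the two cells is skipped.
    value_cell = label_cell.find_next_sibling("td")
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


soup = BeautifulSoup(html, "html.parser")
print(value_after(soup, "Age"))
print(value_after(soup, "Weight"))
```

Because the lookup starts from the label cell rather than from a digit pattern, the order of the rows does not matter.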
