How can I get text out of a <dt> tag with a <span> inside?

Question

I'm trying to extract the text from a <dt> tag that has a <span> inside it, on www.uszip.com:

Here is an example of what I'm trying to get:

<dt>Land area<br><span class="stype">(sq. miles)</span></dt>
<dd>14.28</dd>

I want to get the 14.28 out of the tag. This is how I'm currently approaching it:

Note: soup is the BeautifulSoup version of the entire webpage's source code:

soup.find("dt",text="Land area").contents[0]

However, this gives me:

AttributeError: 'NoneType' object has no attribute 'contents'

I've tried a lot of things and I'm not sure how to approach this. This method works for some of the other data on this page, like:

<dt>Total population</dt>
<dd>22,234<span class="trend trend-down" title="-15,025 (-69.77% since 2000)">&#9660;</span></dd>

Using soup.find("dt",text="Total population").next_sibling.contents[0] returns '22,234'.

How should I first identify the correct tag and then get the right data out of it?

Recommended answer

Unfortunately, you cannot match a tag that contains both text and nested tags based on the contained text alone.

You'd have to loop over all <dt> tags without text:

for dt in soup.find_all('dt', text=False):
    # dt.text combines the strings of the tag and all nested tags
    if 'Land area' in dt.text:
        print(dt.contents[0])

This sounds counter-intuitive, but the .string attribute of such a tag is empty (None), because the tag has more than one child, and .string is what BeautifulSoup matches the text argument against. .text contains all strings in all nested tags combined, and that is not matched against.
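To make the difference between .string and .text concrete, here is a minimal sketch. The HTML is a trimmed-down stand-in built from the two snippets in the question, not the real page, and it uses the string argument, which is the newer name for text in Beautiful Soup 4.4 and later:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page, assembled from the question's snippets.
html = """
<dl>
<dt>Land area<br><span class="stype">(sq. miles)</span></dt>
<dd>14.28</dd>
<dt>Total population</dt>
<dd>22,234</dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")
land_dt, pop_dt = soup.find_all("dt")

# The first <dt> has three children (text, <br>, <span>), so .string is None.
print(land_dt.string)      # None
# get_text() (the same as .text) joins every nested string.
print(land_dt.get_text())  # Land area(sq. miles)
# The second <dt> has a single text child, so .string is that text.
print(pop_dt.string)       # Total population

# Matching on the string therefore finds the simple tag but not the nested one.
print(soup.find("dt", string="Total population") is not None)  # True
print(soup.find("dt", string="Land area"))                     # None
```

The last line is why the original call fails: find() returns None, and .contents is then looked up on None.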

You could also use a custom function to do the search:

soup.find_all(lambda t: t.name == 'dt' and 'Land area' in t.text)

This essentially does the same search, with the filter encapsulated in a lambda function.
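Since the question ultimately asks for the 14.28 in the following <dd>, the function-based search can be combined with find_next_sibling. This is only a sketch: value_for_label is a hypothetical helper name, and the HTML is a stand-in assembled from the question's snippets, so the real page may need adjustments.

```python
from bs4 import BeautifulSoup

# Stand-in markup assembled from the question's snippets (not the real page).
html = """
<dl>
<dt>Land area<br><span class="stype">(sq. miles)</span></dt>
<dd>14.28</dd>
<dt>Total population</dt>
<dd>22,234<span class="trend trend-down">&#9660;</span></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")


def value_for_label(soup, label):
    """Hypothetical helper: return the first string of the <dd> that follows
    the <dt> whose combined text contains `label`, or None if not found."""
    # Function filter: compare against the combined text, not .string.
    dt = soup.find(lambda t: t.name == "dt" and label in t.get_text())
    if dt is None:
        return None
    # find_next_sibling skips whitespace-only text nodes between the tags.
    dd = dt.find_next_sibling("dd")
    if dd is None:
        return None
    # Take the first non-empty string, ignoring any nested trend marker.
    return next(dd.stripped_strings, None)


print(value_for_label(soup, "Land area"))         # 14.28
print(value_for_label(soup, "Total population"))  # 22,234
```

Using find_next_sibling("dd") instead of .next_sibling is a deliberate choice here: .next_sibling returns whatever node comes next, which is a newline string whenever the markup has whitespace between </dt> and <dd>.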
