How to obtain the content between a tag and its ending in HTML using Python's Beautiful Soup?


Question

My HTML line looks like this:

<span class="cd__headline-text">Is this model too thin for Yves Saint Laurent? </span>

I would like to extract the title, i.e. "Is this model too thin for Yves Saint Laurent?", from this HTML line. How can I obtain any content between <tag> and </tag>?

I am not really familiar with regex.

Recommended answer

If your element contains only text, use the .string attribute:

headline = soup.find(class_='cd__headline-text')
print(headline.string)
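
One caveat worth keeping in mind: soup.find() returns None when nothing matches, so calling .string on the result would raise an AttributeError. A minimal sketch of a guard follows; the helper name first_text_by_class is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def first_text_by_class(html, class_name):
    """Return the .string of the first element with the given class, or None.

    Hypothetical helper for illustration; not part of the Beautiful Soup API.
    """
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(class_=class_name)
    if element is None:
        # No element carries that class, so there is nothing to extract.
        return None
    # .string is itself None when the element has more than one child node.
    return element.string


html = '<span class="cd__headline-text">Is this model too thin for Yves Saint Laurent? </span>'
result = first_text_by_class(html, "cd__headline-text")
missing = first_text_by_class(html, "no-such-class")
print(repr(result))
print(repr(missing))
```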

If the element also contains other tags, you can either get all the text contained in the current element and its descendants, or get only specific text from the current element.

The element.get_text() function recurses through the element and its child elements, gathers all of their strings, and concatenates them with a separator of your choice (the empty string by default), optionally stripping whitespace from each string first.
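
To illustrate how the separator and strip arguments interact, here is a small sketch. The markup and the class name "title" are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: text with a nested <em> tag and stray whitespace.
markup = '<p class="title">  Hello <em>brave new</em> world  </p>'
soup = BeautifulSoup(markup, "html.parser")
element = soup.find(class_="title")

# Strings are concatenated exactly as they appear in the document.
joined = element.get_text()
# Each string is stripped first, then joined with the default empty separator.
stripped = element.get_text(strip=True)
# Each string is stripped first, then joined with a single space.
spaced = element.get_text(" ", strip=True)

print(repr(joined))
print(repr(stripped))
print(repr(spaced))
```

Note that strip=True on its own glues neighbouring words together, so passing an explicit separator is usually what you want when the element contains nested tags.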

To get only specific strings, you can either iterate over the .strings or .stripped_strings generators, or use the element's .contents (or .children) to access all directly contained nodes and then pick out the instances of the NavigableString type.
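
The difference between the two approaches shows up once tags are nested: .strings yields text from every descendant, while filtering the direct children keeps only the text that sits immediately inside the element. A short sketch, using made-up markup and an assumed id of "box":

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup with two levels of nesting.
markup = '<div id="box">intro <b>bold <i>deep</i></b> outro</div>'
soup = BeautifulSoup(markup, "html.parser")
box = soup.find(id="box")

# Every string inside the element, including those in nested tags.
all_strings = [str(s) for s in box.strings]

# Only the text nodes that are direct children of the element.
direct_strings = [
    str(child) for child in box.children if isinstance(child, NavigableString)
]

print(all_strings)
print(direct_strings)
```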

Demo with your sample:

>>> from bs4 import BeautifulSoup
>>> markup = '<span class="cd__headline-text">Is this model too thin for Yves Saint Laurent? </span>'
>>> soup = BeautifulSoup(markup, 'html.parser')  # name the parser explicitly
>>> headline = soup.find(class_='cd__headline-text')
>>> print(headline.string)
Is this model too thin for Yves Saint Laurent? 
>>> print(list(headline.strings))
['Is this model too thin for Yves Saint Laurent? ']
>>> print(list(headline.stripped_strings))
['Is this model too thin for Yves Saint Laurent?']
>>> print(headline.get_text())
Is this model too thin for Yves Saint Laurent? 
>>> print(headline.get_text(strip=True))
Is this model too thin for Yves Saint Laurent?

And with an additional element added:

>>> markup = '<span class="cd__headline-text">Is this model <em>too thin</em> for Yves Saint Laurent? </span>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> headline = soup.find(class_='cd__headline-text')
>>> headline.string is None  # more than one child, so .string is None
True
>>> print(list(headline.strings))
['Is this model ', 'too thin', ' for Yves Saint Laurent? ']
>>> print(list(headline.stripped_strings))
['Is this model', 'too thin', 'for Yves Saint Laurent?']
>>> print(headline.get_text())
Is this model too thin for Yves Saint Laurent? 
>>> print(headline.get_text(' - ', strip=True))
Is this model - too thin - for Yves Saint Laurent?
>>> headline.contents
['Is this model ', <em>too thin</em>, ' for Yves Saint Laurent? ']
>>> from bs4 import NavigableString
>>> [el for el in headline.children if isinstance(el, NavigableString)]
['Is this model ', ' for Yves Saint Laurent? ']
