beautifulsoup .get_text() is not specific enough for my HTML parsing


Problem description

Given the HTML code below, I want to output just the text of the h1, but not the "Details about  ", which is the text of the span nested inside the h1.

My current output is:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

What I want:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

Here is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

Here is my current code:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    print(line.get_text())

Note: I do not want to simply truncate the string, because I would like this code to be reusable. Ideally, the code would crop out any text that is enclosed by the span.

Recommended answer

You can use extract() (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Removing%20elements) to remove all span tags:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Detach every <span> inside this h1 from the tree before reading its text
    for s in line('span'):
        s.extract()
    print(line.get_text())
    # => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
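
Since the question asks for something reusable, the same idea can be wrapped in a small function. The following is only a minimal sketch, not part of the original answer. It assumes bs4 4.4.0 or later (needed for copy.copy() on a tag) and the built-in "html.parser"; text_without is a hypothetical helper name, not a BeautifulSoup API. It works on a copy of the tag, so the spans stay in the parsed tree in case they are needed later.

```python
import copy

from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# The HTML snippet from the question
html = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about  &nbsp;</span>'
    "New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black"
    "</h1>"
)


def text_without(tag, excluded="span"):
    """Hypothetical helper: text of `tag`, ignoring any nested `excluded` elements."""
    clone = copy.copy(tag)  # detached copy, so the parsed tree is left untouched
    for child in clone.find_all(excluded):
        child.extract()
    return clone.get_text().strip()


soup = BeautifulSoup(html, "html.parser")
titles = [text_without(h1) for h1 in soup.find_all("h1", attrs={"itemprop": "name"})]
print(titles[0])
# New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
```

Passing a different tag name as the second argument, for example text_without(h1, "em"), would drop that element's text instead. If mutating the tree is acceptable, calling extract() directly as in the answer above is simpler.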
