Using BeautifulSoup to parse a string efficiently

Question

I am trying to parse this HTML to get the item title (e.g. Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW):

<div style="" class="">
    <h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW</h1>
            <h2 id="subTitle" class="it-sttl">
            Brand New + Free Shipping, Satisfaction Guaranteed! </h2>
    <!-- DO NOT change linkToTagId="rwid" as the catalog response has this ID set  -->
    <div class="vi-hdops-three-clmn-fix">           
        <div style="" class="vi-notify-new-bg-wrapper">
                <div class="vi-notify-new-bg-dTop" style=""> </div>
                <div id="vi_notification_new" class="vi-notify-new-bg-dBtm" style="top: -28px;"> 
                    <img src="https://ir.ebaystatic.com/rs/v/tnj4p1myre1mpff12w4j1llndmc.png" width="11" height="12" class="vi-notify-new-img" alt="Popular">
                    <span style="font-weight:bold;">5 sold in last 24 hours</span>
                </div>
            </div>
        </div>      
    </div>

I am using the following code to parse the page:

import requests
from bs4 import BeautifulSoup

url1 = "https://www.ebay.com/itm/Big-Boss-Air-Fryer-Healthy-1300-Watt-Super-Sized-16-Quart-Fryer-5-Colors-NEW/122454150244?epid=2254405949&hash=item1c82d60c64:m:mqfT2XbgveSevmN5MV1iysg"

def get_single_item_data(item_url):
    # Download the page and parse it with the built-in HTML parser
    response = requests.get(item_url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Find every <h1 class="it-ttl"> and print its .string
    for item in soup.find_all('h1', {'class': 'it-ttl'}):
        print(item.string)  # prints None; item.text returns the full text instead

get_single_item_data(url1)

When I do this, BeautifulSoup returns 'None'.

One solution I found is to use print(item.text) instead, but now I get 'Details about  Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW' (I do not want the 'Details about ' prefix).

Is there an efficient way to get the item title without having to get the full text and then strip off 'Details about '?

Recommended answer

This happens because of a documented caveat of the .string attribute:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None

The h1 element here contains more than one child (the hidden span and the title text), so .string cannot be determined and evaluates to None.
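The caveat can be reproduced in isolation. The following is a minimal sketch, assuming bs4 is installed; the two HTML snippets and the variable names are made up for illustration and are trimmed-down versions of the markup in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical snippets, shortened from the question's markup for illustration.
single_child = BeautifulSoup("<h2>Brand New</h2>", "html.parser").h2
multi_child = BeautifulSoup(
    '<h1><span class="g-hdn">Details about </span>Big Boss Air Fryer</h1>',
    "html.parser",
).h1

# One child (a single text node): .string is that text.
single_string = single_child.string

# Two children (a <span> and a text node): .string is None.
multi_string = multi_child.string

# .get_text() joins every descendant text node, including the span's text.
multi_text = multi_child.get_text()

print(repr(single_string))
print(repr(multi_string))
print(repr(multi_text))
```

This is why item.string prints None for the h1 in the question, while item.text returns the title with the hidden span's text in front of it.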

To avoid having to cut off the "Details about" part afterwards, you can get the first direct text node of the h1 by searching non-recursively:

soup.find('h1', {'class':'it-ttl'}).find(text=True, recursive=False)

Demo:

In [3]: soup = BeautifulSoup(data, "html.parser")

In [4]: print(soup.find('h1', {'class':'it-ttl'}).find(text=True, recursive=False))
Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW
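For a version that runs without fetching the page, the same lookup can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name extract_title and the HTML constant are hypothetical, the h1 is copied from the question, and string=True is used in place of text=True because Beautiful Soup 4.4.0 and later document string as the current name for that argument.

```python
from typing import Optional

from bs4 import BeautifulSoup

# The <h1> from the question, inlined so the example needs no network access.
HTML = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about  &nbsp;</span>'
    "Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, "
    "Fryer 5 Colors -NEW</h1>"
)


def extract_title(html: str) -> Optional[str]:
    """Return the heading's own text, skipping text inside child tags."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", {"class": "it-ttl"})
    if heading is None:
        return None  # the page has no matching heading
    # recursive=False restricts the search to direct children of the <h1>,
    # so the text inside the hidden <span> is never considered.
    title_node = heading.find(string=True, recursive=False)
    return title_node.strip() if title_node is not None else None


title = extract_title(HTML)
print(title)
```

Returning None when the heading is missing keeps the caller from hitting an AttributeError if the page layout changes.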
