Python BeautifulSoup - Scraping div spans and p tags, and how to get an exact match on a div class name


Question

I am trying to scrape two divs that share the same class name, but the page also contains other divs whose class only partially matches, and I don't want those. From the first div I just need the text inside each span element. From the second I need the text inside the span element for the first row, and then the text inside the <p> tags for rows 2 and 3.

I'm not even sure why I need to slice the list of divs. I think it is because the div class col returns more than the 2 relevant divs, but using divs[:1] seems to help.

My questions are: how do I get an exact match on the div name, how do I scrape inside the p tags, and how do I combine the results of the two? I can get the text inside the span tags, as shown below, but as I said above I also need the text inside the p tags, combined with the span results.

The data comes from the player details section at this URL: https://www.skysports.com/football/player/141016/alisson-ramses-becker

The HTML looks like this:

    <div class="row-table details -bp30">
        <div class="col">
            <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>                <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>                <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>               
                        </div>
        <div class="col">
            <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>                <p>Position: Goal Keeper</p>
        </div>
    </div>

The relevant part of my program:

        premier_soup1 = player_soup.find('div', {'class': 'row-table details -bp30'})
        premier_soup_tr = premier_soup1.find_all('div', {'class': 'col'})

        divs = player_soup.find_all( 'div', {'class': 'col'})
        for div in divs[:1]:
            para = div.find_all('p')
            print(para)

Output:

    [<p class="text-h4 title">Player Details</p>, <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>, <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>, <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>, <p>Club: <span itemprop="affiliation">Liverpool</span></p>, <p>Squad: 13</p>, <p>Position: Goal Keeper</p>]                               

Also, I know I can get the span text with this:

    divs = player_soup.find_all('div', {'class': 'col'})
    for div in divs[:1]:
        spans = div.find_all('span')
        for span in spans:
            print(span.text, ",", end=' ')

Output:

    Alisson Ramses Becker , 02/10/1992 ,  Brazil , Liverpool ,

Recommended answer

Assuming you have the right to scrape this site and it offers no API or JSON endpoint, one slow way to do it is:

    from bs4 import BeautifulSoup as bs

    html = '''
    <div class="row-table details -bp30">
        <div class="col">
            <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>
            <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
            <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
        </div>
        <div class="col">
            <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>
            <p>Position: Goal Keeper</p>
        </div>
    </div>
    '''

    # html5lib is a third-party parser and must be installed separately
    soup = bs(html, 'html5lib')

    # One list of <p> tags for each div whose classes include "col"
    data = [d.find_all('p') for d in soup.find_all('div', {'class': 'col'})]

    # Flatten into a single list holding the text of every <p>
    value = []
    for paragraphs in data:
        for p in paragraphs:
            value.append(p.text)

    print(value)
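
The code above does not address the exact-match part of the question. BeautifulSoup matches a class filter such as `{'class': 'col'}` against each individual class of a tag, so a div with `class="col span1"` is returned as well. That may be why the first result in the question contains the "Player Details" title and every paragraph. The following is only a sketch under assumptions: the outer `col span1` wrapper is a hypothetical stand-in for whatever the real page contains, and the built-in `html.parser` is swapped in for `html5lib` to avoid the extra dependency. It keeps only divs whose class list is exactly `['col']`, then splits each `<p>` on its first colon so the span text and the plain paragraph text end up in one dictionary.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the outer "col span1" wrapper is an assumption made
# for illustration, not something copied from the real page.
html = """
<div class="col span1">
  <p class="text-h4 title">Player Details</p>
  <div class="row-table details -bp30">
    <div class="col">
      <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>
      <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
      <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
    </div>
    <div class="col">
      <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>
      <p>Position: Goal Keeper</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A plain class filter also returns the outer wrapper, because "col" is
# one of its classes.
loose = soup.find_all("div", class_="col")

# Exact match: accept only divs whose class list is exactly ["col"].
cols = soup.find_all(
    lambda tag: tag.name == "div" and tag.get("class") == ["col"]
)

# Combine span text and plain <p> text: split each paragraph on its first
# colon into a label and a value.
details = {}
for col in cols:
    for p in col.find_all("p"):
        label, _, value = p.get_text().partition(":")
        details[label.strip()] = value.strip()

print(len(loose), len(cols))
print(details)
```

If the parent div is reliably present, a CSS selector such as `soup.select('div.row-table.details > div.col')` is another way to scope the search to the two relevant columns.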
