Beautifulsoup: When row not present, NaN else value


Question


This code gets data from www.oddsportal.com

How can I accommodate the case where no score is present for an event in this code?

Currently, the code scrapes all data from the pages:

import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import threading
from multiprocessing.pool import ThreadPool
import os
import re

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # Un-comment next line to suppress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')


threadLocal = threading.local()


def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


class GameData:
    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []


def generate_matches(table):
    tr_tags = table.findAll('tr')
    for tr_tag in tr_tags:
        if 'class' in tr_tag.attrs and 'dark' in tr_tag['class']:
            th_tag = tr_tag.find('th', {'class': 'first2 tl'})
            a_tags = th_tag.findAll('a')
            country = a_tags[0].text
            league = a_tags[1].text
        else:
            td_tags = tr_tag.findAll('td')
            yield td_tags[0].text, td_tags[1].text, td_tags[2].text, td_tags[3].text, \
                  td_tags[4].text, td_tags[5].text, country, league


def parse_data(url, return_urls=False):
    browser = create_driver()
    browser.get(url)
    soup = bs(browser.page_source, "lxml")
    div = soup.find('div', {'id': 'col-content'})
    table = div.find('table', {'class': 'table-main'})
    h1 = soup.find('h1').text
    print(h1)
    m = re.search(r'\d+ \w+ \d{4}$', h1)
    game_date = m[0]
    game_data = GameData()
    for row in generate_matches(table):
        game_data.date.append(game_date)
        game_data.time.append(row[0])
        game_data.game.append(row[1])
        game_data.score.append(row[2])
        game_data.home_odds.append(row[3])
        game_data.draw_odds.append(row[4])
        game_data.away_odds.append(row[5])
        game_data.country.append(row[6])
        game_data.league.append(row[7])

    if return_urls:
        span = soup.find('span', {'class': 'next-games-date'})
        a_tags = span.findAll('a')
        urls = ['https://www.oddsportal.com' + a_tag['href'] for a_tag in a_tags]
        return game_data, urls
    return game_data


if __name__ == '__main__':
    results = None
    pool = ThreadPool(5)  # We will be getting, however, 7 URLs
    # Get today's data and the Urls for the other days:
    game_data_today, urls = pool.apply(parse_data, args=('https://www.oddsportal.com/matches/soccer', True))
    urls.pop(1)  # Remove url for today: We already have the data for that
    game_data_results = pool.imap(parse_data, urls)
    for i in range(8):
        game_data = game_data_today if i == 1 else next(game_data_results)
        result = pd.DataFrame(game_data.__dict__)
        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)

    print(results)
    # print(results.head())
    # ensure all the drivers are "quitted":
    del threadLocal
    import gc

    gc.collect()  # a little extra insurance

When scores are present, table-score is populated.

When scores are not present, table-score is absent.

Right now, the column values for home_odds, away_odds and draw_odds shift one position when table-score is not present, so the data ends up in the wrong columns.
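
To make the shift concrete, here is a rough sketch of the row yielded by generate_matches in each case (the times, odds and team names are copied from the sample output further down; the contents of the sixth cell in the second row are unknown and shown as '?'):

    # Finished game: 8 cells, score at index 2, odds at indices 3-5
    finished_row = ['00:10', 'Defensa y Justicia - Tigre', '0:1 pen.',
                    '+138', '+199', '+214', 'Argentina', 'Copa Argentina']

    # Upcoming game: no score cell, so the odds slide left into indices 2-4
    # and index 5 holds whatever the sixth <td> contains instead
    upcoming_row = ['19:00', 'Olympiacos Piraeus - Antwerp',
                    '-137', '+296', '+371', '?',   # '?' = contents unknown here
                    'Europe', 'Europa League']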

How can I change

    game_data.score.append(row[2])
    game_data.home_odds.append(row[3])
    game_data.draw_odds.append(row[4])
    game_data.away_odds.append(row[5])

such that, if table-score is not present, game_data.score.append(row[2]) appends NaN and the odds are read one column to the left:

game_data.home_odds.append(row[2])
game_data.draw_odds.append(row[3])
game_data.away_odds.append(row[4])

and otherwise everything stays as the output currently is?

Solution

You need to first:

from numpy import nan

And then modify code as follows:

        ...
        # Score present?
        if ':' not in row[2]:
            # No, shift a few columns right:
            row[5], row[4], row[3], row[2] = row[4], row[3], row[2], nan
        game_data.score.append(row[2])
        game_data.home_odds.append(nan if row[3] == '-' else row[3])
        game_data.draw_odds.append(nan if row[4] == '-' else row[4])
        game_data.away_odds.append(nan if row[5] == '-' else row[5])
        ...

Note that generate_matches has to be modified to yield list instances rather than tuple instances, since the code above now requires that the yielded value, i.e. row, be modifiable.
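
As a quick aside (a minimal sketch, not part of the scraper itself): item assignment on a tuple raises a TypeError, which is why the column shift only works once row is a list:

    row = ('+138', '+199', '+214')       # a tuple, as the generator originally yielded
    try:
        row[2], row[1], row[0] = row[1], row[0], float('nan')
    except TypeError as e:
        print(e)                         # 'tuple' object does not support item assignment

    row = list(row)                      # a list, as the modified generator yields
    row[2], row[1], row[0] = row[1], row[0], float('nan')   # now works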

Putting it all together:

import pandas as pd
from numpy import nan
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import threading
from multiprocessing.pool import ThreadPool, Pool
from functools import partial
import os
import re

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # Un-comment next line to suppress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')


threadLocal = threading.local()


def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


class GameData:
    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []


def generate_matches(table):
    tr_tags = table.findAll('tr')
    for tr_tag in tr_tags:
        if 'class' in tr_tag.attrs and 'dark' in tr_tag['class']:
            th_tag = tr_tag.find('th', {'class': 'first2 tl'})
            a_tags = th_tag.findAll('a')
            country = a_tags[0].text
            league = a_tags[1].text
        else:
            td_tags = tr_tag.findAll('td')
            yield [td_tags[0].text, td_tags[1].text, td_tags[2].text, td_tags[3].text, \
                  td_tags[4].text, td_tags[5].text, country, league]


def parse_data(process_pool, url, return_urls=False):
    browser = create_driver()
    browser.get(url)
    # Wait for initial content to be dynamically updated with scores:
    browser.implicitly_wait(5)
    table = browser.find_element_by_xpath('//*[@id="table-matches"]/table')
    # If you do not pass a Pool instance to this function to use
    # multiprocessing for the more CPU-intensive work,
    # then just replace next statement with: return process_page(browser.page_source, return_urls)
    return process_pool.apply(process_page, args=(browser.page_source, return_urls))

def process_page(page_source, return_urls):
    soup = bs(page_source, "lxml")
    div = soup.find('div', {'id': 'table-matches'})
    table = div.find('table', {'class': 'table-main'})
    h1 = soup.find('h1').text
    print(h1)
    m = re.search(r'\d+ \w+ \d{4}$', h1)
    game_date = m[0]
    game_data = GameData()
    for row in generate_matches(table):
        game_data.date.append(game_date)
        game_data.time.append(row[0])
        game_data.game.append(row[1])
        # Score present?
        if ':' not in row[2]:
            # No, shift a few columns right:
            row[5], row[4], row[3], row[2] = row[4], row[3], row[2], nan
        game_data.score.append(row[2])
        game_data.home_odds.append(nan if row[3] == '-' else row[3])
        game_data.draw_odds.append(nan if row[4] == '-' else row[4])
        game_data.away_odds.append(nan if row[5] == '-' else row[5])
        game_data.country.append(row[6])
        game_data.league.append(row[7])

    if return_urls:
        span = soup.find('span', {'class': 'next-games-date'})
        a_tags = span.findAll('a')
        urls = ['https://www.oddsportal.com' + a_tag['href'] for a_tag in a_tags]
        return game_data, urls
    return game_data


if __name__ == '__main__':
    results = None

    pool = ThreadPool(3) # This seems to be optimal for this application
    # Create multiprocessing pool to do the CPU-intensive processing:
    process_pool = Pool(min(5, os.cpu_count())) # 5 seems to be optimal for this application
    # Get today's data and the Urls for the other days:
    game_data_today, urls = pool.apply(parse_data, args=(process_pool, 'https://www.oddsportal.com/matches/soccer', True))
    urls.pop(1)  # Remove url for today: We already have the data for that
    game_data_results = pool.imap(partial(parse_data, process_pool), urls)
    for i in range(8):
        game_data = game_data_today if i == 1 else next(game_data_results)
        result = pd.DataFrame(game_data.__dict__)
        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)

    print(results)
    # print(results.head())
    # ensure all the drivers are "quitted":
    del threadLocal

Prints:

Next Soccer Matches: Today, 10 Sep 2021
Next Soccer Matches: Tuesday, 14 Sep 2021
Next Soccer Matches: Wednesday, 15 Sep 2021
Next Soccer Matches: Thursday, 16 Sep 2021
Next Soccer Matches: Yesterday, 09 Sep 2021
Next Soccer Matches: Sunday, 12 Sep 2021
Next Soccer Matches: Monday, 13 Sep 2021
Next Soccer Matches: Tomorrow, 11 Sep 2021
             date   time                              game     score home_odds draw_odds away_odds     country                league
0     09 Sep 2021  00:00            Cumbaya - Guayaquil SC       1:0      -169      +263      +462     Ecuador               Serie B
1     09 Sep 2021  00:00            FC Tulsa - Indy Eleven       2:1      -104      +265      +237         USA      USL Championship
2     09 Sep 2021  00:05           Pumas Tabasco - Atlante       0:2      +221      +186      +134      Mexico  Liga de Expansion MX
3     09 Sep 2021  00:05                   Panama - Mexico       1:1      +518      +250      -156       World        World Cup 2022
4     09 Sep 2021  00:10        Defensa y Justicia - Tigre  0:1 pen.      +138      +199      +214   Argentina        Copa Argentina
...           ...    ...                               ...       ...       ...       ...       ...         ...                   ...
1987  16 Sep 2021  19:00      Olympiacos Piraeus - Antwerp       NaN      -137      +296      +371      Europe         Europa League
1988  16 Sep 2021  19:15               Academica - Estrela       NaN      -106      +231      +290    Portugal       Liga Portugal 2
1989  16 Sep 2021  21:00               Barnechea - Rangers       NaN      +202      +202      +127       Chile             Primera B
1990  16 Sep 2021  22:00  San Marcos de Arica - S. Morning       NaN      +212      +214      +122       Chile             Primera B
1991  16 Sep 2021  23:30       U. De Concepcion - Coquimbo       NaN      +158      +198      +162       Chile             Primera B

[1992 rows x 9 columns]
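
As a possible follow-up (not part of the answer above, and only an assumption that you want the odds as numbers rather than strings): because missing values are now genuine NaN rather than '-' placeholders, the odds columns can be converted to numeric afterwards, for example:

    # Strip the leading '+' sign and coerce the odds columns to floats;
    # the NaN entries pass through unchanged.
    for col in ('home_odds', 'draw_odds', 'away_odds'):
        results[col] = pd.to_numeric(results[col].str.replace('+', '', regex=False),
                                     errors='coerce')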
