How do you scrape a table when the table is unable to return values? (BeautifulSoup)


Problem Description

The following is my code:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

stats_page = requests.get('https://www.sports-reference.com/cbb/schools/loyola-il/2020.html')
content = stats_page.content
soup = BeautifulSoup(content, 'html.parser')
table = soup.find(name='table', attrs={'id':'per_poss'})

html_str = str(table)
df = pd.read_html(html_str)[0]
df.head()

And I get the error: ValueError: No tables found.

However, when I swap attrs={'id':'per_poss'} with a different table id like attrs={'id':'per_game'} I get an output.

I am not familiar with HTML or scraping, but I noticed that for the tables that work, the HTML looks like this: <table class="sortable stats_table now_sortable is_sorted" id="per_game" data-cols-to-freeze="2">

And for the tables that don't work, the HTML looks like this: <table class="sortable stats_table now_sortable sticky_table re2 le1" id="totals" data-cols-to-freeze="2">

The table classes seem to be different, but I am not sure whether that is what is causing the problem, or how to fix it if it is.

Thank you!

Solution

This is happening because that table is embedded inside an HTML comment <!-- ... -->, so BeautifulSoup's normal parse never sees it as markup and soup.find() returns None.
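A quick way to confirm this (a minimal sketch, assuming the same URL as the question) is to check that the per_poss id only shows up inside comment nodes:

import requests
from bs4 import BeautifulSoup, Comment

URL = "https://www.sports-reference.com/cbb/schools/loyola-il/2020.html"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# The table is not part of the visible DOM, so find() returns None
print(soup.find("table", id="per_poss"))

# ...but its id does appear inside one of the page's HTML comments
comments = soup.find_all(text=lambda t: isinstance(t, Comment))
print(any('id="per_poss"' in c for c in comments))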

You can extract the table by checking whether the text nodes are of type Comment:

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

URL = "https://www.sports-reference.com/cbb/schools/loyola-il/2020.html"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# Collect every comment node on the page and re-parse the comment text as HTML
comments = soup.find_all(text=lambda t: isinstance(t, Comment))
comment_soup = BeautifulSoup(str(comments), "html.parser")

# The per-possession table now shows up inside its wrapper div "div_per_poss"
table = comment_soup.select("#div_per_poss")[0]

# read_html parses every table found in the comments and returns a list of DataFrames
df = pd.read_html(str(comment_soup))
print(df)

Output:

[      Rk             Player   G    GS    MP   FG  ...  AST  STL  BLK  TOV   PF   PTS
0    1.0    Cameron Krutwig  32  32.0  1001  201  ...  133   39   20   81   45   482
1    2.0          Tate Hall  32  32.0  1052  141  ...   70   47    3   57   56   406
2    3.0   Marquise Kennedy  32   6.0   671  110  ...   43   38    9   37   72   294
3    4.0   Lucas Williamson  32  32.0   967   99  ...   53   49    9   57   64   287
4    5.0      Keith Clemons  24  24.0   758   78  ...   47   29    1   32   50   249
5    6.0         Aher Uguak  32  31.0   768   62  ...   61   15    3   59   56   181
6    7.0      Jalon Pipkins  30   1.0   392   34  ...   12   10    1   17   15   101
7    8.0      Paxson Wojcik  30   1.0   327   25  ...   18   14    0   14   23    61
...
...
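If you only want the per-possession table as a single DataFrame rather than a list of every commented-out table, one option (a sketch under the same assumptions; the df_per_poss name is mine) is to parse only the comment that contains it:

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

URL = "https://www.sports-reference.com/cbb/schools/loyola-il/2020.html"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

df_per_poss = None
for c in soup.find_all(text=lambda t: isinstance(t, Comment)):
    if 'id="per_poss"' in c:
        # read_html can target the table by its id via the attrs argument
        df_per_poss = pd.read_html(str(c), attrs={"id": "per_poss"})[0]
        break

print(df_per_poss.head())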
