Using BeautifulSoup to get_text of td tags within a resultset


Problem description


I am extracting table data using BeautifulSoup from this website:https://afltables.com/afl/stats/teams/adelaide/2018_gbg.html


There are many tables with a unique table id, that I have been able to extract using the following:

from bs4 import BeautifulSoup
from selenium import webdriver

stat_dict={'Disposals' : 'sortableTable0',
           'Kicks' : 'sortableTable1',
           'Marks' : 'sortableTable2',
           'Handballs' : 'sortableTable3',
           'Goals' : 'sortableTable4',
           'Behinds' : 'sortableTable5',
           'Hitouts' : 'sortableTable6',
           'Tackles' : 'sortableTable7',
           'Rebounds' : 'sortableTable8',
           'Inside50s' : 'sortableTable9',
           'Clearances': 'sortableTable10',
           'Clangers' : 'sortableTable11',
           'FreesFor' : 'sortableTable12',
           'FreesAgainst' : 'sortableTable13',
           'ContestedPosessions' : 'sortableTable14',
           'UncontestedPosesseions' : 'sortableTable15',
           'ContestedMarks' : 'sortableTable16',
           'MarksInside50' : 'sortableTable17',
           'OnePercenters' : 'sortableTable18',
           'Bounces' : 'sortableTable19',
           'GoalAssists' : 'sortableTable20',
           'Timeplayed' : 'sortableTable21'}

driver = webdriver.Firefox(executable_path='...')
url="https://afltables.com/afl/stats/teams/adelaide/2018_gbg.html"
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, "lxml")

stat_wanted='Disposals'
table = soup.find_all('table', {'id':stat_dict[stat_wanted]})


From the table I have extracted, I'd like to do the equivalent of the code below which works if I use soup.find('tbody'). I know that this probably isn't the best or prettiest way of achieve the result, but I'm just playing around with the code to learn how it all works.

import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

def get_disposals(team_lower_case, nplayers, nrounds):
    values = []  # renamed from `list` to avoid shadowing the built-in
    page = requests.get("https://afltables.com/afl/stats/teams/"
                        + str(team_lower_case) + "/2018_gbg.html")
    soup = BeautifulSoup(page.content, 'html.parser')
    body = soup.find('tbody')  # first table body on the page

    for row in body.find_all('tr'):
        for cell in row.find_all('td'):
            values.append(cell.get_text())

    columns = ['PlayerName']
    for n in range(1, nrounds+1):
        columns.append('R'+str(n))

    df = pd.DataFrame(np.array(values).reshape(nplayers, nrounds+1),
                      columns=columns)
    return df

get_disposals("fremantle",30,8)


I've tried the code below to get the text from all tags but the result isn't replicating what I have been able to achieve when extracting the specific table in the first snippet of code.

for tr in table:
    zxc=tr.find_all('td')
print(zxc)
for var in zxc:
    list=[]
    list.append(var.get_text())
print(list)


But this results in just a list of the tags and their contents, not the contents you'd expect if get_text was working as I would like it to.
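The snippet fails for two reasons: `find_all` returns a ResultSet whose elements are whole `table` tags (so `zxc` ends up holding the `td` tag objects from the last table only), and `list=[]` sits inside the second loop, so the list is reset on every iteration. A minimal corrected sketch, using a small inline HTML sample as a stand-in for the live page so it runs on its own:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the live page, so the fix can be seen in isolation.
html = """
<table id="sortableTable0">
  <tbody>
    <tr><td>Atkins, Rory</td><td>14</td></tr>
    <tr><td>Betts, Eddie</td><td>20</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table", {"id": "sortableTable0"})  # a ResultSet

cell_texts = []                  # create the list once, outside the loops
for table in tables:             # each element is a whole <table> tag
    for td in table.find_all("td"):
        cell_texts.append(td.get_text())

print(cell_texts)  # ['Atkins, Rory', '14', 'Betts, Eddie', '20']
```

The same pattern applies unchanged to the `table` variable from the first snippet.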

Recommended answer


You might find the following approach a bit easier:

import pandas as pd    

tables = pd.read_html("https://afltables.com/afl/stats/teams/adelaide/2018_gbg.html")

for df in tables:
    df.drop(df.columns[9:], axis=1, inplace=True)   # remove unwanted columns
    df.columns = df.columns.droplevel(0)    # remove extra index level

for table in tables:
    print(table[:3], '\n')  # show first 3 rows


This will give you a list of pandas dataframes. Each one contains all the information for each table. So for example, the first one contains Disposals:

         Player    R1    R2    R3    R4    R5    R6    R7  Tot
0  Atkins, Rory  14.0  17.0  22.0  28.0  24.0  28.0  16.0  149
1  Betts, Eddie  14.0  20.0  16.0   6.0   NaN   NaN  10.0   66
2   Brown, Luke  15.0  23.0  23.0  16.0  16.0  24.0  11.0  128 

         Player    R1    R2    R3    R4    R5    R6    R7  Tot
0  Atkins, Rory   8.0  13.0  12.0  16.0  17.0  18.0  10.0   94
1  Betts, Eddie   7.0   6.0  10.0   2.0   NaN   NaN   7.0   32
2   Brown, Luke  10.0  17.0  17.0  10.0  11.0  16.0   9.0   90


You could then use pandas to work with the data.
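For example, assuming the frames come back in the same order as the tables on the page (so `tables[0]` is Disposals — an assumption based on the page layout and the `stat_dict` ordering in the question), each one can be sorted and filtered like any DataFrame. A short sketch using an inline stand-in for `tables[0]`:

```python
import pandas as pd

# Inline stand-in for tables[0] (the Disposals frame from the sample output).
df = pd.DataFrame({
    "Player": ["Atkins, Rory", "Betts, Eddie", "Brown, Luke"],
    "R1": [14.0, 14.0, 15.0],
    "R2": [17.0, 20.0, 23.0],
    "Tot": [149, 66, 128],
})

# Rank players by total disposals for the season.
top = df.sort_values("Tot", ascending=False)
print(top.iloc[0]["Player"])  # Atkins, Rory
```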

