Using BeautifulSoup to get_text of td tags within a resultset
Problem Description
I am extracting table data using BeautifulSoup from this website: https://afltables.com/afl/stats/teams/adelaide/2018_gbg.html
There are many tables, each with a unique table id, which I have been able to extract using the following:
from bs4 import BeautifulSoup
from selenium import webdriver

stat_dict = {
    'Disposals': 'sortableTable0',
    'Kicks': 'sortableTable1',
    'Marks': 'sortableTable2',
    'Handballs': 'sortableTable3',
    'Goals': 'sortableTable4',
    'Behinds': 'sortableTable5',
    'Hitouts': 'sortableTable6',
    'Tackles': 'sortableTable7',
    'Rebounds': 'sortableTable8',
    'Inside50s': 'sortableTable9',
    'Clearances': 'sortableTable10',
    'Clangers': 'sortableTable11',
    'FreesFor': 'sortableTable12',
    'FreesAgainst': 'sortableTable13',
    'ContestedPosessions': 'sortableTable14',
    'UncontestedPosesseions': 'sortableTable15',
    'ContestedMarks': 'sortableTable16',
    'MarksInside50': 'sortableTable17',
    'OnePercenters': 'sortableTable18',
    'Bounces': 'sortableTable19',
    'GoalAssists': 'sortableTable20',
    'Timeplayed': 'sortableTable21',
}

driver = webdriver.Firefox(executable_path='...')
url = "https://afltables.com/afl/stats/teams/adelaide/2018_gbg.html"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")

stat_wanted = 'Disposals'
table = soup.find_all('table', {'id': stat_dict[stat_wanted]})
From the table I have extracted, I'd like to do the equivalent of the code below, which works if I use soup.find('tbody'). I know that this probably isn't the best or prettiest way of achieving the result, but I'm just playing around with the code to learn how it all works.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_disposals(team_lower_case, nplayers, nrounds):
    cells = []
    page = requests.get("https://afltables.com/afl/stats/teams/"
                        + str(team_lower_case) + "/2018_gbg.html")
    soup = BeautifulSoup(page.content, 'html.parser')
    tbody = soup.find('tbody')  # first tbody on the page
    for row in tbody.find_all('tr'):
        for cell in row.find_all('td'):
            cells.append(cell.get_text())
    columns = ['PlayerName']
    for n in range(1, nrounds + 1):
        columns.append('R' + str(n))
    df = pd.DataFrame(np.array(cells).reshape(nplayers, nrounds + 1),
                      columns=columns)
    return df
get_disposals("fremantle",30,8)
I've tried the code below to get the text from all the tags, but the result doesn't replicate what I was able to achieve when extracting the specific table in the first snippet of code.
for tr in table:
    zxc = tr.find_all('td')
    print(zxc)
    for var in zxc:
        list = []
        list.append(var.get_text())
        print(list)
But this results in just a list of the tags and their contents, not the text I'd expect if get_text were working as I intended.
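For what it's worth, the loop above can be made to behave as intended by accumulating into a single list created outside the loops, and by remembering that find_all on the whole document returns tables, so you have to descend through rows before cells. A minimal sketch against a small hypothetical HTML fragment (not the live site):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for one of the site's tables (hypothetical markup).
html = """
<table id="sortableTable0">
  <tbody>
    <tr><td>Atkins, Rory</td><td>14</td></tr>
    <tr><td>Betts, Eddie</td><td>14</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table", {"id": "sortableTable0"})

cells = []                        # build ONE list, outside the loops
for tbl in tables:                # find_all returns whole tables, not rows
    for tr in tbl.find_all("tr"):
        for td in tr.find_all("td"):
            cells.append(td.get_text())  # the text only, not the tag

print(cells)  # ['Atkins, Rory', '14', 'Betts, Eddie', '14']
```

The key difference from the snippet above is that the list is not re-created on every pass of the inner loop, so the extracted text survives past each iteration.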
Recommended Answer
You might find the following approach a bit easier:
import pandas as pd

tables = pd.read_html("https://afltables.com/afl/stats/teams/adelaide/2018_gbg.html")
for df in tables:
    df.drop(df.columns[9:], axis=1, inplace=True)  # remove unwanted columns
    df.columns = df.columns.droplevel(0)           # remove extra index level

for table in tables:
    print(table[:3], '\n')  # show first 3 rows
This will give you a list of pandas dataframes, one per table on the page, each containing all of that table's data. So, for example, the first one contains Disposals:
 Player    R1    R2    R3    R4    R5    R6    R7  Tot
0 Atkins, Rory  14.0  17.0  22.0  28.0  24.0  28.0  16.0  149
1 Betts, Eddie  14.0  20.0  16.0   6.0   NaN   NaN  10.0   66
2 Brown, Luke   15.0  23.0  23.0  16.0  16.0  24.0  11.0  128

 Player    R1    R2    R3    R4    R5    R6    R7  Tot
0 Atkins, Rory   8.0  13.0  12.0  16.0  17.0  18.0  10.0   94
1 Betts, Eddie   7.0   6.0  10.0   2.0   NaN   NaN   7.0   32
2 Brown, Luke   10.0  17.0  17.0  10.0  11.0  16.0   9.0   90
You could then use pandas to work with the data.
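As a quick illustration of that last point, once the tables are dataframes the usual pandas operations apply. A sketch using a small frame shaped like the Disposals output above (values copied from it), sorting players by their season total:

```python
import pandas as pd

# A small frame shaped like the Disposals table above.
df = pd.DataFrame({
    "Player": ["Atkins, Rory", "Betts, Eddie", "Brown, Luke"],
    "Tot": [149, 66, 128],
})

# Rank players by total disposals, highest first.
top = df.sort_values("Tot", ascending=False).reset_index(drop=True)
print(top.loc[0, "Player"])  # Atkins, Rory
```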