python - scraping tables by navigating different options in drop down list


Problem Description

I'm trying to scrape data from this site: https://www.koreabaseball.com/Record/Team/Hitter/Basic1.aspx

The website sets the default year to 2018 (the most recent year), and I want to scrape all available years.

A very similar question was asked 4 years ago, but its answer doesn't seem to work anymore:

Scrape a response list from selected option in drop down list

All it does when I run it is print out the table for the default year, regardless of the parameter I assign.

I can't access different years via the URL, since the URL doesn't change when I select an option in the drop-down box. So I tried using webdriver and XPath.
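Since the goal is every available season, one option is to first read the year values out of the `<select>` element's HTML and then loop over them. A minimal sketch with BeautifulSoup, where the inline markup is a trimmed, hypothetical stand-in for the real dropdown (only the `name` attribute is taken from the page):

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the season <select> markup --
# the real page lists one <option> per available season.
html = """
<select name="ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason">
  <option value="2018" selected="selected">2018</option>
  <option value="2017">2017</option>
  <option value="2016">2016</option>
</select>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect every option's value attribute, in document order.
years = [opt["value"] for opt in soup.select("select option")]
print(years)  # ['2018', '2017', '2016']
```

In the real script, the same extraction would run against `driver.page_source` once, and the outer loop would click each year in turn.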

Here is the code I tried:

from selenium import webdriver
from bs4 import BeautifulSoup as BSoup

url = "https://www.koreabaseball.com/Record/Team/Hitter/Basic1.aspx"

driver = webdriver.Chrome("/Applications/chromedriver")
driver.get(url)

year = 2017
driver.find_element_by_xpath("//select[@name='ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason']/option[@value='"+str(year)+"']").click()
page = driver.page_source
bs_obj = BSoup(page, 'html.parser')

header_row = bs_obj.find_all('table')[0].find('thead').find('tr').find_all('th')
body_rows = bs_obj.find_all('table')[0].find('tbody').find_all('tr')
footer_row = bs_obj.find_all('table')[0].find('tfoot').find('tr').find_all('td')

headings = []
footings = []

for heading in header_row:
    headings.append(heading.get_text())

for footing in footer_row:
    footings.append(footing.get_text())

body = []

for row in body_rows:
    cells = row.find_all('td')
    row_temp = []
    for i in range(len(cells)):
        row_temp.append(cells[i].get_text())
    body.append(row_temp)

driver.quit()
print(headings)
print(body)
print(footings)

I expected the output to print the table for 2017, as I specified, but it actually prints the table for 2018 (the default year). Can anyone give me ideas to solve this problem?

I just found out that what I see via "Inspect" differs from what I get from "View Page Source". Specifically, the page source still shows "2018" as the selected option (which is not what I want), whereas Inspect shows "2017" is selected. I'm still stuck on how to use what "Inspect" shows rather than the page source.

Recommended Answer

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup as BSoup
url = "https://www.koreabaseball.com/Record/Team/Hitter/Basic1.aspx"
driver = webdriver.Chrome("/Applications/chromedriver")
year = 2017
driver.get(url)
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//select[@name='ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason']/option[@value='"+str(year)+"']"))
)
element.click()
# it's better to wait until some text has changed,
# but this will do for now


WebDriverWait(driver, 3).until(
    EC.text_to_be_present_in_element(
        (By.XPATH, "//select[@name='ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$ddlSeason$ddlSeason']/option[@selected='selected']"),
        str(year)
    )
)
#sleep for some time to complete ajax load of the table
#sleep(10)
page = driver.page_source
bs_obj = BSoup(page, 'html.parser')

header_row = bs_obj.find_all('table')[0].find('thead').find('tr').find_all('th')
body_rows = bs_obj.find_all('table')[0].find('tbody').find_all('tr')
footer_row = bs_obj.find_all('table')[0].find('tfoot').find('tr').find_all('td')

headings = []
footings = []

for heading in header_row:
    headings.append(heading.get_text())

for footing in footer_row:
    footings.append(footing.get_text())

body = []

for row in body_rows:
    cells = row.find_all('td')
    row_temp = []
    for i in range(len(cells)):
        row_temp.append(cells[i].get_text())
    body.append(row_temp)

driver.quit()
print(headings)
print(body)

Output

['순위', '팀명', 'AVG', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'TB', 'RBI', 'SAC', 'SF']
[['1', 'KIA', '0.302', '144', '5841', '5142', '906', '1554', '292', '29', '170', '2414', '868', '55', '56'], ['2', '두산', '0.294', '144', '5833', '5102', '849', '1499', '270', '20', '178', '2343', '812', '48', '47'], ['3', 'NC', '0.293', '144', '5790', '5079', '786', '1489', '277', '19', '149', '2251', '739', '62', '48'], ['4', '넥센', '0.290', '144', '5712', '5098', '789', '1479', '267', '30', '141', '2229', '748', '21', '42'], ['5', '한화', '0.287', '144', '5665', '5030', '737', '1445', '261', '16', '150', '2188', '684', '85', '38'], ['6', '롯데', '0.285', '144', '5671', '4994', '743', '1425', '250', '17', '151', '2162', '697', '76', '32'], ['7', 'LG', '0.281', '144', '5614', '4944', '699', '1390', '216', '20', '110', '1976', '663', '76', '55'], ['8', '삼성', '0.279', '144', '5707', '5095', '757', '1419', '255', '36', '145', '2181', '703', '58', '55'], ['9', 'KT', '0.275', '144', '5485', '4937', '655', '1360', '274', '17', '119', '2025', '625', '62', '45'], ['10', 'SK', '0.271', '144', '5564', '4925', '761', '1337', '222', '15', '234', '2291', '733', '57', '41']]

You have to wait for the table to refresh after you click. Also read my comments: sleep is not the best option.

I have edited the code to wait until the selected option's text matches the year, so the code no longer uses sleep.
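The header/body/footer extraction can also be wrapped in a small helper so each year's page source is parsed the same way. A minimal sketch, where the inline HTML is a trimmed, hypothetical stand-in for the real stats table:

```python
from bs4 import BeautifulSoup

def parse_table(page):
    """Parse the first <table> in `page` into (headings, body rows, footer cells)."""
    table = BeautifulSoup(page, "html.parser").find("table")
    headings = [th.get_text(strip=True) for th in table.thead.find_all("th")]
    body = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.tbody.find_all("tr")]
    footings = [td.get_text(strip=True) for td in table.tfoot.find_all("td")]
    return headings, body, footings

# Trimmed, hypothetical stand-in for the real table markup.
sample = """
<table>
  <thead><tr><th>순위</th><th>팀명</th><th>AVG</th></tr></thead>
  <tbody><tr><td>1</td><td>KIA</td><td>0.302</td></tr></tbody>
  <tfoot><tr><td colspan="2">합계</td><td>0.286</td></tr></tfoot>
</table>
"""
headings, body, footings = parse_table(sample)
print(headings)  # ['순위', '팀명', 'AVG']
print(body)      # [['1', 'KIA', '0.302']]
```

With this in place, the loop over years reduces to: click the option, wait for the refresh, then call `parse_table(driver.page_source)`.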
