Scraping PDFs from this website
Question
I am trying to scrape this website with Python 2.7:
http://www.motogp.com/en/Results+Statistics/
I want to scrape the main one, which has many categories (Event): the one that appears next to the blue "MotoGP Race Classification 2017" text.
After that, I want to scrape the other years as well. So far I have:
import re
from bs4 import BeautifulSoup
# Note: urllib.request is the Python 3 module; on Python 2.7 urlopen lives in urllib2.
from urllib.request import urlopen

url = "http://www.motogp.com/en/Results+Statistics/"
r = urlopen(url).read()
soup = BeautifulSoup(r)
type(soup)
# Find the first double-quoted string ending in ".pdf" in the raw page bytes.
match = re.search(b'\"(.*?\.pdf)\"', r)
pdf_url = "http://resources.motogp.com/files/results/2017/ARG/MotoGP/RAC/Classification" + match.group(1).decode('utf8')
The links look like this:
http://resources.motogp.com/files/results/2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c
So I also need to capture the part after the "?" character. The main problem is how to switch from event to event to get all the links in this format.
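As a side note on the "?" suffix: a pattern along the following lines would keep the version query string together with the `.pdf` path. This is only a sketch. The helper name `find_pdf_links` and the sample markup are assumptions for illustration, and the page must be decoded to text before matching (the code above works on bytes).

```python
import re

# A double-quoted URL ending in ".pdf", optionally followed by a "?..." query string.
PDF_LINK = re.compile(r'"([^"]*?\.pdf(?:\?[^"]*)?)"')

def find_pdf_links(html):
    """Return every quoted .pdf URL found in html, query string included."""
    return PDF_LINK.findall(html)

# Hypothetical markup, built around the link format shown above.
sample = ('<a class="padleft5" href="http://resources.motogp.com/files/results/'
          '2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c">PDF</a>')
print(find_pdf_links(sample))
```

This only helps when the link is present in the HTML that was downloaded. If the page fills it in with JavaScript after an event is selected, a plain `urlopen` will not see it.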
Recommended answer
Based on the description you have provided above, this is how you can get those PDF links:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("http://www.motogp.com/en/Results+Statistics/")

# Click each <option> under the element with id "event" in turn.
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#event option"))):
    item.click()
    # Locate the element with class "padleft5" and print its link target.
    elem = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "padleft5")))
    print(elem.get_attribute("href"))
    # Wait for that element to go stale before handling the next option.
    wait.until(EC.staleness_of(elem))

driver.quit()
Partial output:
http://resources.motogp.com/files/results/2017/VAL/MotoGP/RAC/worldstanding.pdf?v1_8dbea75c
http://resources.motogp.com/files/results/2017/QAT/MotoGP/RAC/Classification.pdf?v1_f6564614
http://resources.motogp.com/files/results/2017/ARG/MotoGP/RAC/Classification.pdf?v1_9107e18d
http://resources.motogp.com/files/results/2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c
http://resources.motogp.com/files/results/2017/SPA/MotoGP/RAC/Classification.pdf?v1_ba33b120
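Once the links are collected, each one can be split into its parts, for example to build a distinct local file name per event. The sketch below assumes Python 3 and assumes that every URL follows the `files/results/<year>/<event>/<category>/<session>/<file>.pdf` layout seen in the output above. The helper names `describe_pdf_url`, `local_name` and `download` are hypothetical.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve

def describe_pdf_url(url):
    """Split a results URL into (year, event, category, session, filename)."""
    path = urlsplit(url).path  # the "?v1_..." query string is not part of the path
    parts = path.strip("/").split("/")
    # Assumed layout: .../<year>/<event>/<category>/<session>/<file>.pdf
    year, event, category, session, filename = parts[-5:]
    return year, event, category, session, filename

def local_name(url):
    """Build a flat file name such as 2017_QAT_MotoGP_RAC_Classification.pdf."""
    return "_".join(describe_pdf_url(url))

def download(url, directory="."):
    """Fetch one PDF into directory (needs network access; not called below)."""
    target = os.path.join(directory, local_name(url))
    urlretrieve(url, target)
    return target

urls = [
    "http://resources.motogp.com/files/results/2017/VAL/MotoGP/RAC/worldstanding.pdf?v1_8dbea75c",
    "http://resources.motogp.com/files/results/2017/QAT/MotoGP/RAC/Classification.pdf?v1_f6564614",
]
for u in urls:
    print(local_name(u))
```

Keeping the year and event code in the file name avoids overwriting files, since most events share the name `Classification.pdf`.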