Scraping PDFs from this website


Question

I am trying to scrape this website with Python 2.7:

http://www.motogp.com/en/Results+Statistics/

I want to scrape the main one, which has many categories (Event): the one that appears next to the blue "MotoGP Race Classification 2017" text.

After that, I want to scrape the other years as well. So far I have:

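import re
from bs4 import BeautifulSoup
# urllib.request is the Python 3 module; on Python 2.7 urlopen lives in urllib2
from urllib.request import urlopen

url = "http://www.motogp.com/en/Results+Statistics/"
r = urlopen(url).read()  # raw page bytes
soup = BeautifulSoup(r, "html.parser")  # parsed, but not used below

# Look for the first quoted string ending in .pdf in the raw bytes
match = re.search(br'"(.*?\.pdf)"', r)
# Attempt to build the link by appending the matched text to a fixed prefix
pdf_url = "http://resources.motogp.com/files/results/2017/ARG/MotoGP/RAC/Classification" + match.group(1).decode('utf8')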

The links look like this:

http://resources.motogp.com/files/results/2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c
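
To make the structure of such a link easier to see, here is a minimal sketch that splits it into its parts with the standard library. The field names (year, event, category, session) are an assumption inferred from this one example URL, not something the site documents, and `parse_result_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def parse_result_url(url):
    """Split a results PDF URL into its path components and version token."""
    parts = urlsplit(url)
    # Assumed path layout: /files/results/<year>/<event>/<category>/<session>/<file>
    segments = parts.path.strip("/").split("/")
    year, event, category, session, filename = segments[-5:]
    return {
        "year": year,
        "event": event,
        "category": category,
        "session": session,
        "filename": filename,
        "version": parts.query,  # the text after "?"
    }


info = parse_result_url(
    "http://resources.motogp.com/files/results/2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c"
)
print(info)
```

The token after "?" differs from event to event in the links shown on this page, so it probably cannot be guessed and has to be read from the page itself.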

So I should also include the part after the "?" character. The main problem is how to switch from event to event to get all the links in this format.
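
As a rough illustration of capturing the "?" suffix together with the link, the sketch below runs a regular expression over a made-up HTML fragment. The fragment is hypothetical; whether the site's static HTML actually contains these links (rather than loading them with JavaScript) is not confirmed here.

```python
import re

# Hypothetical HTML fragment, made up for illustration
sample_html = (
    '<a class="padleft5" href="http://resources.motogp.com/files/results/2017/AME/'
    'MotoGP/RAC/Classification.pdf?v1_ef0b514c">PDF</a>'
    '<a href="/en/news">News</a>'
)

# [^"]* keeps the match inside a single quoted attribute value;
# (?:\?[^"]*)? additionally captures an optional "?..." suffix
PDF_LINK = re.compile(r'href="([^"]*\.pdf(?:\?[^"]*)?)"')

links = PDF_LINK.findall(sample_html)
print(links)
```

Using `findall` instead of `search` returns every matching link rather than only the first one.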

Recommended answer

Based on the description you have provided above, this is how you can get those PDF links:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)  # wait up to 10 seconds for each condition
driver.get("http://www.motogp.com/en/Results+Statistics/")

# Click every entry of the Event dropdown in turn
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#event option"))):
    item.click()
    # Locate the element with class "padleft5", which carries the PDF link
    elem = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "padleft5")))
    print(elem.get_attribute("href"))
    # Wait for that link element to go stale (replaced in the DOM) before moving on
    wait.until(EC.staleness_of(elem))

driver.quit()

Partial output:

http://resources.motogp.com/files/results/2017/VAL/MotoGP/RAC/worldstanding.pdf?v1_8dbea75c
http://resources.motogp.com/files/results/2017/QAT/MotoGP/RAC/Classification.pdf?v1_f6564614
http://resources.motogp.com/files/results/2017/ARG/MotoGP/RAC/Classification.pdf?v1_9107e18d
http://resources.motogp.com/files/results/2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c
http://resources.motogp.com/files/results/2017/SPA/MotoGP/RAC/Classification.pdf?v1_ba33b120
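
Once links like these are collected, saving the files could look roughly like the sketch below. It assumes Python 3; `local_name`, `download_all`, and the `pdfs` folder are hypothetical names chosen for illustration, and the download itself depends on the links still being reachable.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def local_name(url):
    """Build a flat file name from the last five path segments of the URL."""
    segments = urlsplit(url).path.strip("/").split("/")
    return "_".join(segments[-5:])


def download_all(urls, target_dir="pdfs"):
    """Download each PDF into target_dir and return the saved paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(target_dir, local_name(url))
        urlretrieve(url, path)  # network call; fails if the link is unreachable
        saved.append(path)
    return saved


example = "http://resources.motogp.com/files/results/2017/QAT/MotoGP/RAC/Classification.pdf?v1_f6564614"
print(local_name(example))
```

Joining the path segments into the file name keeps files from different events apart, since most of them are otherwise all called Classification.pdf.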
