How to get a JS-redirected PDF linked from a web page


Question

I am using requests to get web pages, for example as follows.

import requests
from bs4 import BeautifulSoup

url = "http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/provider/CARE/EY298883"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

For each of these pages I would like to get the first PDF that is pointed to in the section titled "Latest reports". How can you do this with Beautiful Soup?

The relevant part of the HTML is

 <tbody>
   <tr>
     <th scope="col">Latest reports</th>
     <th scope="col" class="date">Inspection <br/>date</th>
     <th scope="col" class="date">First<br/>publication<br/>date</th>
   </tr>
   <tr>
     <td><a href="/provider/files/1266031/urn/106428.pdf"><span class="icon pdf">pdf</span> Early years inspection report </a></td>
     <td class="date">12 Mar 2009</td>
     <td class="date">4 Apr 2009</td>
   </tr>
 </tbody>
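For the link extraction itself, a minimal self-contained sketch against the fragment above (a possible approach, not the asker's code; it uses html.parser and an href filter to grab the first PDF link in the table body):

```python
from bs4 import BeautifulSoup

# The table fragment quoted in the question
html = """
<tbody>
  <tr>
    <th scope="col">Latest reports</th>
    <th scope="col" class="date">Inspection <br/>date</th>
    <th scope="col" class="date">First<br/>publication<br/>date</th>
  </tr>
  <tr>
    <td><a href="/provider/files/1266031/urn/106428.pdf"><span class="icon pdf">pdf</span> Early years inspection report </a></td>
    <td class="date">12 Mar 2009</td>
    <td class="date">4 Apr 2009</td>
  </tr>
</tbody>
"""

soup = BeautifulSoup(html, "html.parser")
# Locate the "Latest reports" header cell, then the first PDF link
# anywhere inside the same table body
header = soup.find("th", string="Latest reports")
link = header.find_parent("tbody").find(
    "a", href=lambda h: h and h.endswith(".pdf"))
print(link["href"])  # /provider/files/1266031/urn/106428.pdf
```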


The following code looks like it should work but does not.

ofstedbase = "http://www.ofsted.gov.uk"
for col_header in soup.findAll('th'):
    if not col_header.contents[0] == "Latest reports":
        continue
    for link in col_header.parent.parent.findAll('a'):
        if 'href' in link.attrs and link['href'].endswith('pdf'):
            break
    else:
        # inner loop finished without finding a PDF link
        print('"Latest reports" PDF not found')
        break
    print('"Latest reports" PDF points at', link['href'])
    p = requests.get(ofstedbase + link['href'])
    print(p.content)
    break

The problem is that p contains another web page (the link goes through a JS redirect), not the PDF it should. Is there some way to get the actual PDF?
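One way to diagnose this (a sketch, not part of the original question): a genuine PDF response normally carries an application/pdf Content-Type and its body starts with the "%PDF-" magic bytes, while a JS redirect/interstitial page is served as HTML. The helper name looks_like_pdf is my own:

```python
def looks_like_pdf(content_type, body):
    """Heuristic: True when a response appears to be an actual PDF
    rather than an HTML interstitial/redirect page.

    content_type -- the response's Content-Type header value
    body         -- the (leading) bytes of the response body
    """
    # PDF files begin with the magic bytes b"%PDF-"
    return content_type.startswith("application/pdf") or body[:5] == b"%PDF-"

# With requests this would be used as, e.g.:
# p = requests.get(ofstedbase + link['href'])
# looks_like_pdf(p.headers.get('Content-Type', ''), p.content)
```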


Update:

Got it to work with one more iteration of BeautifulSoup

 import re

 souppage = BeautifulSoup(p.text, "html.parser")
 line = souppage.findAll('a', text=re.compile("requested"))[0]
 pdf = requests.get(ofstedbase + line['href'])

Any better/nicer solutions gratefully received.

Solution

Got it to work with one more iteration of BeautifulSoup

 import re

 souppage = BeautifulSoup(p.text, "html.parser")
 line = souppage.findAll('a', text=re.compile("requested"))[0]
 pdf = requests.get(ofstedbase + line['href'])
