How to get a JS redirected pdf linked from a web page
Problem description
I am using `requests` to get web pages, for example as follows.

```python
import requests
from bs4 import BeautifulSoup

url = "http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/provider/CARE/EY298883"
r = requests.get(url)
soup = BeautifulSoup(r.text)
```
For each of these pages I would like to get the first pdf linked in the section titled "Latest reports". How can you do this with Beautiful Soup?
The relevant part of the HTML is

```html
<tbody>
  <tr>
    <th scope="col">Latest reports</th>
    <th scope="col" class="date">Inspection <br/>date</th>
    <th scope="col" class="date">First<br/>publication<br/>date</th>
  </tr>
  <tr>
    <td><a href="/provider/files/1266031/urn/106428.pdf"><span class="icon pdf">pdf</span> Early years inspection report </a></td>
    <td class="date">12 Mar 2009</td>
    <td class="date">4 Apr 2009</td>
  </tr>
</tbody>
```
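As an aside, the first `.pdf` href in a fragment like this can be pulled out even without BeautifulSoup, using only the standard library's `html.parser` — a minimal sketch (the class name and the inlined sample fragment are illustrative, not from the original post):

```python
from html.parser import HTMLParser

class FirstPdfLink(HTMLParser):
    """Remember the first <a href="..."> whose target ends in .pdf."""
    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == 'a' and self.href is None:
            href = dict(attrs).get('href', '')
            if href.endswith('.pdf'):
                self.href = href

parser = FirstPdfLink()
parser.feed('<td><a href="/provider/files/1266031/urn/106428.pdf">'
            '<span class="icon pdf">pdf</span> Early years inspection report</a></td>')
print(parser.href)  # -> /provider/files/1266031/urn/106428.pdf
```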
The following code looks like it should work but does not.
```python
ofstedbase = "http://www.ofsted.gov.uk"
for col_header in soup.findAll('th'):
    if not col_header.contents[0] == "Latest reports":
        continue
    for link in col_header.parent.parent.findAll('a'):
        if 'href' in link.attrs and link['href'].endswith('pdf'):
            break
    else:
        print '"Latest reports" PDF not found'
        break
    print '"Latest reports" PDF points at', link['href']
    p = requests.get(ofstedbase + link['href'])
    print p.content
    break
```
The problem is that `p` contains another web page rather than the pdf it should. Is there some way to get the actual pdf?
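One way to confirm what came back is to inspect the response body itself: a genuine PDF begins with the `%PDF-` magic bytes, while an HTML interstitial page does not. A small sketch (the helper name is mine, not from the post):

```python
def looks_like_pdf(content, content_type=""):
    """Heuristic check: a real PDF body starts with the %PDF- magic
    bytes; a Content-Type of application/pdf is also a strong signal."""
    return content[:5] == b'%PDF-' or 'application/pdf' in content_type

# An HTML redirect/interstitial page fails the check; a real PDF passes.
print(looks_like_pdf(b'<html><head>...</head></html>'))  # -> False
print(looks_like_pdf(b'%PDF-1.4\n...'))                  # -> True
```

In the question's loop this could be applied to `p.content` (and to `p.headers.get('content-type', '')`) to detect that the first request returned an HTML page instead of the pdf.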
Update:
Got it to work with one more iteration of BeautifulSoup (note this needs `import re`):

```python
import re

souppage = BeautifulSoup(p.text)
line = souppage.findAll('a', text=re.compile("requested"))[0]
pdf = requests.get(ofstedbase + line['href'])
```
Any better/nicer solutions gratefully received.
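One small robustness improvement, offered as a suggestion rather than something the post covers: build the follow-up URL with `urljoin` instead of plain string concatenation, so that root-relative and already-absolute hrefs are both handled. Sketch using Python 3's `urllib.parse` (on the Python 2 used in the question it would be `urlparse.urljoin`):

```python
from urllib.parse import urljoin

ofstedbase = "http://www.ofsted.gov.uk"

# Root-relative href, as in the scraped table:
print(urljoin(ofstedbase, "/provider/files/1266031/urn/106428.pdf"))
# -> http://www.ofsted.gov.uk/provider/files/1266031/urn/106428.pdf

# Already-absolute hrefs pass through unchanged, where plain
# concatenation would produce a broken URL:
print(urljoin(ofstedbase, "http://example.org/report.pdf"))
# -> http://example.org/report.pdf
```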
Solution

Got it to work with one more iteration of BeautifulSoup:

```python
import re

souppage = BeautifulSoup(p.text)
line = souppage.findAll('a', text=re.compile("requested"))[0]
pdf = requests.get(ofstedbase + line['href'])
```