python urllib2 - wait for page to finish loading/redirecting before scraping?

Problem description

I'm learning to make web scrapers and want to scrape TripAdvisor for a personal project, grabbing the HTML with urllib2. However, with the code below, the HTML I get back is not what I expect: the page seems to take a second to redirect (you can verify this by visiting the URL), and instead I get the markup of the page that briefly appears first.

Is there some behavior or parameter to set to make sure the page has completely finished loading/redirecting before getting the website content?

import urllib2
from bs4 import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
print soup.prettify()

Edit: The answer is thorough; in the end, however, what solved my problem was this: https://stackoverflow.com/a/3210737/1157283

Solution

Interesting. The problem isn't a redirect: the page modifies its content using JavaScript, but urllib2 has no JS engine; it just GETs the data. If you disable JavaScript in your browser, you will notice that it loads basically the same content as what urllib2 returns.

import urllib2
from BeautifulSoup import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
# A soup object has no read() method, so write the parsed document out as text.
open('test.html', 'w').write(str(soup))

Opening test.html and loading the page with JavaScript disabled in your browser (easiest in Firefox: Content -> uncheck Enable JavaScript) give identical results.
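To see the same effect without touching the network, here is a minimal sketch. It is written for Python 3 and uses only the standard library's html.parser instead of BeautifulSoup. The page markup, the VisibleTextParser class and the visible_text helper are made up for illustration; this is not TripAdvisor's real HTML.

```python
from html.parser import HTMLParser

# Hypothetical page: the static markup only holds a placeholder, and the
# script would swap in the real results when run inside a browser.
STATIC_HTML = """
<html><body>
<div id="results">Loading...</div>
<script>
document.getElementById("results").textContent = "Hotel A, Hotel B";
</script>
</body></html>
"""


class VisibleTextParser(HTMLParser):
    """Collects text found outside <script> tags. It never runs JavaScript."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.in_script:
            self.chunks.append(text)


def visible_text(html):
    """Return the non-script text chunks of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


# The parser only sees the placeholder; the script body is inert text to it.
print(visible_text(STATIC_HTML))
```

A parser treats the script as plain data, so whatever the script would have written into the page never shows up. That is the situation urllib2 plus BeautifulSoup is in.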

So what can we do? Well, first we should check whether the site offers an API, since scraping tends to be frowned upon: http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available

Travel/hotel APIs? It looks like they might offer them, though with some restrictions.

But if we still need to scrape it with JS enabled, we can use Selenium (http://seleniumhq.org/). It's mainly used for testing, but it's easy to use and has fairly good docs.
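A rough sketch of the "wait until the page is ready" part, in Python 3. It relies only on driver.get() and driver.page_source, which Selenium's WebDriver objects provide. The rendered_html helper, its parameters and the readiness check are hypothetical names of my own, not part of Selenium.

```python
import time


def rendered_html(driver, url, is_ready, timeout=10.0, poll=0.5):
    """Load url in a browser driver and return the HTML once is_ready(html) is true.

    driver   -- any object with get(url) and a page_source attribute,
                such as a Selenium WebDriver
    is_ready -- caller-supplied check (an assumption of this sketch), for
                example: lambda html: "Loading..." not in html
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # page_source reflects the DOM after the scripts run so far.
        html = driver.page_source
        if is_ready(html):
            return html
        time.sleep(poll)
    raise TimeoutError("page at %s was not ready after %.1f seconds" % (url, timeout))
```

With Selenium installed you would pass a real driver, for example driver = webdriver.Firefox() after from selenium import webdriver, and call driver.quit() when done. Selenium also ships its own explicit-wait helper, WebDriverWait, which is probably the better choice in real code; the loop above only shows the idea.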

I also found this question, "Scraping websites with Javascript enabled?", and this: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/

Hope that helps.

As a side note:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> 
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)
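One likely reason for reading into a variable first: a response is a file-like stream, and its body can only be consumed once. A small Python 3 sketch of that behaviour, using io.BytesIO as a stand-in for the response object (the HTML bytes are invented):

```python
import io

# Stand-in for the object returned by urlopen(): a file-like byte stream.
response = io.BytesIO(b"<html><body>Loading...</body></html>")

value = response.read()     # the first read consumes the whole body
leftover = response.read()  # the stream is exhausted, so this is empty

print(len(value))
print(leftover)
```

Keeping the bytes in value lets you both parse them and write them to disk.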
