Preventing a "hidden" redirect with urlopen() in Python


Problem description


I am using BeautifulSoup for web scraping, and I am running into a problem with one particular type of website when using urlopen. Every item on the website has its own page, and each item comes in several formats (e.g. 500 mL, 1 L, 2 L, ...).

When I open the URL of a product (www.example.com/product1) in my web browser, I see a picture of the 500 mL format, information about it (price, quantity, flavor, etc.) and a list of all the other formats available for that item. If I click on another format (e.g. 1 L), the picture and the information about the item change, but the URL at the top of my browser stays the same (www.example.com/product1). However, I know from inspecting the HTML code of the page that each format has its own unique URL (500 mL: www.example.com/product1/123; 1 L: www.example.com/product1/456, ...). When I enter the unique URL of the 1 L format in my browser, I am automatically redirected to www.example.com/product1, but the picture and the information displayed on the page correspond to the 1 L format. The HTML code also contains the information that I need about the 1 L format.

My problem arises when I use urlopen to open these unique URLs.

from bs4 import BeautifulSoup
from urllib import urlopen

# The URL needs its scheme ("http://") to be fetched over HTTP.
webpage = urlopen('http://www.example.com/product1/456')
soup = BeautifulSoup(webpage, 'html.parser')
print soup

The information contained in the soup does not correspond to what my web browser displays for the unique URL www.example.com/product1/456. Instead, it gives me the information for the format displayed by default on www.example.com/product1, which is always the 500 mL format.

Is there any way to prevent this redirection, so that I can capture with BeautifulSoup the information contained in the HTML code of the unique URLs?

Solution

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Return the 3xx response to the caller instead of following it.
    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result
    # Treat the other redirect status codes the same way.
    http_error_301 = http_error_303 = http_error_307 = http_error_302

# Passing an instance of the subclass replaces the default redirect handler.
opener = urllib2.build_opener(RedirectHandler())
webpage = opener.open('http://www.example.com/product1/456')
...
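
The code above targets Python 2, where urllib2 exists. On Python 3 the same idea can probably be carried over to urllib.request; the following is only a sketch under that assumption, not part of the original answer. The names NoRedirectHandler and open_without_redirect are hypothetical, chosen here for illustration.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hand back 3xx responses to the caller instead of following them."""

    def http_error_302(self, req, fp, code, msg, headers):
        # Wrap the redirect response; because it is returned rather than
        # raised, opener.open() gives this object back to the caller.
        return urllib.error.HTTPError(req.full_url, code, msg, headers, fp)

    # Apply the same behaviour to the other redirect status codes.
    http_error_301 = http_error_303 = http_error_307 = http_error_308 = http_error_302


def open_without_redirect(url):
    """Hypothetical helper: open `url` with automatic redirects disabled."""
    opener = urllib.request.build_opener(NoRedirectHandler())
    return opener.open(url)
```

Calling open_without_redirect('http://www.example.com/product1/456') should then return the redirect response itself; its code attribute holds the status and headers.get('Location') holds the redirect target. Whether the body of that response contains the format-specific HTML depends on the site, so it is worth checking before passing it to BeautifulSoup.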
