Preventing a "hidden" redirect with urlopen() in Python

Views: 178  Published: 2016/8/5 19:05:10  Tags: python beautifulsoup urllib urlopen


Problem description

I am using BeautifulSoup for web scraping, and I am having problems with one particular type of website when using urlopen. Every item on the website has its own unique page, and each item comes in different formats (e.g. 500 mL, 1 L, 2 L, ...).

When I open the URL of the product (www.example.com/product1) in my browser, I see a picture of the 500 mL format, information about it (price, quantity, flavor, etc.) and a list of all the other formats available for this item. If I click on another format (e.g. 1 L), the picture and the information about the item change, but the URL at the top of my browser stays the same (www.example.com/product1). However, I know from inspecting the HTML code of the page that all the formats have their own unique URLs (500 mL: www.example.com/product1/123; 1 L: www.example.com/product1/456, ...). When I enter the unique URL of the 1 L format in my browser, I am automatically redirected to www.example.com/product1, but the picture and the information displayed on the page correspond to the 1 L format. The HTML code also contains the information that I need about the 1 L format.

My problem arises when I use urlopen to open these unique URLs.

from bs4 import BeautifulSoup
from urllib import urlopen

# Python 2: fetch the page for the 1 L format and parse it
webpage = urlopen('http://www.example.com/product1/456')
soup = BeautifulSoup(webpage, 'html.parser')
print soup

The information contained in the soup does not correspond to what my browser displays for the unique URL www.example.com/product1/456. Instead, it gives me the information for the format displayed by default on www.example.com/product1, which is always the 500 mL format.

Is there any way to prevent this redirection, so that I can capture with BeautifulSoup the information contained in the HTML code of the unique URLs?

Solution

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Python 2: hand back the 3xx response itself instead of following it
    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result
    # Treat the other redirect status codes the same way
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(RedirectHandler())
# webpage is the response for the unique URL, not for the redirect target
webpage = opener.open('http://www.example.com/product1/456')
...
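
The code above targets Python 2 (urllib2). For readers on Python 3, the following is a minimal sketch of the same idea using urllib.request. It is not the answer's exact method: instead of overriding http_error_302, it overrides redirect_request to return None, which makes urllib raise the 3xx response as an HTTPError that can then be read like a normal response. The names NoRedirectHandler and fetch_without_redirect are hypothetical, chosen only for illustration.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that declines to follow any 3xx response."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is built; urllib then
        # falls back to its default error handling and raises HTTPError.
        return None


def fetch_without_redirect(url):
    """Return (status, headers, body) for url without following redirects."""
    opener = urllib.request.build_opener(NoRedirectHandler())
    try:
        with opener.open(url) as response:
            return response.status, response.headers, response.read()
    except urllib.error.HTTPError as err:
        # A 3xx (or any other error status) ends up here; the exception
        # still carries the status code, the headers, and the body.
        return err.code, err.headers, err.read()
```

Assuming the site behaves as described in the question, calling fetch_without_redirect('http://www.example.com/product1/456') would return the redirect status together with whatever body the server sent for that URL, and the body could then be passed to BeautifulSoup. Whether that body actually contains the 1 L details depends on the server.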
