Web scraping with Python 3.6 and BeautifulSoup: getting "Invalid URL"

Question

I want to work with this page in Python: http://www.sothebys.com/en/search-results.html?keyword=degas%27

This is my code:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.sothebys.com/en/search-results.html?keyword=degas%27')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

I get the following output:

<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.8f4f1502.1494363829.5fae0e0e
</p></body></html>

I can open the page with my browser from the same machine and don't get any error message. When I use the same code with another URL the correct HTML content is fetched:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchtype=p&action=search&searchFrom=header&lid=1&entry=degas')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

I also tested other URLs (Reddit, Google, e-commerce sites) and didn't encounter any issues. So the same code works with one URL but not with another. Where is the problem?

Recommended answer

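This website appears to block requests that do not look like they come from a browser, which is why you get the Invalid URL page. By default, requests identifies itself with a User-Agent of the form python-requests/<version>. Sending a browser-like User-Agent header with the request works fine: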

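import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent; replaces the default python-requests/<version> value
ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"

# Pass the custom header along with the GET request
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
print(soup)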
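If you fetch many pages this way, it may help to detect the error page explicitly instead of parsing it as if it were real content. The following is only a sketch under some assumptions: `is_block_page` and `fetch_html` are hypothetical helper names (not part of requests or BeautifulSoup), the check relies on the error page having the title "Invalid URL" as in the output shown in the question, and the 10-second timeout is an arbitrary choice. The title check uses the standard-library `html.parser` rather than BeautifulSoup so the helper has no extra dependencies.

```python
from html.parser import HTMLParser

# Same header value as in the answer above.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0"}


class _TitleParser(HTMLParser):
    """Collects the text found inside any <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def is_block_page(html):
    """Return True if the HTML looks like the 'Invalid URL' error page.

    Assumption: the error page is recognised only by its <title> text.
    """
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() == "Invalid URL"


def fetch_html(url, headers=BROWSER_HEADERS, timeout=10):
    """Fetch url with browser-like headers; raise if the error page comes back."""
    import requests  # imported here so is_block_page works without requests

    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx status codes
    if is_block_page(response.text):
        raise RuntimeError("Server returned the 'Invalid URL' page for " + url)
    return response.text
```

With these helpers, the last lines of the answer would become something like `soup = BeautifulSoup(fetch_html(url), "lxml")`, and a blocked request fails loudly instead of silently producing the error page.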
