web-scraping with python 3.6 and beautifulsoup - getting Invalid URL
Question
I want to work with this page in Python: http://www.sothebys.com/en/search-results.html?keyword=degas%27
Here is my code:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.sothebys.com/en/search-results.html?keyword=degas%27')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
I get the following output:
<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.8f4f1502.1494363829.5fae0e0e
</p></body></html>
I can open the page in a browser on the same machine without getting any error message. When I use the same code with a different URL, the correct HTML content is fetched:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchtype=p&action=search&searchFrom=header&lid=1&entry=degas')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
I also tested other URLs (reddit, google, e-commerce sites) and didn't encounter any issues. So the same code works with one URL but not with another. Where is the problem?
Answer
This website blocks requests that do not come from a browser, which is why you get the Invalid URL
error. Adding a custom User-Agent header to the request makes it work:
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent so the site does not reject the request
ua = {"User-Agent": "Mozilla/5.0"}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
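If several such requests are needed, a requests.Session can carry the User-Agent automatically so it does not have to be repeated on every call. A minimal sketch (the header value here is just an example string, not a requirement of the site):

```python
import requests

# A Session applies these default headers to every request it makes.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# The merged default headers can be inspected before any network call:
print(session.headers["User-Agent"])  # Mozilla/5.0
```

Any `session.get(url)` made afterwards will then include the header without passing `headers=` each time.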