Remove 'urllib.error.HTTPError: HTTP Error 302:' from urlReq(url)
Problem description
Hey guys what's up? :)
I'm trying to scrape a website with some url parameters.
If I use url1, url2, or url3, it WORKS properly and prints the regular output I want (html) ->
import bs4
from urllib.request import urlopen as urlReq
from bs4 import BeautifulSoup as soup
# create urls
url1 = 'https://en.titolo.ch/sale'
url2 = 'https://en.titolo.ch/sale?limit=108'
url3 = 'https://en.titolo.ch/sale?category_styles=29838_21212'
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'
# opening up connection on each url, grabbing the page
uClient = urlReq(url4)
page_html = uClient.read()
uClient.close()
# parsing the downloaded html
page_soup = soup(page_html, "html.parser")
# print the html
print(page_soup.body.prettify())
-> BUT when I try url4 (url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'),
it gives me the error below. What am I doing wrong?
- Maybe it has something to do with cookies? -> But then why does it work on the other urls...
- Maybe they are just blocking the scrape attempt?
- How can I avoid this error when using multiple parameters in the URL?
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Temporarily
Thanks for the help in advance! Cheers Alan
What I have already tried: I tried the requests lib
import requests
url = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'
r = requests.get(url)
html = r.text
print(html)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /sale
on this server.</p>
</body></html>
[Finished in 0.375s]
The full error message from the urllib request:
Traceback (most recent call last):
File "C:\Users\jedi\Documents\non\of\your\business\smile\stackoverflow_question", line 12, in <module>
uClient = urlReq(url4)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 563, in error
result = self._call_chain(*args)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 755, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 563, in error
result = self._call_chain(*args)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 755, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 563, in error
result = self._call_chain(*args)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 755, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 563, in error
result = self._call_chain(*args)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 755, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 563, in error
result = self._call_chain(*args)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\jedi\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 745, in http_error_302
self.inf_msg + msg, headers, fp)
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Temporarily
[Finished in 2.82s]
Recommended answer
If you use the requests package and add a User-Agent to the headers, it looks like
all 4 of those links return a 200 response. So try adding a User-Agent header:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
import requests
from bs4 import BeautifulSoup as soup
# create urls
url1 = 'https://en.titolo.ch/sale'
url2 = 'https://en.titolo.ch/sale?limit=108'
url3 = 'https://en.titolo.ch/sale?category_styles=29838_21212'
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
url_list = [url1, url2, url3, url4]
for url in url_list:
    # opening up connection on each url, grabbing the page
    response = requests.get(url, headers=headers)
    print(response.status_code)
Output:
200
200
200
200
So:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
url = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'
r = requests.get(url, headers=headers)
html = r.text
print(html)
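If you would rather stay with urllib, as in the question, the same idea can probably be applied there by passing the User-Agent through a Request object. This is only a minimal sketch: the helper names build_request and fetch_html are made up for illustration, and it has not been checked against the site, so it is an assumption that urllib gets the same 200 response that requests does. It also builds the query string with urlencode, which is a tidy way to combine several parameters.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Same browser-like User-Agent string as in the requests example above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36')


def build_request(base_url, params=None):
    """Return a urllib Request with a User-Agent header and an encoded query string.

    Hypothetical helper, not part of urllib itself.
    """
    url = base_url
    if params:
        # urlencode joins the pairs with '&' and escapes special characters.
        url = base_url + '?' + urlencode(params)
    return Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(base_url, params=None, timeout=10):
    """Download the page body as bytes (performs a real network request)."""
    with urlopen(build_request(base_url, params), timeout=timeout) as response:
        return response.read()


# Usage (needs network access; the result is not guaranteed for this site):
# page_html = fetch_html('https://en.titolo.ch/sale',
#                        {'category_styles': '31066', 'limit': 108})
```

The bytes returned by fetch_html can be handed to soup(page_html, "html.parser") exactly as in the original script.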