Web Scraping using Python giving HTTP Error 404: Not Found
Problem description
I am brand new to Python and not very good at it. I am trying to scrape a website called Transfermarkt (I'm a big football fan), but it gives me HTTP Error 404 when I try to extract data. Here is my code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
for che in chelsea:
    player = che.tbody.tr.td.table.tbody.tr.td["spielprofil_tooltip tooltipstered"]
    print("player: " +player)
The error says:
Traceback (most recent call last):
  File "C:\Users\x15476582\Desktop\WebScrape.py", line 12, in <module>
    uClient = uReq(my_url)
  File "C:\Python36-32\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python36-32\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Python36-32\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python36-32\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Python36-32\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36-32\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Any help would be greatly appreciated, thanks guys x
Recommended answer
As Rup mentioned above, your user agent may have been rejected by the server. By default, urllib identifies itself as Python-urllib/<version>, and some sites refuse requests that carry that string.
Try augmenting your code with the following:
import urllib.request # we are going to need to generate a Request object
from bs4 import BeautifulSoup as soup
my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"
# here we define the headers for the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0'}
# this request object will integrate your URL and the headers defined above
req = urllib.request.Request(url=my_url, headers=headers)
# calling urlopen this way will automatically handle closing the request
with urllib.request.urlopen(req) as response:
    page_html = response.read()
After the code above you can continue your analysis. The Python docs have some useful pages on this topic:
https://docs.python.org/3/library/urllib.request.html#examples
https://docs.python.org/3/library/urllib.request.html
Mozilla's documentation has a load of user-agent strings to try:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
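If it helps to see what is going on, here is a minimal sketch that wraps the same approach in small helpers. The function names (default_user_agent, build_request, fetch_html) are hypothetical and not part of urllib; they only show which User-Agent urllib sends by default, how to override it, and how to catch the HTTPError instead of letting the script crash. Whether a browser-like string is enough for a given site is an assumption you would need to check yourself.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent taken from the answer above; any current browser string should do.
BROWSER_UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0"


def default_user_agent():
    """Return the User-Agent urllib sends when none is set explicitly."""
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)["User-agent"]


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url=url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Return the response body as bytes, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # err.code holds the status (for example 404), err.reason the accompanying text
        print(f"Server answered {err.code} {err.reason} for {url}")
        return None
```

Calling print(default_user_agent()) shows the Python-urllib string that the server sees in the original code, and page_html = fetch_html(my_url) can then be passed to BeautifulSoup as before. If fetch_html still prints a 404, the URL itself is worth checking in a browser.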